~~NOTOC~~
====== The NKJP_1M corpus ======

The NKJP_1M corpus is a manually annotated one-million-word subcorpus of the [[http://nkjp.pl| National Corpus of Polish]] (NKJP – //Narodowy Korpus Języka Polskiego//), composed of various text samples (see below). It is a corpus of contemporary Polish containing texts published after 1945 and covering written, spoken and web communication. The corpus features lemmatisation, morphological annotation, and representative coverage of text categories.

<WRAP right 35%>
^ <fs medium>Name</fs> ^^ <fs medium>NKJP_1M</fs> ^
^ Positions ^ Number of positions (tokens) |  1,215,513 |  
^ ::: ^ Number of positions (excl. punctuation) |  992,014 |  
^ ::: ^ Number of word forms |  143,477 |  
^ ::: ^ Number of lemmas |  54,174 |
^ Structures ^ Number of documents <doc> |  3,889 |
^ ::: ^ Number of paragraphs <p> |  18,484 |
^ ::: ^ Number of sentences <s> |  85,663 |
^ Further information ^ Reference corpus |  YES |  
^ ::: ^ Representative corpus |  YES |
^ ::: ^ Publication year |  2018 |
</WRAP>

===== Text classification =====

The classification of NKJP_1M texts combines traditional criteria with a categorisation based on topic and genre. Genre categories (or //types//, in Polish corpus terminology) are often integrated with medium categories (or //communication channels//, in Polish corpus terminology), but formally, these two types of categorisation remain separate.

^Communication layer^ doc.genre ^ Category ^ Proportion ^
| written | #typ_publ | journalism |  48.85%|
| ::: | #typ_lit | fiction |  17.04%|
| ::: | #typ_fakt | non-fiction |  5.34%|
| ::: | #typ_inf-por | informative texts |  5.62%|
| ::: | #typ_urzed | legal texts |  2.97%|
| ::: | #typ_nd | popular science texts |  1.91%|
| ::: | #typ_nklas | unclassified non-fiction books |  1.00%|
| ::: | #typ_listy | correspondence|  0.04%|
| ::: | #typ_lit_poezja | poetry |  0.01%|
| spoken | #typ_qmow | quasi-spoken texts |  2.50%|
| ::: | #typ_media | spoken media text |  2.07%|
| ::: | #typ_konwers | spoken conversational texts |  5.57%|
| web | #typ_net_interakt | interaction-based Internet texts |  5.18%|
| ::: | #typ_net_nieinterakt | non-interaction-based Internet texts |  1.91%|
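
The proportions above can be aggregated per communication layer. The following is only an illustrative sketch: the figures are copied from the table, while the names ''CATEGORIES'' and ''layer_totals'' are hypothetical and not part of any corpus tool. Proportions are stored as integer hundredths of a percent to avoid floating-point drift.

```python
from collections import defaultdict

# (communication layer, doc.genre, proportion in hundredths of a percent),
# copied from the table above; the variable name is a hypothetical choice.
CATEGORIES = [
    ("written", "#typ_publ", 4885),
    ("written", "#typ_lit", 1704),
    ("written", "#typ_fakt", 534),
    ("written", "#typ_inf-por", 562),
    ("written", "#typ_urzed", 297),
    ("written", "#typ_nd", 191),
    ("written", "#typ_nklas", 100),
    ("written", "#typ_listy", 4),
    ("written", "#typ_lit_poezja", 1),
    ("spoken", "#typ_qmow", 250),
    ("spoken", "#typ_media", 207),
    ("spoken", "#typ_konwers", 557),
    ("web", "#typ_net_interakt", 518),
    ("web", "#typ_net_nieinterakt", 191),
]


def layer_totals(categories):
    """Sum the proportions (in hundredths of a percent) per communication layer."""
    totals = defaultdict(int)
    for layer, _genre, hundredths in categories:
        totals[layer] += hundredths
    return dict(totals)


if __name__ == "__main__":
    totals = layer_totals(CATEGORIES)
    for layer, hundredths in totals.items():
        print(f"{layer}: {hundredths / 100:.2f}%")
    print(f"total: {sum(totals.values()) / 100:.2f}%")
```

With the figures as listed, the written layer accounts for 82.78%, the spoken layer for 10.14% and the web layer for 7.09%; the total of 100.01% reflects rounding in the individual rows.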

===== Positional annotation and tagging =====

Compared to typical corpora of Czech, NKJP_1M has an additional positional attribute that is specific to Polish, the so-called **flexeme**. This category further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups. Flexemes also distinguish between regular adjectives (//adj//), the first part of compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//). Verbs are subcategorised in particularly fine detail (more than 10 different flexemes).

Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme list) is available [[http://nkjp.pl/poliqarp/help/ense2.html|here]].

In addition, two positional attributes were added to the original corpus: ''lc'' and ''lemma_lc'', which allow the corpus to be searched in a case-insensitive manner.
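
The idea behind such attributes can be sketched as follows. This is a minimal illustration, assuming that ''lc'' holds the lowercased word form and ''lemma_lc'' the lowercased lemma; the ''Position'' type, the helper functions and the three sample tokens are hypothetical and are not taken from the corpus. If the corpus is queried through a CQL interface, the equivalent query would presumably look like ''[lc="polska"]''.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """One corpus position with its original and lowercased attributes."""
    word: str
    lemma: str
    lc: str        # assumed: lowercased word form
    lemma_lc: str  # assumed: lowercased lemma


def make_position(word: str, lemma: str) -> Position:
    """Derive the two lowercased attributes from word and lemma."""
    return Position(word, lemma, word.lower(), lemma.lower())


def find(positions, attribute: str, value: str):
    """Return positions whose given attribute equals the lowercased value."""
    wanted = value.lower()
    return [p for p in positions if getattr(p, attribute) == wanted]


if __name__ == "__main__":
    # Invented sample tokens, for illustration only.
    sample = [
        make_position("Polska", "Polska"),
        make_position("polska", "polski"),
        make_position("POLSKA", "Polska"),
    ]
    # Matching on lc ignores capitalisation of the word form.
    for hit in find(sample, "lc", "Polska"):
        print(hit.word, hit.lemma)
```

Searching on ''word'' would match only the exact spelling, whereas searching on ''lc'' returns all capitalisation variants of the same form; ''lemma_lc'' does the same at the level of lemmas.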

===== How to cite NKJP_1M =====

<WRAP round tip 70%>
Przepiórkowski, A. – Degórski, Ł. – Murzynowski, G. – Szałkiewicz, Ł. – Czelakowska, A. – Savary, A. – Głowińska, K.: //NKJP_1M: ręcznie znakowany milionowy podkorpus NKJP//. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: http://www.korpus.cz

Degórski, Ł. – Przepiórkowski, A. (2012): Ręcznie znakowany milionowy podkorpus NKJP. In: A. Przepiórkowski – M. Bańko – R. L. Górski – B. Lewandowska-Tomaszczyk (eds), //[[http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf|Narodowy Korpus Języka Polskiego]]//, pp. 51–58. Warszawa: Wydawnictwo Naukowe PWN. ISBN 978-83-01-16700-4.
</WRAP>

//-- Adrian Zasina, Michal Škrabal//