The NKJP_1M corpus
The NKJP_1M corpus is a manually annotated one-million-word subcorpus of the National Corpus of Polish (NKJP – Narodowy Korpus Języka Polskiego), composed of various text samples (see below). It is a corpus of contemporary Polish containing texts published after 1945, and it covers written, spoken and web communication. The corpus features lemmatisation, morphological annotation, and representative coverage of text categories.
| Name | NKJP_1M | |
|---|---|---|
| Positions | Number of positions (tokens) | 1,215,513 |
| | Number of positions (excl. punctuation) | 992,014 |
| | Number of word forms | 143,477 |
| | Number of lemmas | 54,174 |
| Structures | Number of documents <doc> | 3,889 |
| | Number of paragraphs <p> | 18,484 |
| | Number of sentences <s> | 85,663 |
| Further information | Reference corpus | YES |
| | Representative corpus | YES |
| | Publication year | 2018 |
Text classification
The classification of NKJP_1M texts combines traditional criteria with a categorisation based on topic and genre. Genre categories (or types, in Polish corpus terminology) are often integrated with medium categories (or communication channels, in Polish corpus terminology), but formally, these two types of categorisation remain separate.
| Communication layer | doc.genre | Category | Proportion |
|---|---|---|---|
| written | #typ_publ | journalism | 48.85% |
| | #typ_lit | fiction | 17.04% |
| | #typ_fakt | non-fiction | 5.34% |
| | #typ_inf-por | informative texts | 5.62% |
| | #typ_urzed | legal texts | 2.97% |
| | #typ_nd | popular science texts | 1.91% |
| | #typ_nklas | non-fiction unclassified book | 1.00% |
| | #typ_listy | correspondence | 0.04% |
| | #typ_lit_poezja | poetry | 0.01% |
| spoken | #typ_qmow | quasi-spoken texts | 2.50% |
| | #typ_media | spoken media text | 2.07% |
| | #typ_konwers | spoken conversational texts | 5.57% |
| web | #typ_net_interakt | interaction-based Internet texts | 5.18% |
| | #typ_net_nieinterakt | non-interaction-based Internet texts | 1.91% |
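
To see how the genre proportions add up per communication layer, the figures from the table above can be aggregated with a few lines of Python. This is only an illustrative sketch: the names `PROPORTIONS` and `share_by_layer` are made up for this example and are not part of any corpus tool.

```python
from collections import defaultdict

# (communication layer, doc.genre value, proportion in %) copied from the table above
PROPORTIONS = [
    ("written", "#typ_publ", 48.85),
    ("written", "#typ_lit", 17.04),
    ("written", "#typ_fakt", 5.34),
    ("written", "#typ_inf-por", 5.62),
    ("written", "#typ_urzed", 2.97),
    ("written", "#typ_nd", 1.91),
    ("written", "#typ_nklas", 1.00),
    ("written", "#typ_listy", 0.04),
    ("written", "#typ_lit_poezja", 0.01),
    ("spoken", "#typ_qmow", 2.50),
    ("spoken", "#typ_media", 2.07),
    ("spoken", "#typ_konwers", 5.57),
    ("web", "#typ_net_interakt", 5.18),
    ("web", "#typ_net_nieinterakt", 1.91),
]


def share_by_layer(rows):
    """Sum the genre proportions within each communication layer."""
    totals = defaultdict(float)
    for layer, _genre, pct in rows:
        totals[layer] += pct
    # Round to two decimals, matching the precision of the table.
    return {layer: round(total, 2) for layer, total in totals.items()}


if __name__ == "__main__":
    for layer, pct in share_by_layer(PROPORTIONS).items():
        print(f"{layer}: {pct:.2f}%")
```

With the values above, written texts account for 82.78%, spoken texts for 10.14% and web texts for 7.09%; the total comes to 100.01% because the individual figures are rounded.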
Positional annotation and tagging
Compared to typical corpora of Czech, NKJP_1M has an additional positional attribute that is specific to Polish, the so-called flexeme. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (subst), depreciative nouns (depr) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (adj), the first part of compound adjectives (adja, e.g. biało-czerwony, sportowo-rekreacyjny), post-prepositional adjectives (adjp, e.g. po polsku, od dawna), and predicative adjectives (adjc, e.g. jestem pewien, był wesół i zdrów); and there is a particularly fine-grained subcategorisation of verbs (more than 10 different flexemes).
Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme list) is available here.
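
The relationship between flexemes and parts of speech described above can be pictured as a simple lookup table. The sketch below covers only the flexemes named in this section; the coarse labels "noun" and "adjective" and the names `FLEXEME_TO_POS` and `flexemes_of` are assumptions made for illustration, not part of the NKJP tagset or of any corpus tool.

```python
# Flexeme tags mentioned in this section, grouped under an illustrative
# part-of-speech label. Verb flexemes are omitted because the section
# does not list them.
FLEXEME_TO_POS = {
    "subst": "noun",      # regular noun
    "depr": "noun",       # depreciative noun
    "adj": "adjective",   # regular adjective
    "adja": "adjective",  # first part of a compound adjective
    "adjp": "adjective",  # post-prepositional adjective
    "adjc": "adjective",  # predicative adjective
}


def flexemes_of(pos):
    """Return the flexeme tags grouped under the given part-of-speech label."""
    return sorted(tag for tag, label in FLEXEME_TO_POS.items() if label == pos)


if __name__ == "__main__":
    print(flexemes_of("noun"))
    print(flexemes_of("adjective"))
```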
In addition, two positional attributes were added to the original corpus: `lc` and `lemma_lc`, which make it possible to search the corpus in a case-insensitive manner.
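
The idea behind these two attributes can be illustrated with a short Python sketch. It assumes that `lc` holds the lowercased word form and `lemma_lc` the lowercased lemma; the `Position` class, the `find_by_lc` helper and the sample data are hypothetical and do not reflect how the corpus is actually stored or queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One corpus position with its word form and lemma (illustrative only)."""
    word: str
    lemma: str

    @property
    def lc(self) -> str:
        # Assumed definition: lowercased word form.
        return self.word.lower()

    @property
    def lemma_lc(self) -> str:
        # Assumed definition: lowercased lemma.
        return self.lemma.lower()


def find_by_lc(positions, query):
    """Return all positions whose lowercased word form equals the lowercased query."""
    needle = query.lower()
    return [p for p in positions if p.lc == needle]


if __name__ == "__main__":
    # Made-up sample positions, not taken from the corpus.
    sample = [
        Position("Polska", "Polska"),
        Position("polska", "polski"),
        Position("POLSKA", "Polska"),
        Position("Warszawa", "Warszawa"),
    ]
    for hit in find_by_lc(sample, "polska"):
        print(hit.word, hit.lemma_lc)
```

Matching on the lowercased copy means that a single query retrieves sentence-initial, all-caps and regular spellings alike, while the original `word` and `lemma` values remain available for case-sensitive searches.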
How to cite NKJP_1M
Przepiórkowski, A. – Degórski, Ł. – Murzynowski, G. – Szałkiewicz, Ł. – Czelakowska, A. – Savary, A. – Głowińska, K.: NKJP_1M: ręcznie znakowany milionowy podkorpus NKJP. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: http://www.korpus.cz
Degórski, Ł. – Przepiórkowski, A. (2012): Ręcznie znakowany milionowy podkorpus NKJP. In: A. Przepiórkowski – M. Bańko – R. L. Górski – B. Lewandowska-Tomaszczyk (eds), Narodowy Korpus Języka Polskiego, pp. 51–58. Warszawa: Wydawnictwo Naukowe PWN. ISBN 978-83-01-16700-4.
– Adrian Zasina, Michal Škrabal