Corpus NKJP_1M
NKJP_1M is a manually annotated one-million-word subcorpus of the National Corpus of Polish (NKJP – Narodowy Korpus Języka Polskiego), compiled from various text samples (see below). It is a corpus of contemporary Polish containing texts published after 1945, and it covers written, spoken and web communication. The corpus is lemmatised and morphologically annotated, and it is representative with respect to its text categorisation.
Name | NKJP_1M | |
---|---|---|
Positions | Number of positions (tokens) | 1 215 513 |
Positions | Number of positions (excl. punctuation) | 992 014 |
Positions | Number of word forms | 143 477 |
Positions | Number of lemmas | 54 174 |
Structures | Number of documents <doc> | 3 889 |
Structures | Number of paragraphs <p> | 18 484 |
Structures | Number of sentences <s> | 85 663 |
Further information | Reference corpus | yes |
Further information | Representative corpus | yes |
Further information | Publication year | 2018 |
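A few derived figures follow directly from the counts in the table. The short Python sketch below only does arithmetic on the published numbers; the variable names are illustrative, and the averages are computed over positions excluding punctuation, which is an assumption about what is most useful to report.

```python
# Counts copied from the table above.
tokens_total = 1_215_513     # positions, punctuation included
tokens_no_punct = 992_014    # positions, punctuation excluded
word_forms = 143_477
lemmas = 54_174
documents = 3_889
sentences = 85_663

# Punctuation positions are the difference between the two token counts.
punctuation = tokens_total - tokens_no_punct
punct_share = punctuation / tokens_total

# Average lengths, measured in non-punctuation positions (an assumption).
avg_sentence_len = tokens_no_punct / sentences
avg_document_len = tokens_no_punct / documents

# Average number of distinct word forms per lemma.
forms_per_lemma = word_forms / lemmas

print(f"punctuation positions: {punctuation}")
print(f"share of punctuation:  {punct_share:.1%}")
print(f"avg sentence length:   {avg_sentence_len:.2f}")
print(f"avg document length:   {avg_document_len:.2f}")
print(f"word forms per lemma:  {forms_per_lemma:.2f}")
```

The same pattern extends to paragraphs or any other ratio of interest.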
Text classification
Text classification in NKJP_1M combines traditional and thematic-genre text categorisation. Categorisation by genre (called type in Polish corpus terminology) is often merged with categorisation by medium (called communication channel in Polish corpus terminology); nevertheless, these two types of categorisation are separate.
Communication layer | doc.genre | Category | Proportion |
---|---|---|---|
written | #typ_publ | journalism | 48.85 % |
written | #typ_lit | fiction | 17.04 % |
written | #typ_fakt | non-fiction | 5.34 % |
written | #typ_inf-por | informative type | 5.62 % |
written | #typ_urzed | legal texts | 2.97 % |
written | #typ_nd | scientific and teaching texts | 1.91 % |
written | #typ_nklas | non-fiction unclassified book | 1.00 % |
written | #typ_listy | correspondence | 0.04 % |
written | #typ_lit_poezja | poetry | 0.01 % |
spoken | #typ_qmow | quasi-spoken texts | 2.50 % |
spoken | #typ_media | spoken media text | 2.07 % |
spoken | #typ_konwers | spoken conversational texts | 5.57 % |
web | #typ_net_interakt | dynamic Internet texts | 5.18 % |
web | #typ_net_nieinterakt | static Internet texts | 1.91 % |
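The genre proportions can be aggregated per communication layer. The Python sketch below copies the values from the table and sums them by layer; the names GENRES and share_by_layer are illustrative only. Because the published percentages are rounded, the grand total need not come to exactly 100 %.

```python
from collections import defaultdict

# (communication layer, doc.genre value, proportion in %) from the table above.
GENRES = [
    ("written", "#typ_publ", 48.85),
    ("written", "#typ_lit", 17.04),
    ("written", "#typ_fakt", 5.34),
    ("written", "#typ_inf-por", 5.62),
    ("written", "#typ_urzed", 2.97),
    ("written", "#typ_nd", 1.91),
    ("written", "#typ_nklas", 1.00),
    ("written", "#typ_listy", 0.04),
    ("written", "#typ_lit_poezja", 0.01),
    ("spoken", "#typ_qmow", 2.50),
    ("spoken", "#typ_media", 2.07),
    ("spoken", "#typ_konwers", 5.57),
    ("web", "#typ_net_interakt", 5.18),
    ("web", "#typ_net_nieinterakt", 1.91),
]

def share_by_layer(rows):
    """Sum the genre proportions within each communication layer."""
    totals = defaultdict(float)
    for layer, _genre, share in rows:
        totals[layer] += share
    # Round to two decimals to hide floating-point noise.
    return {layer: round(total, 2) for layer, total in totals.items()}

layer_totals = share_by_layer(GENRES)
for layer, total in layer_totals.items():
    print(f"{layer:8s} {total:6.2f} %")
```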
Positional annotation and tagging
Compared to the Czech corpora, NKJP_1M has an additional positional attribute that is specific to Polish, the so-called flexeme. It is a category based on parts of speech, which are further divided into more specific lexeme classes. For example, among nouns, depreciative nouns (depr) are distinguished from other nouns (subst), and besides common adjectives (adj) there are ad-adjectival adjectives (adja, e.g. biało-czerwony, sportowo-rekreacyjny), post-prepositional adjectives (adjp, e.g. po polsku, od dawna) and predicative adjectives (adjc, e.g. jestem pewien, był wesół i zdrów). The distinctions among verbal categories are particularly fine-grained (more than 10 different flexemes).
Moreover, the Polish tagset differs from the Czech one; its detailed description (including the list of all flexemes) is available here.
In addition, two positional attributes were added to the original corpus: lc and lemma_lc, which make it possible to search the corpus regardless of letter case.
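How lowercased attributes enable case-insensitive lookup can be sketched in Python. The sketch assumes, judging by the attribute names, that lc holds the lowercased word form and lemma_lc the lowercased lemma. The Token class, the word and lemma field names, the find helper and the sample tokens are all hypothetical; they illustrate the idea and are not part of any corpus tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str    # surface form as it appears in the text (hypothetical field)
    lemma: str   # dictionary form (hypothetical field)

    @property
    def lc(self) -> str:
        # Assumed meaning of lc: the word form in lowercase.
        return self.word.lower()

    @property
    def lemma_lc(self) -> str:
        # Assumed meaning of lemma_lc: the lemma in lowercase.
        return self.lemma.lower()

def find(tokens, attr, value):
    """Return the indices of tokens whose attribute equals the given value."""
    return [i for i, tok in enumerate(tokens) if getattr(tok, attr) == value]

# Made-up sample, not taken from the corpus.
tokens = [
    Token("Polska", "Polska"),
    Token("jest", "być"),
    Token("polska", "polski"),
    Token("POLSKA", "Polska"),
]

exact_hits = find(tokens, "word", "polska")        # case-sensitive match
lc_hits = find(tokens, "lc", "polska")             # ignores case of the word form
lemma_hits = find(tokens, "lemma_lc", "polska")    # ignores case of the lemma
print(exact_hits, lc_hits, lemma_hits)
```

Storing the lowercased value as a separate attribute means a query can stay a plain equality test, with no case folding needed at search time.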
How to cite NKJP_1M
Przepiórkowski, A. – Degórski, Ł. – Murzynowski, G. – Szałkiewicz, Ł. – Czelakowska, A. – Savary, A. – Głowińska, K.: NKJP_1M: ręcznie znakowany milionowy podkorpus NKJP. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: http://www.korpus.cz
Degórski, Ł. – Przepiórkowski, A. (2012): Ręcznie znakowany milionowy podkorpus NKJP. In: A. Przepiórkowski – M. Bańko – R. L. Górski – B. Lewandowska-Tomaszczyk (eds), Narodowy Korpus Języka Polskiego, pp. 51–58. Warszawa: Wydawnictwo Naukowe PWN. ISBN 978-83-01-16700-4.
– Adrian Zasina, Michal Škrabal