Next revision | Previous revision |
en:cnk:nkjp [2018/11/02 13:05] – created adrianzasina | en:cnk:nkjp [2018/11/12 16:09] (current) – [Corpus NKJP_1M] michalkren |
---|
~~NOTOC~~ | ~~NOTOC~~ |
====== Corpus NKJP_1M ====== | ====== The NKJP_1M corpus ====== |
| |
NKJP_1M corpus is a manually annotated one-milion word subcorpus of the [[http://nkjp.pl| National Corpus of Polish]] (NKJP – //Narodowy Korpus Języka Polskiego//) compiled of various text samples (see below). It is a corpus of contemporary Polish with texts published after the year 1945; it contains written, spoken and web communication. Corpus has lemmatisation and morphological annotation; bearing in mind text categorisation is representative. | The NKJP_1M corpus is a manually annotated one-million-word subcorpus of the [[http://nkjp.pl| National Corpus of Polish]] (NKJP – //Narodowy Korpus Języka Polskiego//), composed of various text samples (see below). It is a corpus of contemporary Polish with texts published after the year 1945; it contains written, spoken and web communication. The corpus features lemmatisation, morphological annotation, and representative coverage of text categories. |
| |
<WRAP right 35%> | <WRAP right 35%> |
^ <fs medium>Name</fs> ^^ <fs medium>NKJP_1M</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>NKJP_1M</fs> ^ |
^ Positions ^ Number of positions (tokens) | 1 215 513 | | ^ Positions ^ Number of positions (tokens) | 1,215,513 | |
^ ::: ^ Number of positions (excl. punctuation) | 992 014 | | ^ ::: ^ Number of positions (excl. punctuation) | 992,014 | |
^ ::: ^ Number of word forms | 143 477 | | ^ ::: ^ Number of word forms | 143,477 | |
^ ::: ^ Number of lemmas | 54 174 | | ^ ::: ^ Number of lemmas | 54,174 | |
^ Structures ^ Number of documents <doc> | 3 889 | | ^ Structures ^ Number of documents <doc> | 3,889 | |
^ ::: ^ Number of paragraphs <p> | 18 484 | | ^ ::: ^ Number of paragraphs <p> | 18,484 | |
^ ::: ^ Number of sentences <s> | 85 663 | | ^ ::: ^ Number of sentences <s> | 85,663 | |
^ Further information ^ Reference corpus | ANO | | ^ Further information ^ Reference corpus | YES | |
^ ::: ^ Representative corpus | ANO | | ^ ::: ^ Representative corpus | YES | |
^ ::: ^ Publication year | 2018 | | ^ ::: ^ Publication year | 2018 | |
</WRAP> | </WRAP> |
| |
===== Text classification ===== | ===== Text classification ===== |
Text classification of NKJP_1M combines traditional and thematic-genre text categorisation. Text categorisation into genre (in Polish corpus terminology rather //type//) is often integrated with the medium (in Polish corpus terminology rather //communication channel//) categorisation, nevertheless these two type of categorisation are separate. | |
| The classification of NKJP_1M texts combines traditional criteria with a categorisation based on topic and genre. Genre categories (or //types//, in Polish corpus terminology) are often integrated with medium categories (or //communication channels//, in Polish corpus terminology), but formally, these two types of categorisation remain separate. |
^Communication layer^ doc.genre ^ Category ^ Proportion ^ | ^Communication layer^ doc.genre ^ Category ^ Proportion ^ |
| written | #typ_publ | journalism | 48,85 %| | | written | #typ_publ | journalism | 48.85%| |
| ::: | #typ_lit | fiction | 17,04 %| | | ::: | #typ_lit | fiction | 17.04%| |
| ::: | #typ_fakt | non-fiction | 5,34 %| | | ::: | #typ_fakt | non-fiction | 5.34%| |
| ::: | #typ_inf-por | informative type | 5,62 %| | | ::: | #typ_inf-por | informative texts | 5.62%| |
| ::: | #typ_urzed | legal texts | 2,97 %| | | ::: | #typ_urzed | legal texts | 2.97%| |
| ::: | #typ_nd | scientific and teaching texts | 1,91 %| | | ::: | #typ_nd | popular science texts | 1.91%| |
| ::: | #typ_nklas | non-fiction unclassified book | 1,00 %| | | ::: | #typ_nklas | non-fiction unclassified book | 1.00%| |
| ::: | #typ_listy | correspondence| 0,04 %| | | ::: | #typ_listy | correspondence| 0.04%| |
| ::: | #typ_lit_poezja | poetry | 0,01 %| | | ::: | #typ_lit_poezja | poetry | 0.01%| |
| spoken | #typ_qmow | quasi-spoken texts | 2,50 %| | | spoken | #typ_qmow | quasi-spoken texts | 2.50%| |
| ::: | #typ_media | spoken media text | 2,07 %| | | ::: | #typ_media | spoken media text | 2.07%| |
| ::: | #typ_konwers | spoken conversational texts | 5,57 %| | | ::: | #typ_konwers | spoken conversational texts | 5.57%| |
| web | #typ_net_interakt | dynamic Internet texts | 5,18 %| | | web | #typ_net_interakt | interaction-based Internet texts | 5.18%| |
| ::: | #typ_net_nieinterakt | static Internet texts | 1,91 %| | | ::: | #typ_net_nieinterakt | non-interaction-based Internet texts | 1.91%| |
| |
===== Positional annotation and tagging ===== | ===== Positional annotation and tagging ===== |
| |
Compared to the Czech corpora NKJP_1M has in addition a positional attribute which is specific for Polish, so called **flexeme**. It is a category based on part of speeches that are further divided into more specific lexeme classes. Thus, there are for example within nouns (//subst//) differed depreciative nouns (//depr//), beside of common adjectives (//adj//) there are ad-adjectival adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//), particulary is delicate the distinction of verbal categories (more than 10 different flexemes). | Compared to typical corpora of Czech, NKJP_1M additionally has a positional attribute which is specific to Polish, the so-called **flexeme**. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (//adj//), the first part of compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//); and there is a particularly fine-grained subcategorisation of verbs (more than 10 different flexemes). |
| |
Moreover, the Polish tagset differs from the Czech one; its detailed description (including the list of all flexemes) is available [[http://nkjp.pl/poliqarp/help/ense2.html|here]]. | Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme list) is available [[http://nkjp.pl/poliqarp/help/ense2.html|here]]. |
| |
In addtion, two positional attributes were added to the original corpus: ''lc'' and ''lemma_lc'', which allow to search regardless of case-sensitive in the corpus. | In addition, two positional attributes were added to the original corpus: ''lc'' and ''lemma_lc'', which allow the corpus to be searched case-insensitively. |
| |
====== How to cite NKJP_1M ====== | ====== How to cite NKJP_1M ====== |