
Differences

This shows you the differences between two versions of the page.

en:cnk:nkjp [2018/11/02 18:19]
David Lukeš language proofreading
en:cnk:nkjp [2018/11/12 16:09] (current)
Michal Křen [Corpus NKJP_1M]
Line 1: Line 1:
 ~~NOTOC~~
-====== Corpus NKJP_1M ======
+====== The NKJP_1M corpus ======
  
 The NKJP_1M corpus is a manually annotated one million word subcorpus of the [[http://nkjp.pl| National Corpus of Polish]] (NKJP – //Narodowy Korpus Języka Polskiego//), composed of various text samples (see below). It is a corpus of contemporary Polish with texts published after the year 1945; it contains written, spoken and web communication. The corpus features lemmatisation, morphological annotation, and representative coverage of text categories.
Line 6: Line 6:
 <WRAP right 35%>
 ^ <fs medium>Name</fs> ^^ <fs medium>NKJP_1M</fs> ^
-^ Positions ^ Number of positions (tokens) |  1 215 513 |
+^ Positions ^ Number of positions (tokens) |  1,215,513 |
-^ ::: ^ Number of positions (excl. punctuation) |  992 014 |
+^ ::: ^ Number of positions (excl. punctuation) |  992,014 |
-^ ::: ^ Number of word forms |  143 477 |
+^ ::: ^ Number of word forms |  143,477 |
-^ ::: ^ Number of lemmas |  54 174 |
+^ ::: ^ Number of lemmas |  54,174 |
-^ Structures ^ Number of documents <doc> |  3 889 |
+^ Structures ^ Number of documents <doc> |  3,889 |
-^ ::: ^ Number of paragraphs <p> |  18 484 |
+^ ::: ^ Number of paragraphs <p> |  18,484 |
-^ ::: ^ Number of sentences <s> |  85 663 |
+^ ::: ^ Number of sentences <s> |  85,663 |
 ^ Further information ^ Reference corpus |  YES |
 ^ ::: ^ Representative corpus |  YES |
Line 23: Line 23:
  
 ^Communication layer^ doc.genre ^ Category ^ Proportion ^
-| written | #typ_publ | journalism |  48,85 %|
+| written | #typ_publ | journalism |  48.85%|
-| ::: | #typ_lit | fiction |  17,04 %|
+| ::: | #typ_lit | fiction |  17.04%|
-| ::: | #typ_fakt | non-fiction |  5,34 %|
+| ::: | #typ_fakt | non-fiction |  5.34%|
-| ::: | #typ_inf-por | informative texts |  5,62 %|
+| ::: | #typ_inf-por | informative texts |  5.62%|
-| ::: | #typ_urzed | legal texts |  2,97 %|
+| ::: | #typ_urzed | legal texts |  2.97%|
-| ::: | #typ_nd | popular science texts |  1,91 %|
+| ::: | #typ_nd | popular science texts |  1.91%|
-| ::: | #typ_nklas | non-fiction |  1,00 %|
+| ::: | #typ_nklas | non-fiction unclassified book |  1.00%|
-| ::: | #typ_listy | correspondence|  0,04 %|
+| ::: | #typ_listy | correspondence|  0.04%|
-| ::: | #typ_lit_poezja | poetry |  0,01 %|
+| ::: | #typ_lit_poezja | poetry |  0.01%|
-| spoken | #typ_qmow | quasi-spoken texts |  2,50 %|
+| spoken | #typ_qmow | quasi-spoken texts |  2.50%|
-| ::: | #typ_media | spoken media text |  2,07 %|
+| ::: | #typ_media | spoken media text |  2.07%|
-| ::: | #typ_konwers | spoken conversational texts |  5,57 %|
+| ::: | #typ_konwers | spoken conversational texts |  5.57%|
-| web | #typ_net_interakt | interaction-based Internet texts |  5,18 %|
+| web | #typ_net_interakt | interaction-based Internet texts |  5.18%|
-| ::: | #typ_net_nieinterakt | non-interaction-based Internet texts |  1,91 %|
+| ::: | #typ_net_nieinterakt | non-interaction-based Internet texts |  1.91%|
  
 ===== Positional annotation and tagging =====
  
-Compared to typical corpora of Czech, NKJP_1M additionally has a positional attribute which is specific for Polish, the so-called **flexeme**. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (//adj//), compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//); and there is a particularly fine-grained subcategorization of verbs (more than 10 different flexemes).
+Compared to typical corpora of Czech, NKJP_1M additionally has a positional attribute which is specific for Polish, the so-called **flexeme**. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (//adj//), the first part of compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//); and there is a particularly fine-grained subcategorization of verbs (more than 10 different flexemes).
  
 Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme list) is available [[http://nkjp.pl/poliqarp/help/ense2.html|here]].