Differences
This shows you the differences between two versions of the page.
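Most of the changes in this revision are number-format normalisations: space-grouped counts such as `1 215 513` become `1,215,513`, and percentages with a decimal comma such as `48,85 %` become `48.85%`. The following is a minimal sketch of that kind of conversion in Python. The helper names `normalise_count` and `normalise_percent` are hypothetical and are not part of the wiki or of any corpus tool; the sketch assumes the inputs are well-formed values like the ones in the tables.

```python
import re


def normalise_count(text: str) -> str:
    """Rewrite a space-grouped integer (e.g. '1 215 513') with comma grouping."""
    # Drop every whitespace character, including non-breaking spaces.
    digits = re.sub(r"\s+", "", text)
    # The ',' format specifier inserts a comma between each group of three digits.
    return f"{int(digits):,}"


def normalise_percent(text: str) -> str:
    """Rewrite a decimal-comma percentage (e.g. '48,85 %') as '48.85%'."""
    # Remove the percent sign and surrounding spaces, then swap the decimal mark.
    value = text.replace("%", "").strip().replace(",", ".")
    return f"{value}%"


if __name__ == "__main__":
    for raw in ["1 215 513", "992 014", "3 889"]:
        print(raw, "->", normalise_count(raw))
    for raw in ["48,85 %", "1,00 %"]:
        print(raw, "->", normalise_percent(raw))
```

The percentage helper works on the string rather than converting to a float, so trailing zeros (as in `1,00 %`) are preserved.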
en:cnk:nkjp [2018/11/02 13:05] – created adrianzasina | en:cnk:nkjp [2018/11/06 10:33] – [Corpus NKJP_1M] numbers adrianzasina | ||
---|---|---|---|
Line 2: | Line 2: | ||
====== Corpus NKJP_1M ====== | ====== Corpus NKJP_1M ====== | ||
- | NKJP_1M corpus is a manually annotated one-milion | + | The NKJP_1M corpus is a manually annotated one million |
<WRAP right 35%> | <WRAP right 35%> | ||
^ <fs medium> | ^ <fs medium> | ||
- | ^ Positions ^ Number of positions (tokens) | 1 215 513 | | + | ^ Positions ^ Number of positions (tokens) | 1,215,513 | |
- | ^ ::: ^ Number of positions (excl. punctuation) | 992 014 | | + | ^ ::: ^ Number of positions (excl. punctuation) | 992,014 | |
- | ^ ::: ^ Number of word forms | 143 477 | | + | ^ ::: ^ Number of word forms | 143,477 | |
- | ^ ::: ^ Number of lemmas | 54 174 | | + | ^ ::: ^ Number of lemmas | 54,174 | |
- | ^ Structures ^ Number of documents <doc> | 3 889 | | + | ^ Structures ^ Number of documents <doc> | 3,889 | |
- | ^ ::: ^ Number of paragraphs <p> | 18 484 | | + | ^ ::: ^ Number of paragraphs <p> | 18,484 | |
- | ^ ::: ^ Number of sentences <s> | 85 663 | | + | ^ ::: ^ Number of sentences <s> | 85,663 | |
- | ^ Further information ^ Reference corpus | | + | ^ Further information ^ Reference corpus | |
- | ^ ::: ^ Representative corpus | | + | ^ ::: ^ Representative corpus | |
^ ::: ^ Publication year | 2018 | | ^ ::: ^ Publication year | 2018 | | ||
</WRAP> | </WRAP> | ||
===== Text classification ===== | ===== Text classification ===== | ||
- | Text classification of NKJP_1M combines traditional and thematic-genre text categorisation. Text categorisation into genre (in Polish corpus terminology rather | + | |
+ | The classification of NKJP_1M | ||
^Communication layer^ doc.genre ^ Category ^ Proportion ^ | ^Communication layer^ doc.genre ^ Category ^ Proportion ^ | ||
- | | written | #typ_publ | journalism | 48,85 %| | + | | written | #typ_publ | journalism | 48.85%| |
- | | ::: | #typ_lit | fiction | 17,04 %| | + | | ::: | #typ_lit | fiction | 17.04%| |
- | | ::: | #typ_fakt | non-fiction | 5,34 %| | + | | ::: | #typ_fakt | non-fiction | 5.34%| |
- | | ::: | # | + | | ::: | # |
- | | ::: | #typ_urzed | legal texts | 2,97 %| | + | | ::: | #typ_urzed | legal texts | 2.97%| |
- | | ::: | #typ_nd | scientific and teaching | + | | ::: | #typ_nd | popular science |
- | | ::: | #typ_nklas | non-fiction unclassified book | 1,00 %| | + | | ::: | #typ_nklas | non-fiction unclassified book | 1.00%| |
- | | ::: | #typ_listy | correspondence| | + | | ::: | #typ_listy | correspondence| |
- | | ::: | # | + | | ::: | # |
- | | spoken | #typ_qmow | quasi-spoken texts | 2,50 %| | + | | spoken | #typ_qmow | quasi-spoken texts | 2.50%| |
- | | ::: | #typ_media | spoken media text | 2,07 %| | + | | ::: | #typ_media | spoken media text | 2.07%| |
- | | ::: | # | + | | ::: | # |
- | | web | # | + | | web | # |
- | | ::: | # | + | | ::: | # |
===== Positional annotation and tagging ===== | ===== Positional annotation and tagging ===== | ||
- | Compared to the Czech corpora NKJP_1M has in addition | + | Compared to typical |
- | Moreover, the Polish tagset differs from the Czech one; its detailed description (including the list of all flexemes) is available [[http:// | + | Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme |
- | In addtion, two positional attributes were added to the original corpus: '' | + | In addition, two positional attributes were added to the original corpus: '' |
====== How to cite NKJP_1M ====== | ====== How to cite NKJP_1M ====== |