Differences

This shows you the differences between two versions of the page.

--- en:cnk:nkjp [2018/11/05 12:20] – [Text classification] adrianzasina
+++ en:cnk:nkjp [2018/11/12 16:09] (current) – [Corpus NKJP_1M] michalkren
@@ Line 1: / Line 1: @@
 ~~NOTOC~~
-====== Corpus NKJP_1M ======
+====== The NKJP_1M corpus ======
 The NKJP_1M corpus is a manually annotated one-million-word subcorpus of the [[http://nkjp.pl| National Corpus of Polish]] (NKJP – //Narodowy Korpus Języka Polskiego//), composed of various text samples (see below). It is a corpus of contemporary Polish with texts published after 1945; it contains written, spoken and web communication. The corpus features lemmatisation, morphological annotation, and representative coverage of text categories.
@@ Line 6: / Line 6: @@
 <WRAP right 35%>
 ^ <fs medium>Name</fs> ^^ <fs medium>NKJP_1M</fs> ^
-^ Positions ^ Number of positions (tokens) |  1 215 513 |
+^ Positions ^ Number of positions (tokens) |  1,215,513 |
-^ ::: ^ Number of positions (excl. punctuation) |  992 014 |
+^ ::: ^ Number of positions (excl. punctuation) |  992,014 |
-^ ::: ^ Number of word forms |  143 477 |
+^ ::: ^ Number of word forms |  143,477 |
-^ ::: ^ Number of lemmas |  54 174 |
+^ ::: ^ Number of lemmas |  54,174 |
-^ Structures ^ Number of documents <doc> |  3 889 |
+^ Structures ^ Number of documents <doc> |  3,889 |
-^ ::: ^ Number of paragraphs <p> |  18 484 |
+^ ::: ^ Number of paragraphs <p> |  18,484 |
-^ ::: ^ Number of sentences <s> |  85 663 |
+^ ::: ^ Number of sentences <s> |  85,663 |
 ^ Further information ^ Reference corpus |  YES |
 ^ ::: ^ Representative corpus |  YES |
@@ Line 23: / Line 23: @@
 ^Communication layer^ doc.genre ^ Category ^ Proportion ^
-| written | #typ_publ | journalism |  48,85 %|
+| written | #typ_publ | journalism |  48.85%|
-| ::: | #typ_lit | fiction |  17,04 %|
+| ::: | #typ_lit | fiction |  17.04%|
-| ::: | #typ_fakt | non-fiction |  5,34 %|
+| ::: | #typ_fakt | non-fiction |  5.34%|
-| ::: | #typ_inf-por | informative texts |  5,62 %|
+| ::: | #typ_inf-por | informative texts |  5.62%|
-| ::: | #typ_urzed | legal texts |  2,97 %|
+| ::: | #typ_urzed | legal texts |  2.97%|
-| ::: | #typ_nd | popular science texts |  1,91 %|
+| ::: | #typ_nd | popular science texts |  1.91%|
-| ::: | #typ_nklas | non-fiction unclassified book |  1,00 %|
+| ::: | #typ_nklas | non-fiction unclassified book |  1.00%|
-| ::: | #typ_listy | correspondence|  0,04 %|
+| ::: | #typ_listy | correspondence|  0.04%|
-| ::: | #typ_lit_poezja | poetry |  0,01 %|
+| ::: | #typ_lit_poezja | poetry |  0.01%|
-| spoken | #typ_qmow | quasi-spoken texts |  2,50 %|
+| spoken | #typ_qmow | quasi-spoken texts |  2.50%|
-| ::: | #typ_media | spoken media text |  2,07 %|
+| ::: | #typ_media | spoken media text |  2.07%|
-| ::: | #typ_konwers | spoken conversational texts |  5,57 %|
+| ::: | #typ_konwers | spoken conversational texts |  5.57%|
-| web | #typ_net_interakt | interaction-based Internet texts |  5,18 %|
+| web | #typ_net_interakt | interaction-based Internet texts |  5.18%|
-| ::: | #typ_net_nieinterakt | non-interaction-based Internet texts |  1,91 %|
+| ::: | #typ_net_nieinterakt | non-interaction-based Internet texts |  1.91%|
 ===== Positional annotation and tagging =====
-Compared to typical corpora of Czech, NKJP_1M additionally has a positional attribute which is specific for Polish, the so-called **flexeme**. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (//adj//), compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//); and there is a particularly fine-grained subcategorization of verbs (more than 10 different flexemes).
+Compared to typical corpora of Czech, NKJP_1M additionally has a positional attribute which is specific for Polish, the so-called **flexeme**. It is a category which further subdivides parts of speech into more specific lexeme classes. For instance, within nouns (//subst//), depreciative nouns (//depr//) form one of the flexeme subgroups; flexemes also distinguish between regular adjectives (//adj//), the first part of compound adjectives (//adja//, e.g. //__biało__-czerwony//, //__sportowo__-rekreacyjny//), post-prepositional adjectives (//adjp//, e.g. //po __polsku__//, //od __dawna__//), and predicative adjectives (//adjc//, e.g. //jestem __pewien__//, //był __wesół__ i __zdrów__//); and there is a particularly fine-grained subcategorization of verbs (more than 10 different flexemes).
 Moreover, the Polish tagset differs from the Czech one; its detailed description (including the full flexeme list) is available [[http://nkjp.pl/poliqarp/help/ense2.html|here]].
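
The flexeme subdivision described in the revised text can be pictured as a lookup from flexeme tag to a coarser part of speech. The following is a minimal sketch only: it covers just the six tags named above, and the ''COARSE_POS'' grouping and the ''group_by_pos'' helper are hypothetical names introduced for illustration, not part of NKJP or of any corpus tool.

```python
# Flexeme tags named in the text above, with the glosses given there.
# This is only a small part of the tagset; the full flexeme list is in the
# NKJP documentation linked from the page.
FLEXEMES = {
    "subst": "noun",
    "depr": "depreciative noun",
    "adj": "regular adjective",
    "adja": "first part of a compound adjective",
    "adjp": "post-prepositional adjective",
    "adjc": "predicative adjective",
}

# Hypothetical coarse grouping (an assumption made for this illustration).
COARSE_POS = {
    "subst": "noun",
    "depr": "noun",
    "adj": "adjective",
    "adja": "adjective",
    "adjp": "adjective",
    "adjc": "adjective",
}


def group_by_pos(tokens):
    """Group (word, flexeme) pairs by coarse part of speech.

    Flexemes missing from COARSE_POS (e.g. the verbal ones) go to "other".
    """
    groups = {}
    for word, flexeme in tokens:
        pos = COARSE_POS.get(flexeme, "other")
        groups.setdefault(pos, []).append((word, flexeme))
    return groups


# Example words taken from the text above.
sample = [("biało", "adja"), ("polsku", "adjp"), ("pewien", "adjc")]
grouped = group_by_pos(sample)
```

Here all three sample words end up under ''adjective'', even though each carries a different flexeme; this is the extra level of detail the flexeme attribute adds on top of a plain part-of-speech label.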
