Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2019/10/31 19:24] Alexandr Rosen
en:cnk:uvod [2019/12/20 15:48] – [Corpora of the Czech National Corpus project] Michal Křen
Line 7: Line 7:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze7|version 7]]) |  4.255G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 16: Line 16:
 | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 | **Specialized corpora** ||||||
+| [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
+| [[en:cnk:capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek |
+| [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prosaic texts written by Jan Čep |
 | [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers  |
 | [[en:cnk:czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
-| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | same as CZESL-SGT except for a reduced set of metadata in the **Restrict search** section of the search interface |
+| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | CZESL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
Line 25: Line 28:
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
+| [[en:cnk:net|NET]] |  41M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
Line 30: Line 34:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** ||||||
+| [[en:cnk:orator|ORATOR]] |  580k |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
 | [[en:cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 48: Line 53:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze11|version 11]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze12|version 12]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus for 40 languages |
 | **Comparable corpora** ||||||
-| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
+| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
 | [[en:cnk:dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German |
 | [[en:cnk:frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French |