Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2018/12/20 12:58]
Michal Škrabal [Corpora of the Czech National Corpus project]
en:cnk:uvod [2019/12/20 15:48] (current)
Michal Křen [Corpora of the Czech National Corpus project]
Line 7: Line 7:
 ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  released((For versioned corpora (e.g. [[en:​cnk:​syn|SYN]] or [[en:​cnk:​intercorp|InterCorp]]),​ the year when the first version was released is stated.)) ​ ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  released((For versioned corpora (e.g. [[en:​cnk:​syn|SYN]] or [[en:​cnk:​intercorp|InterCorp]]),​ the year when the first version was released is stated.)) ​ ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze7|version 7]]) |  4.255G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:​cnk:​syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | [[en:​cnk:​syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:​cnk:​syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 | | [[en:​cnk:​syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 16: Line 16:
 | [[en:​cnk:​syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 | | [[en:​cnk:​syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
 +| [[en:​cnk:​capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
 +| [[en:​cnk:​capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek |
 +| [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prose texts written by Jan Čep |
 | [[en:​cnk:​czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers ​ | | [[en:​cnk:​czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers ​ |
 | [[en:​cnk:​czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | | [[en:​cnk:​czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
 +| [[en:​cnk:​czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | CZESL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:​cnk:​fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction | | [[en:​cnk:​fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:​cnk:​fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:​cnk:​syn2000|SYN2000]],​ source of the Frequency Dictionary of Czech | | [[en:​cnk:​fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:​cnk:​syn2000|SYN2000]],​ source of the Frequency Dictionary of Czech |
Line 24: Line 28:
 | [[en:​cnk:​ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004| | [[en:​cnk:​ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
 | [[en:​cnk:​link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts | | [[en:​cnk:​link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 +| [[en:​cnk:​net|NET]] |  41M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 | [[en:​cnk:​orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell'​s novel [[wp>​Nineteen_Eighty-Four|1984]],​ manually annotated ​ | | [[en:​cnk:​orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell'​s novel [[wp>​Nineteen_Eighty-Four|1984]],​ manually annotated ​ |
 | [[en:​cnk:​skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays | | [[en:​cnk:​skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
Line 29: Line 34:
 ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
 +| [[en:​cnk:​orator|ORATOR]] |  580k |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
 | [[en:​cnk:​ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | | [[en:​cnk:​ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | [[en:cnk:oral|ORAL]] |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 47: Line 53:
 ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas ​ ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** |||||| | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze11|version 11]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze12|version 12]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus for 40 languages |
 | **Comparable corpora** |||||| | **Comparable corpora** ||||||
-| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
+| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
 | [[en:​cnk:​dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German | | [[en:​cnk:​dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German |
 | [[en:​cnk:​frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French | | [[en:​cnk:​frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French |