Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2019/12/20 11:48] – [Corpora of the Czech National Corpus project] michalkren
en:cnk:uvod [2020/12/22 16:59] – [Corpora of the Czech National Corpus project] michalskrabal
Line 8: Line 8:
 | **General corpora** ||||||
 | [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 +| [[en:cnk:syn2015|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 16: Line 17:
 | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 | **Specialized corpora** ||||||
-| [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers |
-| [[en:cnk:czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
-| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | CZESL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 +| [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
 +| [[en:cnk:capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek |
 +| [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prosaic texts written by Jan Čep |
 +| [[en:cnk:czesl-man|CzeSL-man]] |  100k |  ✓  |  ✓  |  2016  | non-reference learner corpus of non-native Czech speakers with manual error annotation |
 +| [[en:cnk:czesl-plain|CzeSL-plain]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers |
 +| [[en:cnk:czesl-sgt|CzeSL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
 +| [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
Line 25: Line 30:
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 +| [[en:cnk:net|NET]] |  41M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 +| [[en:cnk:online|ONLINE]] |  > 6 bil. |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
Line 30: Line 37:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** ||||||
 +| [[en:cnk:orator|ORATOR]] |  580k |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
 | [[en:cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 48: Line 56:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze12|version 12]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the InterCorp project |
 +| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13|version 13]]) |  1.8G |  (✓)  |  (✓)  |  2008–2020  | versioned parallel corpus for 40 languages |
 | **Comparable corpora** ||||||
 | [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |