en:cnk:uvod [2019/12/20 15:36] – [Corpora of the Czech National Corpus project] michalkren | en:cnk:uvod [2020/12/25 21:48] – [Corpora of the Czech National Corpus project] michalkren |
---|
| **General corpora** |||||| | | **General corpora** |||||| |
| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) | 4.5G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora | | | [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) | 4.5G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora | |
| | [[en:cnk:syn2015|SYN2020]] | 100M | ✓ | ✓ | 2020 | reference representative corpus, most of the texts are from 2014--2019 | |
| [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | | [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | |
| [[en:cnk:syn2013PUB|SYN2013PUB]] | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005--2009 | | | [[en:cnk:syn2013PUB|SYN2013PUB]] | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005--2009 | |
| [[en:cnk:syn2000|SYN2000]] | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990--1999 | | | [[en:cnk:syn2000|SYN2000]] | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990--1999 | |
| **Specialized corpora** |||||| | | **Specialized corpora** |||||| |
| [[en:cnk:czesl-plain|CZESL-PLAIN]] | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers | | | [[en:cnk:capek|Capek]] | 2.3M | ✓ | ✓ | 2007 | author corpus of texts written exclusively by Karel Čapek | |
| [[en:cnk:czesl-sgt|CZESL-SGT]] | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | | | [[en:cnk:capek|Capek_uplny]] | 2.5M | ✓ | ✓ | 2007 | author corpus of texts written or co-authored by Karel Čapek | |
| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] | 960k | ✓ | ✓ | 2019 | CZESL-SGT with a reduced set of metadata in the Restrict search section of the search interface | | | [[en:cnk:cep|Cep]] | 420k | ✓ | ✓ | 2015 | author corpus of prose texts written by Jan Čep | |
| | [[en:cnk:czesl-man|CzeSL-man]] | 100k | ✓ | ✓ | 2016 | non-reference learner corpus of non-native Czech speakers with manual error annotation | |
| | [[en:cnk:czesl-plain|CzeSL-plain]] | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers | |
| | [[en:cnk:czesl-sgt|CzeSL-SGT]] | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | |
| | [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] | 960k | ✓ | ✓ | 2019 | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface | |
| [[en:cnk:fictree|FicTree]] | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction | | | [[en:cnk:fictree|FicTree]] | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction | |
| [[en:cnk:fsc2000|FSC2000]] | 100M | ✓ | ✗ | 2004 | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | | | [[en:cnk:fsc2000|FSC2000]] | 100M | ✓ | ✗ | 2004 | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | |
| [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | | | [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | |
| [[en:cnk:net|NET]] | 41M | ✓ | ✓ | 2019 | corpus of semi-official internet communication | | | [[en:cnk:net|NET]] | 41M | ✓ | ✓ | 2019 | corpus of semi-official internet communication | |
| | [[en:cnk:online|ONLINE]] | > 6G | ✓ | ✓ | 2020 | monitor corpus of the Czech internet | |
| [[en:cnk:orwell|ORWELL]] | 80k | ✓ | ✓ | 2003 | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated | | | [[en:cnk:orwell|ORWELL]] | 80k | ✓ | ✓ | 2003 | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated | |
| [[en:cnk:skript2012|SKRIPT2012]] | 590k | ✓ | ✓ | 2013 | corpus of school essays | | | [[en:cnk:skript2012|SKRIPT2012]] | 590k | ✓ | ✓ | 2013 | corpus of school essays | |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **General corpora** |||||| | | **General corpora** |||||| |
| [[en:cnk:orator|ORATOR]] | 580k | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | | | [[en:cnk:orator|ORATOR]] | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | |
| [[en:cnk:ortofon|ORTOFON]] | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:ortofon|ORTOFON]] | 2.1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral|ORAL]] | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral|ORAL]] | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **Parallel corpora** |||||| | | **Parallel corpora** |||||| |
| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze12|version 12]]) | 1.7G | (✓) | (✓) | 2008 | versioned parallel corpus for 40 languages | | | [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13|version 13]]) | 1.8G | (✓) | (✓) | 2008–2020 | versioned parallel corpus for 40 languages | |
| **Comparable corpora** |||||| | | **Comparable corpora** |||||| |
| [[en:cnk:aranea|Aranea]] | 1G | ✓ | ✓ | 2014 | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | | | [[en:cnk:aranea|Aranea]] | 1G | ✓ | ✓ | 2014 | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | |