en:cnk:uvod [2017/06/02 14:18] – [Corpora of the Czech National Corpus project] michalkren | en:cnk:uvod [2019/10/31 19:24] – alexandrrosen |
---|
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.)) ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.)) ^ characteristic features ^ |
| **General corpora** |||||| | | **General corpora** |||||| |
| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze5|version 5]]) | 3.836G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora | | | [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze7|version 7]]) | 4.255G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora | |
| [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | | [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | |
| [[en:cnk:syn2013PUB|SYN2013PUB]] | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005--2009 | | | [[en:cnk:syn2013PUB|SYN2013PUB]] | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005--2009 | |
| [[en:cnk:czesl-plain|CZESL-PLAIN]] | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers | | | [[en:cnk:czesl-plain|CZESL-PLAIN]] | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers | |
| [[en:cnk:czesl-sgt|CZESL-SGT]] | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | | | [[en:cnk:czesl-sgt|CZESL-SGT]] | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | |
| | [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] | 960k | ✓ | ✓ | 2019 | same as CZESL-SGT except for a reduced set of metadata in the **Restrict search** section of the search interface | |
| | [[en:cnk:fictree|FicTree]] | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction | |
| [[en:cnk:fsc2000|FSC2000]] | 100M | ✓ | ✗ | 2004 | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | | | [[en:cnk:fsc2000|FSC2000]] | 100M | ✓ | ✗ | 2004 | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | |
| [[JEROME]] | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies | | | [[en:cnk:jerome|JEROME]] | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies | |
| | [[en:cnk:koditex|Koditex]] | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers | |
| [[en:cnk:ksk-dopisy|KSK-DOPISY]] | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990--2004| | | [[en:cnk:ksk-dopisy|KSK-DOPISY]] | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990--2004| |
| [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | | | [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **General corpora** |||||| | | **General corpora** |||||| |
| [[cnk:ortofon|ORTOFON]] | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:ortofon|ORTOFON]] | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | |
| [[cnk:oral|ORAL]] | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral|ORAL]] | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral2008|ORAL2008]] | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | | | [[en:cnk:oral2008|ORAL2008]] | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | |
| **Specialized corpora** |||||| | | **Specialized corpora** |||||| |
| [[en:cnk:bmk|BMK]] | 490k | ✗ | ✗ | 2002 | Brno spoken corpus | | | [[en:cnk:bmk|BMK]] | 490k | ✗ | ✗ | 2002 | Brno spoken corpus | |
| [[cnk:dialekt|DIALEKT]] | 100k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription | | | [[en:cnk:dialekt|DIALEKT]] | 100k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription | |
| [[en:cnk:lindsei_cz|LINDSEI_CZ]] | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech | | | [[en:cnk:lindsei_cz|LINDSEI_CZ]] | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech | |
| [[en:cnk:pmk|PMK]] | 675k | ✗ | ✗ | 2001 | Prague spoken corpus | | | [[en:cnk:pmk|PMK]] | 675k | ✗ | ✗ | 2001 | Prague spoken corpus | |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **Parallel corpora** |||||| | | **Parallel corpora** |||||| |
| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze9|version 9]]) | 1.46G | (✓) | (✓) | 2008 | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] | | | [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze11|version 11]]) | 1.7G | (✓) | (✓) | 2008 | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] | |
| **Comparable corpora** |||||| | | **Comparable corpora** |||||| |
| [[en:cnk:aranea|Aranea]] | 1G | ✓ | ✓ | 2014 | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | | | [[en:cnk:aranea|Aranea]] | 1G | ✓ | ✓ | 2014 | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | |
| **Specialized foreign language corpora** |||||| | | **Specialized foreign language corpora** |||||| |
| [[en:cnk:dotko|DOTKO]] | 12M | ✗ | ✗ | 2010 | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 | | | [[en:cnk:dotko|DOTKO]] | 12M | ✗ | ✗ | 2010 | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 | |
| [[cnk:eebo|EEBO]] | 730M | ✗ | ✗ | 2015 | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | | | [[en:cnk:eebo|EEBO]] | 730M | ✗ | ✗ | 2015 | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | |
| [[en:cnk:hotko|HOTKO]] | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian | | | [[en:cnk:hotko|HOTKO]] | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian | |
| [[en:cnk:lEstRepublicain|lEstRepublicain]] | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain | | | [[en:cnk:lEstRepublicain|lEstRepublicain]] | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain | |
| | [[en:cnk:nkjp|NKJP_1M]] | 1M | ✓ | ✓ | 2018 | manually annotated one-million subcorpus of the National Corpus of Polish | |