Corpora of the Czech National Corpus project
Written synchronic corpora | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
corpus | size (word count) | lemmas | morphological tags | released | characteristic features | |||||||
General corpora | ||||||||||||
SYN (version 5) | 3.836G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora | |||||||
SYN2015 | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts | |||||||
SYN2013PUB | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005–2009 | |||||||
SYN2010 | 100M | ✓ | ✓ | 2010 | reference representative corpus, most of the texts are from 2005–2009 | |||||||
SYN2009PUB | 700M | ✓ | ✓ | 2010 | reference corpus of newspapers and magazines from 1995–2007 | |||||||
SYN2006PUB | 300M | ✓ | ✓ | 2006 | reference corpus of newspapers and magazines from 1989–2004 | |||||||
SYN2005 | 100M | ✓ | ✓ | 2005 | reference representative corpus, most of the texts are from 2000–2004 | |||||||
SYN2000 | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990–1999 | |||||||
Specialized corpora | ||||||||||||
CZESL-PLAIN | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers | |||||||
CZESL-SGT | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | |||||||
FSC2000 | 100M | ✓ | ✗ | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech | |||||||
JEROME | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies | |||||||
KSK-DOPISY | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990–2004 | |||||||
LINK | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | |||||||
ORWELL | 80k | ✓ | ✓ | 2003 | Orwell's novel 1984, manually annotated | |||||||
SKRIPT2012 | 590k | ✓ | ✓ | 2013 | corpus of school essays | |||||||
Spoken synchronic corpora | ||||||||||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features | |||||||
General corpora | ||||||||||||
ORTOFON | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | |||||||
ORAL | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |||||||
ORAL2013 | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |||||||
ORAL2008 | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | |||||||
ORAL2006 | 1M | ✗ | ✗ | 2006 | reference corpus of informal spoken Czech (speakers from Bohemia only) | |||||||
Specialized corpora | ||||||||||||
BMK | 490k | ✗ | ✗ | 2002 | Brno spoken corpus | |||||||
DIALEKT | 100k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription | |||||||
LINDSEI_CZ | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech | |||||||
PMK | 675k | ✗ | ✗ | 2001 | Prague spoken corpus | |||||||
SCHOLA2010 | 790k | ✗ | ✗ | 2010 | corpus of school lessons | |||||||
SPEECHES | 215k | ✗ | ✗ | 2015 | corpus of presidential speeches | |||||||
Diachronic corpora | ||||||||||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features | |||||||
DIAKORP (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC | |||||||
Foreign language corpora | ||||||||||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features | |||||||
Parallel corpora | ||||||||||||
InterCorp (version 9) | 1.46G | (✓) | (✓) | 2008 | versioned parallel corpus being compiled as a part of the InterCorp project | |||||||
Comparable corpora | ||||||||||||
Aranea | 1G | ✓ | ✓ | 2014 | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | |||||||
deWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of German | |||||||
frWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of French | |||||||
itWaC | 1.6G | ✓ | ✓ | 2013 | web corpus of Italian | |||||||
ukWaC | 1.9G | ✓ | ✓ | 2013 | web corpus of British English | |||||||
Specialized foreign language corpora | ||||||||||||
DOTKO | 12M | ✗ | ✗ | 2010 | non-reference corpus of Lower Sorbian, most of the texts are from 1848–1933 | |||||||
EEBO | 730M | ✗ | ✗ | 2015 | English texts from the period 1475–1700, Early English Books Online | |||||||
HOTKO | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian | |||||||
lEstRepublicain | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain |