Corpora of the Czech National Corpus project
Written synchronic corpora | |||||
---|---|---|---|---|---|
corpus | size (word count) | lemmas | morphological tags | released | characteristic features |
General corpora | |||||
SYN (version 5) | 3.836G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora |
SYN2015 | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010–2014, with new classification of texts |
SYN2013PUB | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005–2009 |
SYN2010 | 100M | ✓ | ✓ | 2010 | reference representative corpus, most of the texts are from 2005–2009 |
SYN2009PUB | 700M | ✓ | ✓ | 2010 | reference corpus of newspapers and magazines from 1995–2007 |
SYN2006PUB | 300M | ✓ | ✓ | 2006 | reference corpus of newspapers and magazines from 1989–2004 |
SYN2005 | 100M | ✓ | ✓ | 2005 | reference representative corpus, most of the texts are from 2000–2004 |
SYN2000 | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990–1999 |
Specialized corpora | |||||
CZESL-PLAIN | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers |
CZESL-SGT | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
FicTree | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction |
FSC2000 | 100M | ✓ | ✗ | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech |
JEROME | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies |
KSK-DOPISY | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990–2004 |
LINK | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts |
ORWELL | 80k | ✓ | ✓ | 2003 | Orwell's novel 1984, manually annotated |
SKRIPT2012 | 590k | ✓ | ✓ | 2013 | corpus of school essays |
Spoken synchronic corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
General corpora | |||||
ORTOFON | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
ORAL | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
ORAL2013 | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
ORAL2008 | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
ORAL2006 | 1M | ✗ | ✗ | 2006 | reference corpus of informal spoken Czech (speakers from Bohemia only) |
Specialized corpora | |||||
BMK | 490k | ✗ | ✗ | 2002 | Brno spoken corpus |
DIALEKT | 100k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription |
LINDSEI_CZ | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech |
PMK | 675k | ✗ | ✗ | 2001 | Prague spoken corpus |
SCHOLA2010 | 790k | ✗ | ✗ | 2010 | corpus of school lessons |
SPEECHES | 215k | ✗ | ✗ | 2015 | corpus of presidential speeches |
Diachronic corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
DIAKORP (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC |
Foreign language corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
Parallel corpora | |||||
InterCorp (version 10) | 1.48G | (✓) | (✓) | 2008 | versioned parallel corpus being compiled as a part of the InterCorp project |
Comparable corpora | |||||
Aranea | 1G | ✓ | ✓ | 2014 | comparable web corpora for several languages, mostly European (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
deWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of German |
frWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of French |
itWaC | 1.6G | ✓ | ✓ | 2013 | web corpus of Italian |
ukWaC | 1.9G | ✓ | ✓ | 2013 | web corpus of British English |
Specialized foreign language corpora | |||||
DOTKO | 12M | ✗ | ✗ | 2010 | non-reference corpus of Lower Sorbian, most of the texts are from 1848–1933 |
EEBO | 730M | ✗ | ✗ | 2015 | English texts from the period 1475–1700, Early English Books Online |
HOTKO | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian |
lEstRepublicain | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain |
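
The size column uses the suffixes k, M and G. To compare corpora numerically, these can be expanded into plain word counts. The following is a minimal sketch, assuming the suffixes stand for decimal thousands, millions and billions; the helper name `parse_size` and the `SUFFIXES` mapping are hypothetical and not part of any CNC tool.

```python
from decimal import Decimal

# Assumed meaning of the suffixes in the size column (decimal, not binary).
SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size such as '3.836G' or '960k' into an approximate word count."""
    text = text.strip().replace(",", ".")  # tolerate a decimal comma
    if len(text) < 2 or text[-1] not in SUFFIXES:
        raise ValueError(f"unrecognized size: {text!r}")
    # Decimal avoids floating-point rounding for values like 3.836.
    return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])


# A few sizes copied from the table, sorted from largest to smallest.
sizes = {
    "SYN (version 5)": "3.836G",
    "SYN2015": "100M",
    "CZESL-SGT": "960k",
    "ORWELL": "80k",
}
for name, size in sorted(sizes.items(), key=lambda kv: parse_size(kv[1]), reverse=True):
    print(f"{name}: {parse_size(size):,} words")
```

The published sizes are rounded, so the resulting counts are approximations rather than exact token totals.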