Corpora of the Czech National Corpus project

Written synchronic corpora
corpus size (word count) lemmas morphological tags released 1) characteristic features
General corpora
SYN (version 5) 3.836G 2010 versioned corpus, unification of all the SYN-series synchronic written corpora
SYN2015 100M 2015 reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts
SYN2013PUB 935M 2013 reference corpus of newspapers and magazines from 2005–2009
SYN2010 100M 2010 reference representative corpus, most of the texts are from 2005–2009
SYN2009PUB 700M 2010 reference corpus of newspapers and magazines from 1995–2007
SYN2006PUB 300M 2006 reference corpus of newspapers and magazines from 1989–2004
SYN2005 100M 2005 reference representative corpus, most of the texts are from 2000–2004
SYN2000 100M 2000 reference representative corpus, most of the texts are from 1990–1999
Specialized corpora
CZESL-PLAIN 2M 2012 non-reference learner corpus of non-native Czech speakers
CZESL-SGT 960k 2014 non-reference learner corpus of non-native speakers’ Czech with automatic annotation
FSC2000 100M 2004 modified SYN2000, source of the Frequency Dictionary of Czech
JEROME 85M 2013 monolingual comparable corpus for translation studies
KSK-DOPISY 800k 2006 transcriptions of handwritten correspondence from 1990–2004
LINK 1.8M 2010 non-reference corpus of linguistic texts
ORWELL 80k 2003 Orwell's novel 1984, manually annotated
SKRIPT2012 590k 2013 corpus of school essays
Spoken synchronic corpora
corpus size (word count) lemmas morphological tags year characteristic features
General corpora
ORTOFON 1M 2017 reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia)
ORAL 5.4M 2017 reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2013 2.8M 2013 reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2008 1M 2008 reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only)
ORAL2006 1M 2006 reference corpus of informal spoken Czech (speakers from Bohemia only)
Specialized corpora
BMK 490k 2002 Brno spoken corpus
DIALEKT 100k 2017 reference dialectal corpus with two-layer transcription
LINDSEI_CZ 120k 2017 learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech
PMK 675k 2001 Prague spoken corpus
SCHOLA2010 790k 2010 corpus of school lessons
SPEECHES 215k 2015 corpus of presidential speeches
Diachronic corpora
corpus size (word count) lemmas morphological tags year characteristic features
DIAKORP (version 6) 3.4M 2005 versioned corpus of the diachronic section of the CNC
Foreign language corpora
corpus size (word count) lemmas morphological tags year characteristic features
Parallel corpora
InterCorp (version 9) 1.46G (✓) (✓) 2008 versioned parallel corpus being compiled as a part of the InterCorp project
Comparable corpora
Aranea 1G 2014 comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh)
deWaC 1.35G 2013 web corpus of German
frWaC 1.35G 2013 web corpus of French
itWaC 1.6G 2013 web corpus of Italian
ukWaC 1.9G 2013 web corpus of British English
Specialized foreign language corpora
DOTKO 12M 2010 non-reference corpus of Lower Sorbian, most of the texts are from 1848–1933
EEBO 730M 2015 English texts from the period 1475–1700, Early English Books Online
HOTKO 36M 2013 non-reference corpus of Upper Sorbian
lEstRepublicain 73M 2013 corpus of the French newspaper L'Est Républicain
1) For versioned corpora (e.g. SYN or InterCorp), the year when the first version was released is stated.