Corpora of the Czech National Corpus project

Written synchronic corpora
corpus	size (word count)	lemmas	morphological tags	released	characteristic features
General corpora
SYN (version 7)	4.255G	✓	✓	2010	versioned corpus, unification of all the SYN-series synchronic written corpora
SYN2015	100M	✓	✓	2015	reference representative corpus, most of the texts are from 2010–2014, with new classification of texts
SYN2013PUB	935M	✓	✓	2013	reference corpus of newspapers and magazines from 2005–2009
SYN2010	100M	✓	✓	2010	reference representative corpus, most of the texts are from 2005–2009
SYN2009PUB	700M	✓	✓	2010	reference corpus of newspapers and magazines from 1995–2007
SYN2006PUB	300M	✓	✓	2006	reference corpus of newspapers and magazines from 1989–2004
SYN2005	100M	✓	✓	2005	reference representative corpus, most of the texts are from 2000–2004
SYN2000	100M	✓	✓	2000	reference representative corpus, most of the texts are from 1990–1999
Specialized corpora
CZESL-PLAIN	2M	✗	✗	2012	non-reference learner corpus of non-native Czech speakers
CZESL-SGT	960k	✓	✓	2014	non-reference learner corpus of non-native speakers’ Czech with automatic annotation
FicTree	135k	✓	✓	2017	manually annotated treebank of Czech fiction
FSC2000	100M	✓	✗	2004	modified SYN2000, source of the Frequency Dictionary of Czech
JEROME	85M	✓	✓	2013	monolingual comparable corpus for translation studies
Koditex	10.8M	✓	✓	2018	corpus for multi-dimensional analysis of Czech registers
KSK-DOPISY	800k	✗	✗	2006	transcriptions of handwritten correspondence from 1990–2004
LINK	1.8M	✓	✓	2010	non-reference corpus of linguistic texts
ORWELL	80k	✓	✓	2003	Orwell's novel 1984, manually annotated
SKRIPT2012	590k	✓	✓	2013	corpus of school essays
Spoken synchronic corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
General corpora
ORTOFON	1M	✓	✓	2017	reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia)
ORAL	5.4M	✓	✓	2017	reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2013	2.8M	✗	✗	2013	reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2008	1M	✗	✗	2008	reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only)
ORAL2006	1M	✗	✗	2006	reference corpus of informal spoken Czech (speakers from Bohemia only)
Specialized corpora
BMK	490k	✗	✗	2002	Brno spoken corpus
DIALEKT	100k	✓	✓	2017	reference dialectal corpus with two-layer transcription
LINDSEI_CZ	120k	✗	✗	2017	learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech
PMK	675k	✗	✗	2001	Prague spoken corpus
SCHOLA2010	790k	✗	✗	2010	corpus of school lessons
SPEECHES	215k	✗	✗	2015	corpus of presidential speeches
Diachronic corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
DIAKORP (version 6)	3.4M	✗	✗	2005	versioned corpus of the diachronic section of the CNC
Foreign language corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
Parallel corpora
InterCorp (version 11)	1.7G	(✓)	(✓)	2008	versioned parallel corpus being compiled as a part of the InterCorp project
Comparable corpora
Aranea	1G	✓	✓	2014	comparable web corpora for several, mostly European, languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh)
deWaC	1.35G	✓	✓	2013	web corpus of German
frWaC	1.35G	✓	✓	2013	web corpus of French
itWaC	1.6G	✓	✓	2013	web corpus of Italian
ukWaC	1.9G	✓	✓	2013	web corpus of British English
Specialized foreign language corpora
DOTKO	12M	✗	✗	2010	non-reference corpus of Lower Sorbian, most of the texts are from 1848–1933
EEBO	730M	✗	✗	2015	English texts from the period 1475–1700, Early English Books Online
HOTKO	36M	✗	✗	2013	non-reference corpus of Upper Sorbian
lEstRepublicain	73M	✓	✓	2013	corpus of the French newspaper L'Est Républicain
NKJP_1M	1M	✓	✓	2018	manually annotated one-million subcorpus of the National Corpus of Polish