Corpora of the Czech National Corpus project

Written synchronic corpora
corpus	size (word count)	lemmas	morphological tags	released	characteristic features
General corpora
SYN (version 12)	5G	✓	✓	2010–2023	versioned corpus, unification of all the SYN-series synchronic written corpora
SYN2020	100M	✓	✓	2020	reference representative corpus, most of the texts are from 2014–2019
SYN2015	100M	✓	✓	2015	reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts
SYN2013PUB	935M	✓	✓	2013	reference corpus of newspapers and magazines from 2005–2009
SYN2010	100M	✓	✓	2010	reference representative corpus, most of the texts are from 2005–2009
SYN2009PUB	700M	✓	✓	2010	reference corpus of newspapers and magazines from 1995–2007
SYN2006PUB	300M	✓	✓	2006	reference corpus of newspapers and magazines from 1989–2004
SYN2005	100M	✓	✓	2005	reference representative corpus, most of the texts are from 2000–2004
SYN2000	100M	✓	✓	2000	reference representative corpus, most of the texts are from 1990–1999
Web corpora
ONLINE (2nd generation)	> 6G	✓	✓	2020	monitor corpus of the Czech internet
NET (version 2)	176M	✓	✓	2019	corpus of semi-official internet communication
Learner corpora
CzeSL-man	100k	✓	✓	2016	non-reference learner corpus of non-native Czech speakers with manual error annotation
CzeSL-plain	2M	✗	✗	2012	non-reference learner corpus of non-native Czech speakers
CzeSL-SGT	960k	✓	✓	2014	non-reference learner corpus of non-native speakers’ Czech with automatic annotation
CzeSL-SGT-basic	960k	✓	✓	2019	CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface
SKRIPT2012	590k	✓	✓	2013	corpus of school essays
VESPA_CZ	500k	✓	✓	2022	learner corpus of written academic English by advanced speakers whose L1 is Czech
Author corpora
Capek	2.3M	✓	✓	2007	author corpus of texts written exclusively by Karel Čapek
Capek_uplny	2.5M	✓	✓	2007	author corpus of texts written or co-authored by Karel Čapek
Cep	420k	✓	✓	2015	author corpus of prose texts written by Jan Čep
KH-DOPISY	500k	✗	✗	2017	corpus of Karel Havlíček's correspondence
KH-NOVINY	1M	✗	✗	2021	corpus of Karel Havlíček's journalism
ORWELL	80k	✓	✓	2003	Orwell's novel 1984, manually annotated
Specialized corpora
Etalon	1.9M	✓	✓	2021	manually annotated corpus of Czech texts
FicTree	135k	✓	✓	2017	manually annotated treebank of Czech fiction
FSC2000	100M	✓	✗	2004	modified SYN2000, source of the Frequency Dictionary of Czech
JEROME	85M	✓	✓	2013	monolingual comparable corpus for translation studies
Koditex	10.8M	✓	✓	2018	corpus for multi-dimensional analysis of Czech registers
KSK-DOPISY	800k	✗	✗	2006	transcriptions of handwritten correspondence from 1990–2004
KSP (version 2)	37.5M	✓	✓	2022	corpus of contemporary Czech poetry published in books and on literary servers from 1990–2020
LINK	1.8M	✓	✓	2010	non-reference corpus of linguistic texts
Totalita	12.9M	✓	✓	2010	written language of the communist regime
Věda	15M	✓	✓	2023	corpus of scientific Czech, complement to the Phrase Bank of Academic Czech
Spoken synchronic corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
General corpora
ORATOR (version 2)	1.2M	✓	✓	2019	reference corpus of monologues with one-layer transcription
ORTOFON (version 3)	2.4M	✓	✓	2017	reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia)
ORAL (version 1)	5.4M	✓	✓	2017	reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2013	2.8M	✗	✗	2013	reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2008	1M	✗	✗	2008	reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only)
ORAL2006	1M	✗	✗	2006	reference corpus of informal spoken Czech (speakers from Bohemia only)
Specialized corpora
BMK	490k	✗	✗	2002	Brno spoken corpus
DIALEKT (version 2)	223k	✓	✓	2017	reference dialectal corpus with two-layer transcription
LINDSEI_CZ	120k	✗	✗	2017	learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech
PMK	675k	✗	✗	2001	Prague spoken corpus
SCHOLA2010	790k	✗	✗	2010	corpus of school lessons
SPEECHES	215k	✗	✗	2015	corpus of presidential speeches
Parlcorp	38M	✓	✓	2015	corpus of Czech parliamentary speeches (1993–2021)
Diachronic corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
DIAKORP (version 6)	3.4M	✗	✗	2005	versioned corpus of the diachronic section of the CNC
OnomOs	200k	✓	✓	2023	corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation
Foreign language corpora
corpus	size (word count)	lemmas	morphological tags	year	characteristic features
Parallel corpora
InterCorp (release 16, release 16ud)	5.3G	(✓)	(✓)	2008–2024	versioned parallel corpus for 61 languages
Psalm 77	10k	(✓)	(✓)	2023	parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek
Comparable corpora
Aranea	1G	✓	✓	2014	comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh)
deWaC	1.35G	✓	✓	2013	web corpus of German
frWaC	1.35G	✓	✓	2013	web corpus of French
itWaC	1.6G	✓	✓	2013	web corpus of Italian
ukWaC	1.9G	✓	✓	2013	web corpus of British English
Specialized foreign language corpora
Baltische Briefe	300k	✓	✓	2024	corpus of the German historical newspaper Baltische Briefe
CODIT	27M	✗	✗	2021	diachronic corpus of Italian covering a period from the 13th century until 1947
DOTKO (version 2)	15.5M	✓	✗	2010	non-reference corpus of Lower Sorbian
EEBO	730M	✗	✗	2015	English texts from the period 1475–1700, Early English Books Online
HOTKO (version 2)	36M	✗	✗	2013	non-reference corpus of Upper Sorbian
lEstRepublicain	73M	✓	✓	2013	corpus of the French newspaper L'Est Républicain
NKJP_1M	1M	✓	✓	2018	manually annotated one-million subcorpus of the National Corpus of Polish
OBC	24M	✗	✓	2021	Old Bailey Corpus, trial proceedings from 1720–1913