Corpora of the Czech National Corpus project

Written synchronic corpora
corpus size (word count) lemmas morphological tags released 1) characteristic features
General corpora
SYN (version 12) 5G 2010–2023 versioned corpus, unification of all the SYN-series synchronic written corpora
SYN2020 100M 2020 reference representative corpus, most of the texts are from 2014–2019
SYN2015 100M 2015 reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts
SYN2013PUB 935M 2013 reference corpus of newspapers and magazines from 2005–2009
SYN2010 100M 2010 reference representative corpus, most of the texts are from 2005–2009
SYN2009PUB 700M 2010 reference corpus of newspapers and magazines from 1995–2007
SYN2006PUB 300M 2006 reference corpus of newspapers and magazines from 1989–2004
SYN2005 100M 2005 reference representative corpus, most of the texts are from 2000–2004
SYN2000 100M 2000 reference representative corpus, most of the texts are from 1990–1999
Web corpora
ONLINE (2nd generation) > 6G 2020 monitor corpus of the Czech internet
NET (version 2) 176M 2019 corpus of semi-official internet communication
Learner corpora
CzeSL-man 100k 2016 non-reference learner corpus of non-native Czech speakers with manual error annotation
CzeSL-plain 2M 2012 non-reference learner corpus of non-native Czech speakers
CzeSL-SGT 960k 2014 non-reference learner corpus of non-native speakers’ Czech with automatic annotation
CzeSL-SGT-basic 960k 2019 CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface
SKRIPT2012 590k 2013 corpus of school essays
VESPA_CZ 500k 2022 learner corpus of written academic English by advanced speakers whose L1 is Czech
Author corpora
Capek 2.3M 2007 author corpus of texts written exclusively by Karel Čapek
Capek_uplny 2.5M 2007 author corpus of texts written or co-authored by Karel Čapek
Cep 420k 2015 author corpus of prose texts written by Jan Čep
KH-DOPISY 500k 2017 corpus of Karel Havlíček's correspondence
KH-NOVINY 1M 2021 corpus of Karel Havlíček's journalism
Klaus 1.5M 2024 corpus of Václav Klaus' texts
ORWELL 80k 2003 Orwell's novel 1984, manually annotated
Specialized corpora
Etalon 1.9M 2021 manually annotated corpus of Czech texts
FicTree 135k 2017 manually annotated treebank of Czech fiction
FSC2000 100M 2004 modified SYN2000, source of the Frequency Dictionary of Czech
JEROME 85M 2013 monolingual comparable corpus for translation studies
Koditex 10.8M 2018 corpus for multi-dimensional analysis of Czech registers
KSK-DOPISY 800k 2006 transcriptions of handwritten correspondence from 1990–2004
KSP (version 2) 37.5M 2022 corpus of contemporary Czech poetry published in books and on literary servers from 1990–2020
LINK 1.8M 2010 non-reference corpus of linguistic texts
Totalita 12.9M 2010 written language of the communist regime
Věda 15M 2023 corpus of scientific Czech, complement to the Phrase Bank of Academic Czech
Spoken synchronic corpora
corpus size (word count) lemmas morphological tags year characteristic features
General corpora
ORATOR (version 2) 1.2M 2019 reference corpus of monologues with one-layer transcription
ORTOFON (version 3) 2.4M 2017 reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia)
ORAL (version 1) 5.4M 2017 reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2013 2.8M 2013 reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia)
ORAL2008 1M 2008 reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only)
ORAL2006 1M 2006 reference corpus of informal spoken Czech (speakers from Bohemia only)
Specialized corpora
BMK 490k 2002 Brno spoken corpus
DIALEKT (version 2) 223k 2017 reference dialectal corpus with two-layer transcription
LINDSEI_CZ 120k 2017 learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech
PMK 675k 2001 Prague spoken corpus
SCHOLA2010 790k 2010 corpus of school lessons
SPEECHES 215k 2015 corpus of presidential speeches
Parlcorp 38M 2015 corpus of Czech parliamentary speeches (1993–2021)
Diachronic corpora
corpus size (word count) lemmas morphological tags year characteristic features
DIAKORP (version 6) 3.4M 2005 versioned corpus of the diachronic section of the CNC
OnomOs 200k 2023 corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation
Foreign language corpora
corpus size (word count) lemmas morphological tags year characteristic features
Parallel corpora
InterCorp (release 16, release 16ud) 5.3G (✓) (✓) 2008–2024 versioned parallel corpus for 61 languages
Psalm 77 10k (✓) (✓) 2023 parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek
Comparable corpora
Aranea 1G 2014 comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh)
deWaC 1.35G 2013 web corpus of German
frWaC 1.35G 2013 web corpus of French
itWaC 1.6G 2013 web corpus of Italian
ukWaC 1.9G 2013 web corpus of British English
Specialized foreign language corpora
Baltische Briefe 300k 2024 corpus of the German historical newspaper Baltische Briefe
CODIT 27M 2021 diachronic corpus of Italian covering a period from the 13th century until 1947
DOTKO (version 2) 15.5M 2010 non-reference corpus of Lower Sorbian
EEBO 730M 2015 English texts from the period 1475–1700, Early English Books Online
HOTKO (version 2) 36M 2013 non-reference corpus of Upper Sorbian
lEstRepublicain 73M 2013 corpus of the French newspaper L'Est Républicain
NKJP_1M 1M 2018 manually annotated one-million subcorpus of the National Corpus of Polish
OBC 24M 2021 Old Bailey Corpus, trial proceedings from 1720–1913
1) For versioned corpora (e.g. SYN or InterCorp), the year when the first version was released is also stated.