Corpora of the Czech National Corpus project
Written synchronic corpora | |||||
---|---|---|---|---|---|
corpus | size (word count) | lemmas | morphological tags | released | characteristic features |
General corpora | |||||
SYN (version 12) | 5G | ✓ | ✓ | 2010–2023 | versioned corpus, unification of all the SYN-series synchronic written corpora |
SYN2020 | 100M | ✓ | ✓ | 2020 | reference representative corpus, most of the texts are from 2014–2019 |
SYN2015 | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts |
SYN2013PUB | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005–2009 |
SYN2010 | 100M | ✓ | ✓ | 2010 | reference representative corpus, most of the texts are from 2005–2009 |
SYN2009PUB | 700M | ✓ | ✓ | 2010 | reference corpus of newspapers and magazines from 1995–2007 |
SYN2006PUB | 300M | ✓ | ✓ | 2006 | reference corpus of newspapers and magazines from 1989–2004 |
SYN2005 | 100M | ✓ | ✓ | 2005 | reference representative corpus, most of the texts are from 2000–2004 |
SYN2000 | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990–1999 |
Web corpora | |||||
ONLINE (2nd generation) | > 6G | ✓ | ✓ | 2020 | monitor corpus of the Czech internet |
NET (version 2) | 176M | ✓ | ✓ | 2019 | corpus of semi-official internet communication |
Learner corpora | |||||
CzeSL-man | 100k | ✓ | ✓ | 2016 | non-reference learner corpus of non-native Czech speakers with manual error annotation |
CzeSL-plain | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers |
CzeSL-SGT | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
CzeSL-SGT-basic | 960k | ✓ | ✓ | 2019 | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
SKRIPT2012 | 590k | ✓ | ✓ | 2013 | corpus of school essays |
VESPA_CZ | 500k | ✓ | ✓ | 2022 | learner corpus of written academic English by advanced speakers whose L1 is Czech |
Author corpora | |||||
Capek | 2.3M | ✓ | ✓ | 2007 | author corpus of texts written exclusively by Karel Čapek |
Capek_uplny | 2.5M | ✓ | ✓ | 2007 | author corpus of texts written or co-authored by Karel Čapek |
Cep | 420k | ✓ | ✓ | 2015 | author corpus of prose texts written by Jan Čep |
KH-DOPISY | 500k | ✗ | ✗ | 2017 | corpus of Karel Havlíček's correspondence |
KH-NOVINY | 1M | ✗ | ✗ | 2021 | corpus of Karel Havlíček's journalism |
ORWELL | 80k | ✓ | ✓ | 2003 | Orwell's novel 1984, manually annotated |
Specialized corpora | |||||
Etalon | 1.9M | ✓ | ✓ | 2021 | manually annotated corpus of Czech texts |
FicTree | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction |
FSC2000 | 100M | ✓ | ✗ | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech |
JEROME | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies |
Koditex | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers |
KSK-DOPISY | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990–2004 |
KSP (version 2) | 37.5M | ✓ | ✓ | 2022 | corpus of contemporary Czech poetry published in books and on literary servers from 1990–2020 |
LINK | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts |
Totalita | 12.9M | ✓ | ✓ | 2010 | written language of the communist regime |
Věda | 15M | ✓ | ✓ | 2023 | corpus of scientific Czech, complement to the Phrase Bank of Academic Czech |
Spoken synchronic corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
General corpora | |||||
ORATOR (version 2) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription |
ORTOFON (version 3) | 2.4M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
ORAL (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
ORAL2013 | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
ORAL2008 | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
ORAL2006 | 1M | ✗ | ✗ | 2006 | reference corpus of informal spoken Czech (speakers from Bohemia only) |
Specialized corpora | |||||
BMK | 490k | ✗ | ✗ | 2002 | Brno spoken corpus |
DIALEKT (version 2) | 223k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription |
LINDSEI_CZ | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech |
PMK | 675k | ✗ | ✗ | 2001 | Prague spoken corpus |
SCHOLA2010 | 790k | ✗ | ✗ | 2010 | corpus of school lessons |
SPEECHES | 215k | ✗ | ✗ | 2015 | corpus of presidential speeches |
Parlcorp | 38M | ✓ | ✓ | 2015 | corpus of Czech parliamentary speeches (1993–2021) |
Diachronic corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
DIAKORP (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC |
OnomOs | 200k | ✓ | ✓ | 2023 | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
Foreign language corpora | |||||
corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
Parallel corpora | |||||
InterCorp (release 16, release 16ud) | 5.3G | (✓) | (✓) | 2008–2024 | versioned parallel corpus for 61 languages |
Psalm 77 | 10k | (✓) | (✓) | 2023 | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
Comparable corpora | |||||
Aranea | 1G | ✓ | ✓ | 2014 | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
deWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of German |
frWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of French |
itWaC | 1.6G | ✓ | ✓ | 2013 | web corpus of Italian |
ukWaC | 1.9G | ✓ | ✓ | 2013 | web corpus of British English |
Specialized foreign language corpora | |||||
Baltische Briefe | 300k | ✓ | ✓ | 2024 | corpus of the historical German newspaper Baltische Briefe |
CODIT | 27M | ✗ | ✗ | 2021 | diachronic corpus of Italian covering a period from the 13th century until 1947 |
DOTKO (version 2) | 15.5M | ✓ | ✗ | 2010 | non-reference corpus of Lower Sorbian |
EEBO | 730M | ✗ | ✗ | 2015 | English texts from the period 1475–1700, Early English Books Online |
HOTKO (version 2) | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian |
lEstRepublicain | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain |
NKJP_1M | 1M | ✓ | ✓ | 2018 | manually annotated one-million subcorpus of the National Corpus of Polish |
OBC | 24M | ✗ | ✓ | 2021 | Old Bailey Corpus, trial proceedings from 1720–1913 |