Corpora of the Czech National Corpus project
| Written synchronic corpora | |||||
|---|---|---|---|---|---|
| corpus | size (word count) | lemmas | morphological tags | released | characteristic features |
| General corpora | |||||
| SYN (version 7) | 4.255G | ✓ | ✓ | 2010 | versioned corpus, unification of all the SYN-series synchronic written corpora |
| SYN2015 | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010–2014, with a new classification of texts |
| SYN2013PUB | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005–2009 |
| SYN2010 | 100M | ✓ | ✓ | 2010 | reference representative corpus, most of the texts are from 2005–2009 |
| SYN2009PUB | 700M | ✓ | ✓ | 2010 | reference corpus of newspapers and magazines from 1995–2007 |
| SYN2006PUB | 300M | ✓ | ✓ | 2006 | reference corpus of newspapers and magazines from 1989–2004 |
| SYN2005 | 100M | ✓ | ✓ | 2005 | reference representative corpus, most of the texts are from 2000–2004 |
| SYN2000 | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990–1999 |
| Specialized corpora | |||||
| CZESL-PLAIN | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers |
| CZESL-SGT | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
| FicTree | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction |
| FSC2000 | 100M | ✓ | ✗ | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech |
| JEROME | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies |
| Koditex | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers |
| KSK-DOPISY | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990–2004 |
| LINK | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts |
| ORWELL | 80k | ✓ | ✓ | 2003 | Orwell's novel 1984, manually annotated |
| SKRIPT2012 | 590k | ✓ | ✓ | 2013 | corpus of school essays |
| Spoken synchronic corpora | |||||
| corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
| General corpora | |||||
| ORTOFON | 1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
| ORAL | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
| ORAL2013 | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
| ORAL2008 | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
| ORAL2006 | 1M | ✗ | ✗ | 2006 | reference corpus of informal spoken Czech (speakers from Bohemia only) |
| Specialized corpora | |||||
| BMK | 490k | ✗ | ✗ | 2002 | Brno spoken corpus |
| DIALEKT | 100k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription |
| LINDSEI_CZ | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech |
| PMK | 675k | ✗ | ✗ | 2001 | Prague spoken corpus |
| SCHOLA2010 | 790k | ✗ | ✗ | 2010 | corpus of school lessons |
| SPEECHES | 215k | ✗ | ✗ | 2015 | corpus of presidential speeches |
| Diachronic corpora | |||||
| corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
| DIAKORP (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC |
| Foreign language corpora | |||||
| corpus | size (word count) | lemmas | morphological tags | year | characteristic features |
| Parallel corpora | |||||
| InterCorp (version 11) | 1.7G | (✓) | (✓) | 2008 | versioned parallel corpus being compiled as a part of the InterCorp project |
| Comparable corpora | |||||
| Aranea | 1G | ✓ | ✓ | 2014 | comparable web corpora for several, mostly European, languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
| deWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of German |
| frWaC | 1.35G | ✓ | ✓ | 2013 | web corpus of French |
| itWaC | 1.6G | ✓ | ✓ | 2013 | web corpus of Italian |
| ukWaC | 1.9G | ✓ | ✓ | 2013 | web corpus of British English |
| Specialized foreign language corpora | |||||
| DOTKO | 12M | ✗ | ✗ | 2010 | non-reference corpus of Lower Sorbian, most of the texts are from 1848–1933 |
| EEBO | 730M | ✗ | ✗ | 2015 | English texts from the period 1475–1700, Early English Books Online |
| HOTKO | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian |
| lEstRepublicain | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain |
| NKJP_1M | 1M | ✓ | ✓ | 2018 | manually annotated one-million-word subcorpus of the National Corpus of Polish |