~~NOTOC~~
====== Corpora of the Czech National Corpus project ======
^ Written synchronic corpora ^^^^^^
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.)) ^ characteristic features ^
| **General corpora** ||||||
| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) | 5G | ✓ | ✓ | 2010–2023 | versioned corpus, unification of all the SYN-series synchronic written corpora |
| [[en:cnk:syn2020|SYN2020]] | 100M | ✓ | ✓ | 2020 | reference representative corpus, most of the texts are from 2014--2019 |
| [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
| [[en:cnk:syn2013PUB|SYN2013PUB]] | 935M | ✓ | ✓ | 2013 | reference corpus of newspapers and magazines from 2005--2009 |
| [[en:cnk:syn2010|SYN2010]] | 100M | ✓ | ✓ | 2010 | reference representative corpus, most of the texts are from 2005--2009 |
| [[en:cnk:syn2009PUB|SYN2009PUB]] | 700M | ✓ | ✓ | 2010 | reference corpus of newspapers and magazines from 1995--2007 |
| [[en:cnk:syn2006PUB|SYN2006PUB]] | 300M | ✓ | ✓ | 2006 | reference corpus of newspapers and magazines from 1989--2004 |
| [[en:cnk:syn2005|SYN2005]] | 100M | ✓ | ✓ | 2005 | reference representative corpus, most of the texts are from 2000--2004 |
| [[en:cnk:syn2000|SYN2000]] | 100M | ✓ | ✓ | 2000 | reference representative corpus, most of the texts are from 1990--1999 |
| **Web corpora** ||||||
| [[en:cnk:online|ONLINE]] ([[en:cnk:online:gen2|2nd generation]]) | > 6G | ✓ | ✓ | 2020 | monitor corpus of the Czech internet |
| [[en:cnk:net|NET]] (version 2) | 176M | ✓ | ✓ | 2019 | corpus of semi-official internet communication |
| **Learner corpora** ||||||
| [[en:cnk:czesl-man|CzeSL-man]] | 100k | ✓ | ✓ | 2016 | non-reference learner corpus of non-native Czech speakers with manual error annotation |
| [[en:cnk:czesl-plain|CzeSL-plain]] | 2M | ✗ | ✗ | 2012 | non-reference learner corpus of non-native Czech speakers |
| [[en:cnk:czesl-sgt|CzeSL-SGT]] | 960k | ✓ | ✓ | 2014 | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
| [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] | 960k | ✓ | ✓ | 2019 | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
| [[en:cnk:skript2012|SKRIPT2012]] | 590k | ✓ | ✓ | 2013 | corpus of school essays |
| [[en:cnk:vespa_cz|VESPA_CZ]] | 500k | ✓ | ✓ | 2022 | learner corpus of written academic English by advanced speakers whose L1 is Czech |
| **Author corpora** ||||||
| [[en:cnk:capek|Capek]] | 2.3M | ✓ | ✓ | 2007 | author corpus of texts written exclusively by Karel Čapek |
| [[en:cnk:capek|Capek_uplny]] | 2.5M | ✓ | ✓ | 2007 | author corpus of texts written or co-authored by Karel Čapek |
| [[en:cnk:cep|Cep]] | 420k | ✓ | ✓ | 2015 | author corpus of prose texts written by Jan Čep |
| [[en:cnk:kh-dopisy|KH-DOPISY]] | 500k | ✗ | ✗ | 2017 | corpus of Karel Havlíček's correspondence |
| [[en:cnk:kh-noviny|KH-NOVINY]] | 1M | ✗ | ✗ | 2021 | corpus of Karel Havlíček's journalism |
| [[en:cnk:klaus|Klaus]] | 1.5M | ✓ | ✓ | 2024 | corpus of Václav Klaus' texts |
| [[en:cnk:orwell|ORWELL]] | 80k | ✓ | ✓ | 2003 | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated |
| **Specialized corpora** ||||||
| [[en:cnk:etalon|Etalon]] | 1.9M | ✓ | ✓ | 2021 | manually annotated corpus of Czech texts |
| [[en:cnk:fictree|FicTree]] | 135k | ✓ | ✓ | 2017 | manually annotated treebank of Czech fiction |
| [[en:cnk:fsc2000|FSC2000]] | 100M | ✓ | ✗ | 2004 | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
| [[en:cnk:jerome|JEROME]] | 85M | ✓ | ✓ | 2013 | monolingual comparable corpus for translation studies |
| [[en:cnk:koditex|Koditex]] | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers |
| [[en:cnk:ksk-dopisy|KSK-DOPISY]] | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990--2004 |
| [[en:cnk:ksp|KSP]] (version 2) | 37.5M | ✓ | ✓ | 2022 | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
| [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts |
| [[en:cnk:totalita|Totalita]] | 12.9M | ✓ | ✓ | 2010 | written language of the communist regime |
| [[en:cnk:veda|Věda]] | 15M | ✓ | ✓ | 2023 | corpus of scientific Czech, complement to the [[https://db.korpus.cz/search/acphrase|Phrase Bank of Academic Czech]] |
^ Spoken synchronic corpora ^^^^^^
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released ^ characteristic features ^
| **General corpora** ||||||
| [[en:cnk:orator|ORATOR]] (version 2) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription |
| [[en:cnk:ortofon|ORTOFON]] (version 3) | 2.4M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
| [[en:cnk:oral|ORAL]] (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
| [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
| [[en:cnk:oral2008|ORAL2008]] | 1M | ✗ | ✗ | 2008 | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
| [[en:cnk:oral2006|ORAL2006]] | 1M | ✗ | ✗ | 2006 | reference corpus of informal spoken Czech (speakers from Bohemia only) |
| **Specialized corpora** ||||||
| [[en:cnk:bmk|BMK]] | 490k | ✗ | ✗ | 2002 | Brno spoken corpus |
| [[en:cnk:dialekt|DIALEKT]] (version 2) | 223k | ✓ | ✓ | 2017 | reference dialectal corpus with two-layer transcription |
| [[en:cnk:lindsei_cz|LINDSEI_CZ]] | 120k | ✗ | ✗ | 2017 | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech |
| [[en:cnk:pmk|PMK]] | 675k | ✗ | ✗ | 2001 | Prague spoken corpus |
| [[en:cnk:schola2010|SCHOLA2010]] | 790k | ✗ | ✗ | 2010 | corpus of school lessons |
| [[en:cnk:speeches|SPEECHES]] | 215k | ✗ | ✗ | 2015 | corpus of presidential speeches |
| [[en:cnk:parlcorp|Parlcorp]] | 38M | ✓ | ✓ | 2015 | corpus of Czech parliamentary speeches (1993--2021) |
^ Diachronic corpora ^^^^^^
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released ^ characteristic features ^
| [[en:cnk:diakorp|DIAKORP]] (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC |
| [[en:cnk:onomos|OnomOs]] | 200k | ✓ | ✓ | 2023 | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
^ Foreign language corpora ^^^^^^
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released ^ characteristic features ^
| **Parallel corpora** ||||||
| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze16|release 16]], [[en:cnk:intercorp:verze16ud|release 16ud]]) | 5.3G | (✓) | (✓) | 2008–2024 | versioned parallel corpus for 61 languages |
| [[en:cnk:psalm77|Psalm 77]] | 10k | (✓) | (✓) | 2023 | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
| **Comparable corpora** ||||||
| [[en:cnk:aranea|Aranea]] | 1G | ✓ | ✓ | 2014 | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
| [[en:cnk:dewac|deWaC]] | 1.35G | ✓ | ✓ | 2013 | web corpus of German |
| [[en:cnk:frwac|frWaC]] | 1.35G | ✓ | ✓ | 2013 | web corpus of French |
| [[en:cnk:itwac|itWaC]] | 1.6G | ✓ | ✓ | 2013 | web corpus of Italian |
| [[en:cnk:ukwac|ukWaC]] | 1.9G | ✓ | ✓ | 2013 | web corpus of British English |
| **Specialized foreign language corpora** ||||||
| [[en:cnk:baltischebriefe|Baltische Briefe]] | 300k | ✓ | ✓ | 2024 | corpus of the German historical newspaper Baltische Briefe |
| [[en:cnk:codit|CODIT]] | 27M | ✗ | ✗ | 2021 | diachronic corpus of Italian covering a period from the 13th century until 1947 |
| [[en:cnk:dotko|DOTKO]] (version 2) | 15.5M | ✓ | ✗ | 2010 | non-reference corpus of Lower Sorbian |
| [[en:cnk:eebo|EEBO]] | 730M | ✗ | ✗ | 2015 | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
| [[en:cnk:hotko|HOTKO]] (version 2) | 36M | ✗ | ✗ | 2013 | non-reference corpus of Upper Sorbian |
| [[en:cnk:lEstRepublicain|lEstRepublicain]] | 73M | ✓ | ✓ | 2013 | corpus of the French newspaper L'Est Républicain |
| [[en:cnk:nkjp|NKJP_1M]] | 1M | ✓ | ✓ | 2018 | manually annotated one-million subcorpus of the National Corpus of Polish |
| [[en:cnk:obc|OBC]] | 24M | ✗ | ✓ | 2021 | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |