en:cnk:uvod [2023/09/27 13:35] – [Corpora of the Czech National Corpus project] michalkren | en:cnk:uvod [2024/11/18 15:51] (current) – michalskrabal |
---|
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.)) ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.)) ^ characteristic features ^ |
| **General corpora** |||||| | | **General corpora** |||||| |
| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze11|version 11]]) | 5G | ✓ | ✓ | 2010–2022 | versioned corpus, unification of all the SYN-series synchronic written corpora | | | [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) | 5G | ✓ | ✓ | 2010–2023 | versioned corpus, unification of all the SYN-series synchronic written corpora | |
| [[en:cnk:syn2020|SYN2020]] | 100M | ✓ | ✓ | 2020 | reference representative corpus, most of the texts are from 2014--2019 | | | [[en:cnk:syn2020|SYN2020]] | 100M | ✓ | ✓ | 2020 | reference representative corpus, most of the texts are from 2014--2019 | |
| [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | | [[en:cnk:syn2015|SYN2015]] | 100M | ✓ | ✓ | 2015 | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | |
| [[en:cnk:kh-dopisy|KH-DOPISY]] | 500k | ✗ | ✗ | 2017 | corpus of Karel Havlíček's correspondence | | | [[en:cnk:kh-dopisy|KH-DOPISY]] | 500k | ✗ | ✗ | 2017 | corpus of Karel Havlíček's correspondence | |
| [[en:cnk:kh-noviny|KH-NOVINY]] | 1M | ✗ | ✗ | 2021 | corpus of Karel Havlíček's journalism | | | [[en:cnk:kh-noviny|KH-NOVINY]] | 1M | ✗ | ✗ | 2021 | corpus of Karel Havlíček's journalism | |
| | [[en:cnk:klaus|Klaus]] | 1.5M | ✓ | ✓ | 2024 | corpus of Václav Klaus' texts | |
| [[en:cnk:orwell|ORWELL]] | 80k | ✓ | ✓ | 2003 | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated | | | [[en:cnk:orwell|ORWELL]] | 80k | ✓ | ✓ | 2003 | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated | |
| **Specialized corpora** |||||| | | **Specialized corpora** |||||| |
| [[en:cnk:koditex|Koditex]] | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers | | | [[en:cnk:koditex|Koditex]] | 10.8M | ✓ | ✓ | 2018 | corpus for multi-dimensional analysis of Czech registers | |
| [[en:cnk:ksk-dopisy|KSK-DOPISY]] | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990--2004 | | | [[en:cnk:ksk-dopisy|KSK-DOPISY]] | 800k | ✗ | ✗ | 2006 | transcriptions of handwritten correspondence from 1990--2004 | |
| [[en:cnk:ksp|KSP]] | 35.5M | ✓ | ✓ | 2022 | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 | | | [[en:cnk:ksp|KSP]] (version 2) | 37.5M | ✓ | ✓ | 2022 | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 | |
| [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | | | [[en:cnk:link|LINK]] | 1.8M | ✓ | ✓ | 2010 | non-reference corpus of linguistic texts | |
| [[en:cnk:totalita|Totalita]] | 12.9M | ✓ | ✓ | 2010 | written language of the communist regime | | | [[en:cnk:totalita|Totalita]] | 12.9M | ✓ | ✓ | 2010 | written language of the communist regime | |
| | [[en:cnk:veda|Věda]] | 15M | ✓ | ✓ | 2023 | corpus of scientific Czech, complement to the [[https://db.korpus.cz/search/acphrase|Phrase Bank of Academic Czech]] | |
^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ | ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **General corpora** |||||| | | **General corpora** |||||| |
| [[en:cnk:orator|ORATOR]] (version 2) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | | | [[en:cnk:orator|ORATOR]] (version 2) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | |
| [[en:cnk:ortofon|ORTOFON]] (version 2) | 2.1M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:ortofon|ORTOFON]] (version 3) | 2.4M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral|ORAL]] (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral|ORAL]] (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
| [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral2013|ORAL2013]] | 2.8M | ✗ | ✗ | 2013 | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| [[en:cnk:diakorp|DIAKORP]] (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC | | | [[en:cnk:diakorp|DIAKORP]] (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC | |
| | [[en:cnk:onomos|OnomOs]] | 200k | ✓ | ✓ | 2023 | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation | |
^ <fs large>Foreign language corpora</fs> ^^^^^^ | ^ <fs large>Foreign language corpora</fs> ^^^^^^ |
^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| **Parallel corpora** |||||| | | **Parallel corpora** |||||| |
| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]]) | 1.8G | (✓) | (✓) | 2008–2022 | versioned parallel corpus for 41 languages | | | [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze16|release 16]], [[en:cnk:intercorp:verze16ud|release 16ud]]) | 5.3G | (✓) | (✓) | 2008–2024 | versioned parallel corpus for 61 languages | |
| [[en:cnk:psalm77|Psalm 77]] | 10k | (✓) | (✓) | 2023 | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek | | | [[en:cnk:psalm77|Psalm 77]] | 10k | (✓) | (✓) | 2023 | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek | |
| **Comparable corpora** |||||| | | **Comparable corpora** |||||| |
| [[en:cnk:ukwac|ukWaC]] | 1.9G | ✓ | ✓ | 2013 | web corpus of British English | | | [[en:cnk:ukwac|ukWaC]] | 1.9G | ✓ | ✓ | 2013 | web corpus of British English | |
| **Specialized foreign language corpora** |||||| | | **Specialized foreign language corpora** |||||| |
| | [[en:cnk:baltischebriefe|Baltische Briefe]] | 300k | ✓ | ✓ | 2024 | corpus of the German historical newspaper Baltische Briefe | |
| [[en:cnk:codit|CODIT]] | 27M | ✗ | ✗ | 2021 | diachronic corpus of Italian covering a period from the 13th century until 1947 | | | [[en:cnk:codit|CODIT]] | 27M | ✗ | ✗ | 2021 | diachronic corpus of Italian covering a period from the 13th century until 1947 | |
| [[en:cnk:dotko|DOTKO]] (version 2) | 15.5M | ✓ | ✗ | 2010 | non-reference corpus of Lower Sorbian | | | [[en:cnk:dotko|DOTKO]] (version 2) | 15.5M | ✓ | ✗ | 2010 | non-reference corpus of Lower Sorbian | |