Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2023/10/11 17:54] – [Corpora of the Czech National Corpus project] alexandrrosen
en:cnk:uvod [2025/03/17 16:56] (current) – [Corpora of the Czech National Corpus project] michalkren
Line 7: Line 7:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze11|version 11]]) |  5G |  ✓  |  ✓  |  2010–2022  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze13|version 13]]) |  5.3G |  ✓  |  ✓  |  2010–2024  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 | | [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
Line 32: Line 32:
 | [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence | | [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence |
 | [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism | | [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism |
 +| [[en:cnk:klaus|Klaus]] |  1.5M |  ✓  |  ✓  |  2024  | corpus of Václav Klaus' texts |
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  | | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
Line 40: Line 41:
 | [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers | | [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 | | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 |
-| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
+| [[en:cnk:ksp|KSP]] (version 2) |  37.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts | | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 | [[en:cnk:totalita|Totalita]] |  12,9M |  ✓  |  ✓  |  2010  | written language of the communist regime | | [[en:cnk:totalita|Totalita]] |  12,9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
 +| [[en:cnk:veda|Věda]] |  15M |  ✓  |  ✓  |  2023  | corpus of scientific Czech, complement to the [[https://db.korpus.cz/search/acphrase|Phrase Bank of Academic Czech]] |
 ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
 | [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription | | [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
-| [[en:cnk:ortofon|ORTOFON]] (version 2) |  2.1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
+| [[en:cnk:ortofon|ORTOFON]] (version 3) |  2.4M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] (version 1) |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | [[en:cnk:oral|ORAL]] (version 1) |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 63: Line 65:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC | | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |
 +| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^ ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** |||||| | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]], [[en:cnk:intercorp:verze16|release 16]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze16|release 16]], [[en:cnk:intercorp:verze16ud|release 16ud]]) |  5.3G |  (✓)  |  (✓)  |  2008–2024  | versioned parallel corpus for 61 languages |
 | [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek | | [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** |||||| | **Comparable corpora** ||||||
Line 75: Line 78:
 | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English | | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** |||||| | **Specialized foreign language corpora** ||||||
 +| [[en:cnk:baltischebriefe|Baltische Briefe]] |  300k |  ✓  |  ✓  |  2024  | corpus of German historical newspaper Baltische Briefe |
 | [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 | | [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 |
 | [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian | | [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian |
-| [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗   2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
+| [[en:cnk:eebo|EEBO]] (version 2) |  1.3G |  ✓  |  ✓   2015  | English texts from the period 1475--1700, [[https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/|Early English Books Online]] |
 | [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian | | [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |
 | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain | | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain |
 | [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish | | [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
 | [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 | | [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |