
Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2024/05/24 18:46] – [Corpora of the Czech National Corpus project] alexandrrosen
en:cnk:uvod [2024/11/18 15:51] (current) michalskrabal
Line 32: Line 32:
 | [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence |
 | [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism |
+| [[en:cnk:klaus|Klaus]] |  1.5M |  ✓  |  ✓  |  2024  | corpus of Václav Klaus' texts |
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** ||||||
Line 40: Line 41:
 | [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 |
-| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
+| [[en:cnk:ksp|KSP]] (version 2) |  37.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 | [[en:cnk:totalita|Totalita]] |  12,9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
Line 48: Line 49:
 | **General corpora** ||||||
 | [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
-| [[en:cnk:ortofon|ORTOFON]] (version 2) |  2.1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
+| [[en:cnk:ortofon|ORTOFON]] (version 3) |  2.4M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] (version 1) |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 68: Line 69:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze16|release 16]], [[en:cnk:intercorp:verze16ud|release 16ud]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze16|release 16]], [[en:cnk:intercorp:verze16ud|release 16ud]]) |  5.3G |  (✓)  |  (✓)  |  2008–2024  | versioned parallel corpus for 61 languages |
 | [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** ||||||
Line 77: Line 78:
 | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** ||||||
+| [[en:cnk:baltischebriefe|Baltische Briefe]] |  300k |  ✓  |  ✓  |  2024  | corpus of German historical newspaper Baltische Briefe |
 | [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 |
 | [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian |