
Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2017/06/02 14:18] – [Corpora of the Czech National Corpus project] Michal Křen
en:cnk:uvod [2024/02/29 21:00] (current) Michal Křen
Line 5: Line 5:
  
 ^ <fs large>Written synchronic corpora </fs> ^^^^^^ ^ <fs large>Written synchronic corpora </fs> ^^^^^^
-^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze5|version 5]]) |  3.836G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) |  5G |  ✓  |  ✓  |  2010–2023  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 +| [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 | | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 15: Line 16:
 | [[en:cnk:syn2005|SYN2005]] |  100M |  ✓  |  ✓  |  2005  | reference representative corpus, most of the texts are from 2000--2004  | | [[en:cnk:syn2005|SYN2005]] |  100M |  ✓  |  ✓  |  2005  | reference representative corpus, most of the texts are from 2000--2004  |
 | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 | | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 +| **Web corpora** ||||||
 +| [[en:cnk:online|ONLINE]] ([[en:cnk:online:gen2|2nd generation]]) |  > 6G |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |
 +| [[en:cnk:net|NET]] (version 2) |  176M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 +| **Learner corpora** ||||||
 +| [[en:cnk:czesl-man|CzeSL-man]] |  100k |  ✓  |  ✓  |  2016  | non-reference learner corpus of non-native Czech speakers with manual error annotation  |
 +| [[en:cnk:czesl-plain|CzeSL-plain]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers  |
 +| [[en:cnk:czesl-sgt|CzeSL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
 +| [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 +| [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
 +| [[en:cnk:vespa_cz|VESPA_CZ]] |  500k |  ✓  |  ✓  |  2022  | learner corpus of written academic English by advanced speakers, whose L1 is Czech |
 +| **Author corpora** ||||||
 +| [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
 +| [[en:cnk:capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek |
 +| [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prose texts written by Jan Čep |
 +| [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence |
 +| [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism |
 +| [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
-| [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers  |+| [[en:cnk:etalon|Etalon]] |  1.9M |  ✓  |  ✓  |  2021  | manually annotated corpus of Czech texts |
-| [[en:cnk:czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |+| [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
-| [[JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies | +| [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
-| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 |+| [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
 +| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 |
 +| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts | | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
-| [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |+| [[en:cnk:totalita|Totalita]] |  12.9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
-| [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |+| [[en:cnk:veda|Věda]] |  15M |  ✓  |  ✓  |  2023  | corpus of scientific Czech, complement to the [[https://db.korpus.cz/search/acphrase|Phrase Bank of Academic Czech]] |
 ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | +| [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription | 
-| [[en:cnk:oral|ORAL]] |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |+| [[en:cnk:ortofon|ORTOFON]] (version 2) |  2.1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | 
 +| [[en:cnk:oral|ORAL]] (version 1) |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2008|ORAL2008]] |  1M |  ✗  |  ✗  |  2008  | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | | [[en:cnk:oral2008|ORAL2008]] |  1M |  ✗  |  ✗  |  2008  | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
Line 34: Line 55:
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
 | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus | | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus |
-| [[en:cnk:dialekt|DIALEKT]] |  100k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |+| [[en:cnk:dialekt|DIALEKT]] (version 2) |  223k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
 | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech | | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech |
 | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus | | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus |
 | [[en:cnk:schola2010|SCHOLA2010]] |  790k |  ✗  |  ✗  |  2010  | corpus of school lessons | | [[en:cnk:schola2010|SCHOLA2010]] |  790k |  ✗  |  ✗  |  2010  | corpus of school lessons |
 | [[en:cnk:speeches|SPEECHES]] |  215k |  ✗  |  ✗  |  2015  | corpus of presidential speeches | | [[en:cnk:speeches|SPEECHES]] |  215k |  ✗  |  ✗  |  2015  | corpus of presidential speeches |
 +| [[en:cnk:parlcorp|Parlcorp]] |  38M |  ✓  |  ✓  |  2015  | corpus of Czech parliamentary speeches (1993--2021) |
 ^ <fs large>Diachronic corpora</fs> ^^^^^^ ^ <fs large>Diachronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC | | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |
 +| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^ ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** |||||| | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze9|version 9]]) |  1.46G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]], [[en:cnk:intercorp:verze16|release 16]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages | 
 +| [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** |||||| | **Comparable corpora** ||||||
-| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |+| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
 | [[en:cnk:dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German | | [[en:cnk:dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German |
 | [[en:cnk:frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French | | [[en:cnk:frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French |
Line 53: Line 77:
 | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English | | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** |||||| | **Specialized foreign language corpora** ||||||
-| [[en:cnk:dotko|DOTKO]] |  12M |  ✗  |  ✗  |  2010  | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 |+| [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 | 
-| [[cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475-1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | +| [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian | 
-| [[en:cnk:hotko|HOTKO]] |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |+| [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | 
 +| [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |
 | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain | | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain |
 +| [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
 +| [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |