
Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2016/12/16 15:55] – distinguishing balanced/representative according to the Czech version, ok? michalskrabal
en:cnk:uvod [2023/12/29 12:24] michalkren
Line 5: Line 5:
  
 ^ <fs large>Written synchronic corpora </fs> ^^^^^^ ^ <fs large>Written synchronic corpora </fs> ^^^^^^
-^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze4|version 4]]) |  3,626 bil. |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora | +| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) |  5G |  ✓  |  ✓  |  2010–2023  | versioned corpus, unification of all the SYN-series synchronic written corpora 
-| [[en:cnk:syn2015|SYN2015]] |  100 mil. |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | +| [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 
-| [[en:cnk:syn2013PUB|SYN2013PUB]] |  935 mil. |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 | +| [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | 
-| [[en:cnk:syn2010|SYN2010]] |  100 mil. |  ✓  |  ✓  |  2010  | reference representative corpus, most of the texts are from 2005--2009 | +| [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 | 
-| [[en:cnk:syn2009PUB|SYN2009PUB]] |  700 mil. |  ✓  |  ✓  |  2010  | reference corpus of newspapers and magazines from 1995--2007 | +| [[en:cnk:syn2010|SYN2010]] |  100M |  ✓  |  ✓  |  2010  | reference representative corpus, most of the texts are from 2005--2009 | 
-| [[en:cnk:syn2006PUB|SYN2006PUB]] |  300 mil. |  ✓  |  ✓  |  2006  | reference corpus of newspapers and magazines from 1989--2004| +| [[en:cnk:syn2009PUB|SYN2009PUB]] |  700M |  ✓  |  ✓  |  2010  | reference corpus of newspapers and magazines from 1995--2007 | 
-| [[en:cnk:syn2005|SYN2005]] |  100 mil. |  ✓  |  ✓  |  2005  | reference representative corpus, most of the texts are from 2000--2004 +| [[en:cnk:syn2006PUB|SYN2006PUB]] |  300M |  ✓  |  ✓  |  2006  | reference corpus of newspapers and magazines from 1989--2004| 
-| [[en:cnk:syn2000|SYN2000]] |  100 mil. |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |+| [[en:cnk:syn2005|SYN2005]] |  100M |  ✓  |  ✓  |  2005  | reference representative corpus, most of the texts are from 2000--2004 
 +| [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 
 +| **Web corpora** |||||| 
 +| [[en:cnk:online|ONLINE]] ([[en:cnk:online:gen2|2nd generation]]) |  > 6G |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet | 
 +| [[en:cnk:net|NET]] (version 2) |  176M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication | 
 +| **Learner corpora** |||||| 
 +| [[en:cnk:czesl-man|CzeSL-man]] |  100k |  ✓  |  ✓  |  2016  | non-reference learner corpus of non-native Czech speakers with manual error annotation 
 +| [[en:cnk:czesl-plain|CzeSL-plain]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers 
 +| [[en:cnk:czesl-sgt|CzeSL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation | 
 +| [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface | 
 +| [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays | 
 +| [[en:cnk:vespa_cz|VESPA_CZ]] |  500k |  ✓  |  ✓  |  2022  | learner corpus of written academic English by advanced speakers, whose L1 is Czech | 
 +| **Author corpora** |||||| 
 +| [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek | 
 +| [[en:cnk:capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek | 
 +| [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prosaic texts written by Jan Čep | 
 +| [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence | 
 +| [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism | 
 +| [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
-| [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2 mil. |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers | +| [[en:cnk:etalon|Etalon]] |  1.9M |  ✓  |  ✓  |  2021  | manually annotated corpus of Czech texts |
-| [[en:cnk:czesl-sgt|CZESL-SGT]] |  960 000 |  ✓  |  ✓  |  2014  | non-reference corpus of non-native speakers’ Czech with automatic annotation | +| [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
-| [[en:cnk:fsc2000|FSC2000]] |  100 mil. |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | +| [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | 
-| [[JEROME]] |  85 mil. |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies | +| [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies 
-| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800 000 |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004| +| [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers 
-| [[en:cnk:link|LINK]] |  1,8 mil. |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts | +| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 | 
-| [[en:cnk:orwell|ORWELL]] |  80 000 |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  | +| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
-| [[en:cnk:skript2012|SKRIPT2012]] |  590 000 |  ✓  |  ✓  |  2013  | corpus of school essays |+| [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 +| [[en:cnk:totalita|Totalita]] |  12.9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
 ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:oral2013|ORAL2013]] |  2,8 mil. |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech  +| [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription | 
-| [[en:cnk:oral2008|ORAL2008]] |  1 mil. |  ✗  |  ✗  |  2008  | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | +| [[en:cnk:ortofon|ORTOFON]] (version 2) |  2.1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | 
-| [[en:cnk:oral2006|ORAL2006]] |  1 mil. |  ✗  |  ✗  |  2006  | reference corpus of informal spoken Czech (speakers from Bohemia only) |+| [[en:cnk:oral|ORAL]] (version 1) |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | 
 +| [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) 
 +| [[en:cnk:oral2008|ORAL2008]] |  1M |  ✗  |  ✗  |  2008  | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) | 
 +| [[en:cnk:oral2006|ORAL2006]] |  1M |  ✗  |  ✗  |  2006  | reference corpus of informal spoken Czech (speakers from Bohemia only) |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
-| [[en:cnk:bmk|BMK]] |  490 000 |  ✗  |  ✗  |  2002  | Brno spoken corpus | +| [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus 
-| [[en:cnk:pmk|PMK]] |  675 000 |  ✗  |  ✗  |  2001  | Prague spoken corpus | +| [[en:cnk:dialekt|DIALEKT]] (version 2) |  223k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription | 
-| [[en:cnk:schola2010|SCHOLA2010]] |  790 000 |  ✗  |  ✗  |  2010  | corpus of school lessons | +| [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech 
-| [[en:cnk:speeches|SPEECHES]] |  215 000 |  ✗  |  ✗  |  2015  | corpus of presidential speeches |+| [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus | 
 +| [[en:cnk:schola2010|SCHOLA2010]] |  790k |  ✗  |  ✗  |  2010  | corpus of school lessons | 
 +| [[en:cnk:speeches|SPEECHES]] |  215k |  ✗  |  ✗  |  2015  | corpus of presidential speeches 
 +| [[en:cnk:parlcorp|Parlcorp]] |  38M |  ✓  |  ✓  |  2015  | corpus of Czech parliamentary speeches (1993-2021) |
 ^ <fs large>Diachronic corpora</fs> ^^^^^^ ^ <fs large>Diachronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
-| [[en:cnk:diakorp|DIAKORP]] (version 6) |  3,4 mil. |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |+| [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC 
 +| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^ ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** |||||| | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze9|version 9]]) |  1,46 mil. |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]], [[en:cnk:intercorp:verze16|release 16]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages | 
 +| [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** |||||| | **Comparable corpora** ||||||
-| [[en:cnk:aranea|Aranea]] |  1 000 mil. |  ✓  |  ✓  |  2014  |  comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | +| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | 
-| [[en:cnk:dewac|deWaC]] |  1 350 mil. |  ✓  |  ✓  |  2013  | web corpus of German | +| [[en:cnk:dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German | 
-| [[en:cnk:frwac|frWaC]] |  1 350 mil. |  ✓  |  ✓  |  2013  | web corpus of French | +| [[en:cnk:frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French | 
-| [[en:cnk:itwac|itWaC]] |  1 600 mil. |  ✓  |  ✓  |  2013  | web corpus of Italian | +| [[en:cnk:itwac|itWaC]] |  1.6G |  ✓  |  ✓  |  2013  | web corpus of Italian | 
-| [[en:cnk:ukwac|ukWaC]] |  1 900 mil. |  ✓  |  ✓  |  2013  | web corpus of British English |+| [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** |||||| | **Specialized foreign language corpora** ||||||
-| [[en:cnk:dotko|DOTKO]] |  12 mil. |  ✗  |  ✗  |  2010  | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 | +| [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 | 
-| [[cnk:eebo|EEBO]] |  730 mil. |  ✗  |  ✗  |  2015  | English texts from the period 1475-1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | +| [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian | 
-| [[en:cnk:hotko|HOTKO]] |  36 mil. |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian | +| [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | 
-| [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73 mil. |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain |+| [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian | 
 +| [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain 
 +| [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish | 
 +| [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |