Differences

This shows you the differences between two versions of the page.

--- en:cnk:uvod [2021/02/17 21:44] michalkren
+++ en:cnk:uvod [2023/12/29 12:24] michalkren
Line 5: Line 5:
  
 ^ <fs large>Written synchronic corpora </fs> ^^^^^^
-^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^
+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) |  5G |  ✓  |  ✓  |  2010–2023  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
Line 17: Line 17:
 | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 | **Web corpora** ||||||
-| [[en:cnk:online|ONLINE]] |  > 6 bil. |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |
+| [[en:cnk:online|ONLINE]] ([[en:cnk:online:gen2|2nd generation]]) |  > 6G |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |
 | [[en:cnk:net|NET]] (version 2) |  176M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 | **Learner corpora** ||||||
Line 25: Line 25:
 | [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
+| [[en:cnk:vespa_cz|VESPA_CZ]] |  500k |  ✓  |  ✓  |  2022  | learner corpus of written academic English by advanced speakers, whose L1 is Czech |
 | **Author corpora** ||||||
 | [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
 | [[en:cnk:capek|Capek_uplny]] |  2.5M |  ✓  |  ✓  |  2007  | author corpus of texts written or co-authored by Karel Čapek |
 | [[en:cnk:cep|Cep]] |  420k |  ✓  |  ✓  |  2015  | author corpus of prosaic texts written by Jan Čep |
-| [[cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence |
+| [[en:cnk:kh-dopisy|KH-DOPISY]] |  500k |  ✗  |  ✗  |  2017  | corpus of Karel Havlíček's correspondence |
+| [[en:cnk:kh-noviny|KH-NOVINY]] |  1M |  ✗  |  ✗  |  2021  | corpus of Karel Havlíček's journalism |
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** ||||||
+| [[en:cnk:etalon|Etalon]] |  1.9M |  ✓  |  ✓  |  2021  | manually annotated corpus of Czech texts |
 | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
 | [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
-| [[en:cnk:koditex|Koditex]] |  10.8 mil. |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
-| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
+| [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
+| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 |
+| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
+| [[en:cnk:totalita|Totalita]] |  12,9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
 ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
Line 49: Line 54:
 | **Specialized corpora** ||||||
 | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus |
-| [[en:cnk:dialekt|DIALEKT]] |  100k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
+| [[en:cnk:dialekt|DIALEKT]] (version 2) |  223k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
 | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech |
 | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus |
 | [[en:cnk:schola2010|SCHOLA2010]] |  790k |  ✗  |  ✗  |  2010  | corpus of school lessons |
 | [[en:cnk:speeches|SPEECHES]] |  215k |  ✗  |  ✗  |  2015  | corpus of presidential speeches |
+| [[en:cnk:parlcorp|Parlcorp]] |  38M |  ✓  |  ✓  |  2015  | corpus of Czech parliamentary speeches (1993-2021) |
 ^ <fs large>Diachronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |
+| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13|version 13]]) |  1.8G |  (✓)  |  (✓)  |  2008–2020  | versioned parallel corpus for 40 languages |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]], [[en:cnk:intercorp:verze16|release 16]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages |
+| [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** ||||||
 | [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
Line 68: Line 76:
 | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** ||||||
-| [[en:cnk:dotko|DOTKO]] |  12M |  ✗  |  ✗  |  2010  | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 |
+| [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 |
+| [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓  |  ✗  |  2010  | non-reference corpus of Lower Sorbian |
 | [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
-| [[en:cnk:hotko|HOTKO]] |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |
+| [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |
 | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain |
 | [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
 | [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |