AplikaceAplikace
Nastavení

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Last revisionBoth sides next revision
en:cnk:uvod [2021/06/30 11:35] – [Corpora of the Czech National Corpus project] michalkrenen:cnk:uvod [2023/12/29 12:24] michalkren
Line 5: Line 5:
  
 ^ <fs large>Written synchronic corpora </fs> ^^^^^^ ^ <fs large>Written synchronic corpora </fs> ^^^^^^
-^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** |||||| | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze12|version 12]]) |  5G |  ✓  |  ✓  |  2010–2023  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 | | [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts | | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
Line 17: Line 17:
 | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 | | [[en:cnk:syn2000|SYN2000]] |  100M |  ✓  |  ✓  |  2000  | reference representative corpus, most of the texts are from 1990--1999 |
 | **Web corpora** |||||| | **Web corpora** ||||||
-| [[en:cnk:online|ONLINE]] |  > 6 bil. |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |+| [[en:cnk:online|ONLINE]] ([[en:cnk:online:gen2|2nd generation]]) |  > 6G |  ✓  |  ✓  |  2020  | monitor corpus of Czech internet |
 | [[en:cnk:net|NET]] (version 2) |  176M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication | | [[en:cnk:net|NET]] (version 2) |  176M |  ✓  |  ✓  |  2019  | corpus of semi-official internet communication |
 | **Learner corpora** |||||| | **Learner corpora** ||||||
Line 25: Line 25:
 | [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface | | [[en:cnk:czesl-sgt-basic|CzeSL-SGT-basic]] |  960k |  ✓  |  ✓  |  2019  | CzeSL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays | | [[en:cnk:skript2012|SKRIPT2012]] |  590k |  ✓  |  ✓  |  2013  | corpus of school essays |
 +| [[en:cnk:vespa_cz|VESPA_CZ]] |  500k |  ✓  |  ✓  |  2022  | learner corpus of written academic English by advanced speakers, whose L1 is Czech |
 | **Author corpora** |||||| | **Author corpora** ||||||
 | [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek | | [[en:cnk:capek|Capek]] |  2.3M |  ✓  |  ✓  |  2007  | author corpus of texts written exclusively by Karel Čapek |
Line 33: Line 34:
 | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  | | [[en:cnk:orwell|ORWELL]] |  80k |  ✓  |  ✓  |  2003  | Orwell's novel [[wp>Nineteen_Eighty-Four|1984]], manually annotated  |
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
 +| [[en:cnk:etalon|Etalon]] |  1.9M |  ✓  |  ✓  |  2021  | manually annotated corpus of Czech texts |
 | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction | | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech | | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
 | [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies | | [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
-| [[en:cnk:koditex|Koditex]] |  10.8 mil. |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers | +| [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers | 
-| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|+| [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004 
 +| [[en:cnk:ksp|KSP]] |  35.5M |  ✓  |  ✓  |  2022  | corpus of contemporary Czech poetry published in books and on literary servers from 1990--2020 |
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts | | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
 +| [[en:cnk:totalita|Totalita]] |  12,9M |  ✓  |  ✓  |  2010  | written language of the communist regime |
 ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^ ^ <fs large>Spoken synchronic corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
Line 50: Line 54:
 | **Specialized corpora** |||||| | **Specialized corpora** ||||||
 | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus | | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus |
-| [[en:cnk:dialekt|DIALEKT]] |  100k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |+| [[en:cnk:dialekt|DIALEKT]] (version 2) |  223k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
 | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech | | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers, whose L1 is Czech |
 | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus | | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus |
Line 59: Line 63:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC | | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |
 +| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^ ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^ ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** |||||| | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13|version 13]]) |  1.8G |  (✓)  |  (✓)  |  2008–2020  | versioned parallel corpus for 40 languages |+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze13ud|release 13ud]], [[en:cnk:intercorp:verze15|release 15]], [[en:cnk:intercorp:verze16|release 16]]) |  5.3G |  (✓)  |  (✓)  |  2008–2023  | versioned parallel corpus for 61 languages 
 +| [[en:cnk:psalm77|Psalm 77]] |  10k |  (✓)  |  (✓)  |  2023  | parallel corpus of 11 versions of Psalm 77 in Romanian, Church Slavonic and Greek |
 | **Comparable corpora** |||||| | **Comparable corpora** ||||||
 | [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) | | [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
Line 70: Line 76:
 | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English | | [[en:cnk:ukwac|ukWaC]] |  1.9G |  ✓  |  ✓  |  2013  | web corpus of British English |
 | **Specialized foreign language corpora** |||||| | **Specialized foreign language corpora** ||||||
-| [[en:cnk:codit|CODIT]] |  27 mil. |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 | +| [[en:cnk:codit|CODIT]] |  27M |  ✗  |  ✗  |  2021  | diachronic corpus of Italian covering a period from the 13th century until 1947 | 
-| [[en:cnk:dotko|DOTKO]] |  12M |  ✗   ✗  |  2010  | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 |+| [[en:cnk:dotko|DOTKO]] (version 2) |  15.5M |  ✓   ✗  |  2010  | non-reference corpus of Lower Sorbian |
 | [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] | | [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
 | [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian | | [[en:cnk:hotko|HOTKO]] (version 2) |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |