
Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2018/11/07 15:30] – [Corpora of the Czech National Corpus project] michalkren
en:cnk:uvod [2019/12/20 15:28] – [Corpora of the Czech National Corpus project] michalkren
Line 7: Line 7:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze6|version 6]]) |  4.033G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze8|version 8]]) |  4.5G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 18: Line 18:
 | [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers  |
 | [[en:cnk:czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
+| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | CZESL-SGT with a reduced set of metadata in the Restrict search section of the search interface |
 | [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
 | [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
+| [[en:cnk:koditex|Koditex]] |  10.8 mil. |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
Line 28: Line 30:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** ||||||
+| [[en:cnk:orator|ORATOR]] |  580k |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
 | [[en:cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 46: Line 49:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze11|version 11]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze12|version 12]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus for 40 languages |
 | **Comparable corpora** ||||||
-| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
+| [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
 | [[en:cnk:dewac|deWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of German |
 | [[en:cnk:frwac|frWaC]] |  1.35G |  ✓  |  ✓  |  2013  | web corpus of French |