Differences

This shows you the differences between two versions of the page.

--- en:cnk:uvod [2017/06/02 14:18] – [Corpora of the Czech National Corpus project] michalkren
+++ en:cnk:uvod [2019/10/31 19:24] – alexandrrosen
@@ Line 7: / Line 7: @@
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is stated.))  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze5|version 5]]) |  3.836G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze7|version 7]]) |  4.255G |  ✓  |  ✓  |  2010  | versioned corpus, unification of all the SYN-series synchronic written corpora |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
@@ Line 18: / Line 18: @@
 | [[en:cnk:czesl-plain|CZESL-PLAIN]] |  2M |  ✗  |  ✗  |  2012  | non-reference learner corpus of non-native Czech speakers  |
 | [[en:cnk:czesl-sgt|CZESL-SGT]] |  960k |  ✓  |  ✓  |  2014  | non-reference learner corpus of non-native speakers’ Czech with automatic annotation |
+| [[en:cnk:czesl-sgt-basic|CZESL-SGT-BASIC]] |  960k |  ✓  |  ✓  |  2019  | same as CZESL-SGT except for a reduced set of metadata in the **Restrict search** section of the search interface |
+| [[en:cnk:fictree|FicTree]] |  135k |  ✓  |  ✓  |  2017  | manually annotated treebank of Czech fiction |
 | [[en:cnk:fsc2000|FSC2000]] |  100M |  ✓  |  ✗  |  2004  | modified [[en:cnk:syn2000|SYN2000]], source of the Frequency Dictionary of Czech |
-| [[JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
+| [[en:cnk:jerome|JEROME]] |  85M |  ✓  |  ✓  |  2013  | monolingual comparable corpus for translation studies |
+| [[en:cnk:koditex|Koditex]] |  10.8M |  ✓  |  ✓  |  2018  | corpus for multi-dimensional analysis of Czech registers |
 | [[en:cnk:ksk-dopisy|KSK-DOPISY]] |  800k |  ✗  |  ✗  |  2006  | transcriptions of handwritten correspondence from 1990--2004|
 | [[en:cnk:link|LINK]] |  1.8M |  ✓  |  ✓  |  2010  | non-reference corpus of linguistic texts |
@@ Line 27: / Line 30: @@
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** ||||||
-| [[cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
+| [[en:cnk:ortofon|ORTOFON]] |  1M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
-| [[cnk:oral|ORAL]] |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
+| [[en:cnk:oral|ORAL]] |  5.4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2013|ORAL2013]] |  2.8M |  ✗  |  ✗  |  2013  | reference representative corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral2008|ORAL2008]] |  1M |  ✗  |  ✗  |  2008  | reference sociolinguistically balanced corpus of informal spoken Czech (speakers from Bohemia only) |
@@ Line 34: / Line 37: @@
 | **Specialized corpora** ||||||
 | [[en:cnk:bmk|BMK]] |  490k |  ✗  |  ✗  |  2002  | Brno spoken corpus |
-| [[cnk:dialekt|DIALEKT]] |  100k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
+| [[en:cnk:dialekt|DIALEKT]] |  100k |  ✓  |  ✓  |  2017  | reference dialectal corpus with two-layer transcription |
 | [[en:cnk:lindsei_cz|LINDSEI_CZ]] |  120k |  ✗  |  ✗  |  2017  | learner corpus of spontaneous spoken English by advanced speakers whose L1 is Czech |
 | [[en:cnk:pmk|PMK]] |  675k |  ✗  |  ✗  |  2001  | Prague spoken corpus |
@@ Line 45: / Line 48: @@
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **Parallel corpora** ||||||
-| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze9|version 9]]) |  1.46G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |
+| [[en:cnk:intercorp|InterCorp]] ([[en:cnk:intercorp:verze11|version 11]]) |  1.7G |  (✓)  |  (✓)  |  2008  | versioned parallel corpus being compiled as a part of the [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|InterCorp project]] |
 | **Comparable corpora** ||||||
 | [[en:cnk:aranea|Aranea]] |  1G |  ✓  |  ✓  |  2014  | comparable web corpora for several European languages (cs, de, en, es, fi, fr, hu, it, nl, pl, pt, ru, sk, zh) |
@@ Line 54: / Line 57: @@
 | **Specialized foreign language corpora** ||||||
 | [[en:cnk:dotko|DOTKO]] |  12M |  ✗  |  ✗  |  2010  | non-reference corpus of Lower Sorbian, most of the texts are from 1848--1933 |
-| [[cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
+| [[en:cnk:eebo|EEBO]] |  730M |  ✗  |  ✗  |  2015  | English texts from the period 1475--1700, [[http://www.textcreationpartnership.org/tcp-eebo/|Early English Books Online]] |
 | [[en:cnk:hotko|HOTKO]] |  36M |  ✗  |  ✗  |  2013  | non-reference corpus of Upper Sorbian |
 | [[en:cnk:lEstRepublicain|lEstRepublicain]] |  73M |  ✓  |  ✓  |  2013  | corpus of French newspaper L'Est Républicain |
+| [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
