Differences
This shows you the differences between two versions of the page.
en:cnk:koditex [2018/06/04 19:26] – [Sources of data] veronikapojarova | en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek | ||
---|---|---|---|
Line 2: | Line 2: | ||
====== The Koditex Corpus ====== | ====== The Koditex Corpus ====== | ||
- | Koditex is a 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. | + | Koditex is a synchronic, representative and reference |
Line 8: | Line 8: | ||
^ <fs medium> | ^ <fs medium> | ||
^ Positions ^ Number of positions (tokens) | 10,880,550 | | ^ Positions ^ Number of positions (tokens) | 10,880,550 | | ||
- | ^ ::: ^ Number of positions (excl. punctuation) | 9,139,930 | | + | ^ ::: ^ Number of positions (excl. punctuation) | 9, |
+ | ^ ::: ^ Number of tokens (excl. punctuation) used in factor analysis | 9,039,137| | ||
^ ::: ^ Number of word forms | 509,764 | | ^ ::: ^ Number of word forms | 509,764 | | ||
^ ::: ^ Number of lemmas | 205,592 | | ^ ::: ^ Number of lemmas | 205,592 | | ||
Line 18: | Line 19: | ||
</ | </ | ||
- | When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were [[en: | + | When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were [[en: |
The name //Koditex// is both an acronym of the Czech version of the phrase // | The name //Koditex// is both an acronym of the Czech version of the phrase // | ||
Line 104: | Line 105: | ||
The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | ||
- | ^ Class ^ Translations (words) ^ Originals (words) ^ % Translations | + | ^ Class ^ Translations (words) ^ Originals (words) ^ % translations |
| LOV | 210,250 | 30,981 | 87.2% | | | LOV | 210,250 | 30,981 | 87.2% | | ||
| CRM | 202,921 | 37,677 | 84.3% | | | CRM | 202,921 | 37,677 | 84.3% | | ||
Line 133: | Line 134: | ||
===== Sources of data ===== | ===== Sources of data ===== | ||
- | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https:// | + | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http:// |
The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | ||
Line 152: | Line 153: | ||
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
- | Zasina, | + | Zasina, |
</ | </ | ||
+ |