Differences
This shows you the differences between two versions of the page.
en:cnk:koditex [2018/06/05 11:25] – [The Koditex Corpus] petrapoukarova | en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek | ||
---|---|---|---|
Line 105: | Line 105: | ||
The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | ||
- | ^ Class ^ Translations (words) ^ Originals (words) ^ % Translations | + | ^ Class ^ Translations (words) ^ Originals (words) ^ % translations |
| LOV | 210,250 | 30,981 | 87.2% | | | LOV | 210,250 | 30,981 | 87.2% | | ||
| CRM | 202,921 | 37,677 | 84.3% | | | CRM | 202,921 | 37,677 | 84.3% | | ||
Line 134: | Line 134: | ||
===== Sources of data ===== | ===== Sources of data ===== | ||
- | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https:// | + | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http:// |
The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | ||
Line 153: | Line 153: | ||
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
- | Zasina, | + | Zasina, |
</ | </ | ||
+ |