Differences
This shows you the differences between two versions of the page.
en:cnk:koditex [2018/06/03 13:56] – veronikapojarova | en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek | ||
---|---|---|---|
Line 2: | Line 2: | ||
====== The Koditex Corpus ====== | ====== The Koditex Corpus ====== | ||
- | Koditex is a 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. | + | Koditex is a synchronic, representative and reference |
<WRAP right 35%> | <WRAP right 35%> | ||
^ <fs medium> | ^ <fs medium> | ||
^ Positions ^ Number of positions (tokens) | 10,880,550 | | ^ Positions ^ Number of positions (tokens) | 10,880,550 | | ||
- | ^ ::: ^ Number of positions (excl. punctuation) | 9,139,930 | | + | ^ ::: ^ Number of positions (excl. punctuation) | 9, |
+ | ^ ::: ^ Number of tokens (excl. punctuation) used in factor analysis | 9,039,137| | ||
^ ::: ^ Number of word forms | 509,764 | | ^ ::: ^ Number of word forms | 509,764 | | ||
^ ::: ^ Number of lemmas | 205,592 | | ^ ::: ^ Number of lemmas | 205,592 | | ||
Line 17: | Line 19: | ||
</ | </ | ||
+ | When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were [[en: | ||
+ | |||
+ | The name //Koditex// is both an acronym of the Czech version of the phrase // | ||
- | The name //Koditex// is both an acronym of the Czech version of the phrase // | ||
===== Corpus design ===== | ===== Corpus design ===== | ||
- | Unlike other CNC synchronic | + | Unlike |
Before sampling the assembled data for material to include in the final corpus, we decided to split texts longer than 5,000 words into contiguous chunks of 2, | Before sampling the assembled data for material to include in the final corpus, we decided to split texts longer than 5,000 words into contiguous chunks of 2, | ||
Line 101: | Line 105: | ||
The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | ||
- | ^ Class ^ Translations (words) ^ Originals (words) ^ % Translations | + | ^ Class ^ Translations (words) ^ Originals (words) ^ % translations |
| LOV | 210,250 | 30,981 | 87.2% | | | LOV | 210,250 | 30,981 | 87.2% | | ||
| CRM | 202,921 | 37,677 | 84.3% | | | CRM | 202,921 | 37,677 | 84.3% | | ||
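The percentage column in the rows above can be cross-checked with a few lines of Python. This is only a sketch: it assumes that "% translations" means translated words divided by the sum of translated and original words in a class, which is not stated on the page but reproduces the listed values. The function name `translation_share` is made up for illustration.

```python
def translation_share(translated_words: int, original_words: int) -> float:
    """Return the share of translated words in a text class, in percent.

    Assumption: share = translations / (translations + originals).
    """
    total = translated_words + original_words
    if total == 0:
        raise ValueError("text class contains no words")
    return 100.0 * translated_words / total


# Word counts copied from the two table rows shown in this diff hunk:
# class -> (translations, originals)
rows = {
    "LOV": (210_250, 30_981),
    "CRM": (202_921, 37_677),
}

# Round to one decimal place, matching the precision used in the table.
shares = {name: round(translation_share(t, o), 1) for name, (t, o) in rows.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Under that assumption, both rows round to the figures given in the table (87.2% and 84.3%).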
Line 130: | Line 134: | ||
===== Sources of data ===== | ===== Sources of data ===== | ||
- | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https:// | + | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http:// |
The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: | ||
- | * Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[cnk: | + | * Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[en:cnk: |
- | * Benko, Vladimír. 2015. [[cnk: | + | * Benko, Vladimír. 2015. [[en:cnk: |
- | * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[cnk: | + | * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[en:cnk: |
- | * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[cnk: | + | * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[en:cnk: |
- | * Hladká, Zdeňka. 2002. [[cnk: | + | * Hladká, Zdeňka. 2002. [[en:cnk: |
- | * Hladká, Zdeňka. 2006. [[cnk: | + | * Hladká, Zdeňka. 2006. [[en:cnk: |
- | * Křen, Michal et al. 2015. [[cnk: | + | * Křen, Michal et al. 2015. [[en:cnk: |
* Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/ | * Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/ | ||
* Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/ | * Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/ | ||
Line 149: | Line 153: | ||
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
- | Zasina, | + | Zasina, |
</ | </ | ||
+ |