Differences
This shows you the differences between two versions of the page.
en:cnk:koditex [2018/02/21 14:04] – [Annotation] lukes | en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek | ||
---|---|---|---|
Line 2: | Line 2: | ||
====== The Koditex Corpus ====== | ====== The Koditex Corpus ====== | ||
- | Koditex is a 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. | + | Koditex is a synchronic, representative and reference |
<WRAP right 35%> | <WRAP right 35%> | ||
^ <fs medium> | ^ <fs medium> | ||
^ Positions ^ Number of positions (tokens) | 10,880,550 | | ^ Positions ^ Number of positions (tokens) | 10,880,550 | | ||
- | ^ ::: ^ Number of positions (excl. punctuation) | 9,139,930 | | + | ^ ::: ^ Number of positions (excl. punctuation) | 9, |
+ | ^ ::: ^ Number of tokens (excl. punctuation) used in factor analysis | 9,039,137| | ||
^ ::: ^ Number of word forms | 509,764 | | ^ ::: ^ Number of word forms | 509,764 | | ||
^ ::: ^ Number of lemmas | 205,592 | | ^ ::: ^ Number of lemmas | 205,592 | | ||
Line 17: | Line 19: | ||
</ | </ | ||
- | * written | + | When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation |
- | * spoken language (//spo//) and | + | |
- | * web-based communication (//web//). | + | The name //Koditex// is both an acronym of the Czech version of the phrase |
- | The name //Koditex// is both an acronym of the Czech version of the phrase // | ||
===== Corpus design ===== | ===== Corpus design ===== | ||
- | Unlike | + | Unlike CNC's other synchronic |
+ | |||
+ | Before sampling the assembled data for material to include in the final corpus, we decided to split texts longer than 5,000 words into contiguous chunks of 2, | ||
+ | |||
+ | At the topmost level, texts are classified into three modes of communication: | ||
+ | * written language (// | ||
+ | * spoken language (//spo//) and | ||
+ | * web-based communication (//web//). | ||
- | Chunks are divided into modes (see above). | + | Each of the three modes is further subdivided into two or more divisions (e.g. the written mode is subdivided |
- | Some texts had to be removed from the data set for technical reasons before the MDA was performed. These texts are identified in the corpus by the attribute | + | Some texts had to be removed from the data set for technical reasons before the MDA was performed. These texts are identified in the corpus by the attribute '' |
^ MODE ^ DIVISION ^ SUPERCLASS ^ CLASS ^ Tokens ^ Text chunks ^ | ^ MODE ^ DIVISION ^ SUPERCLASS ^ CLASS ^ Tokens ^ Text chunks ^ | ||
Line 97: | Line 105: | ||
Most texts in the corpus (accounting for 76% of tokens) are Czech originals rather than translations from other languages. The only exceptions are text classes in which translated material is generally common in Czech; these are listed in the table below, and all remaining classes consist entirely of Czech originals. | Most texts in the corpus (accounting for 76% of tokens) are Czech originals rather than translations from other languages. The only exceptions are text classes in which translated material is generally common in Czech; these are listed in the table below, and all remaining classes consist entirely of Czech originals. | ||
- | ^ Class ^ Translations (words) ^ Originals (words) ^ % Translations | + | ^ Class ^ Translations (words) ^ Originals (words) ^ % translations |
| LOV | 210,250 | 30,981 | 87.2% | | | LOV | 210,250 | 30,981 | 87.2% | | ||
| CRM | 202,921 | 37,677 | 84.3% | | | CRM | 202,921 | 37,677 | 84.3% | | ||
Line 126: | Line 134: | ||
===== Sources of data ===== | ===== Sources of data ===== | ||
- | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https:// | + | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http:// |
- | | + | The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: |
- | * Benko, Vladimír. 2015. [[cnk: | + | |
- | * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[cnk: | + | |
- | * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[cnk: | + | * Benko, Vladimír. 2015. [[en:cnk: |
- | * Hladká, Zdeňka. 2002. [[cnk: | + | * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[en:cnk: |
- | * Hladká, Zdeňka. 2006. [[cnk: | + | * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[en:cnk: |
- | * Křen, Michal et al. 2015. [[cnk: | + | * Hladká, Zdeňka. 2002. [[en:cnk: |
- | * The DIALOG Corpus, version 1.2. 2015. Czech Language Institute of the Czech Academy of Sciences, Prague. http:// | + | * Hladká, Zdeňka. 2006. [[en:cnk: |
+ | * Křen, Michal et al. 2015. [[en:cnk: | ||
+ | * Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/ | ||
+ | * Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/ | ||
+ | * The DIALOG Corpus, version 1.2. 2015. ÚJČ AV ČR. Praha. http:// | ||
* The EUROPARL Corpus (the Proceedings of the European Parliament). http:// | * The EUROPARL Corpus (the Proceedings of the European Parliament). http:// | ||
Line 141: | Line 153: | ||
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
- | Zasina, | + | Zasina, |
</ | </ | ||
+ |
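As a quick sanity check on the translations table above, the "% translations" column can be recomputed from the two word counts in each row. The sketch below is only illustrative: the function name `translation_share` and the `rows` dictionary are hypothetical and not part of any Koditex or CNC tooling. It assumes the percentage is the translated words divided by the sum of translated and original words, rounded to one decimal place.

```python
def translation_share(translated_words: int, original_words: int) -> float:
    """Return the share of translated words as a percentage, rounded to one decimal.

    Assumption: the share is translations / (translations + originals).
    """
    total = translated_words + original_words
    if total == 0:
        raise ValueError("class contains no words")
    return round(100 * translated_words / total, 1)


# Word counts copied from the LOV and CRM rows of the translations table.
rows = {
    "LOV": (210_250, 30_981),
    "CRM": (202_921, 37_677),
}

for text_class, (translated, original) in rows.items():
    print(f"{text_class}: {translation_share(translated, original)}% translations")
```

Under that assumption, both rows reproduce the percentages given in the table (87.2% for LOV and 84.3% for CRM).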