en:cnk:koditex [2018/02/21 13:57] – [Sources of data] lukes | en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek |
---|
====== The Koditex Corpus ====== | ====== The Koditex Corpus ====== |
| |
Koditex is a 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. Its primary goal is to be as diverse and as representative of the wide range of uses of language as possible. At the topmost level, texts are classified into three modes of communication: | Koditex is a synchronic, representative and reference 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. |
| |
<WRAP right 35%> | <WRAP right 35%> |
^ <fs medium>Name</fs> ^^ <fs medium>Koditex</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>Koditex</fs> ^ |
^ Positions ^ Number of positions (tokens) | 10,880,550 | | ^ Positions ^ Number of positions (tokens) | 10,880,550 | |
^ ::: ^ Number of positions (excl. punctuation) | 9,139,930 | | ^ ::: ^ Number of positions (excl. punctuation) | 9,139,930 | |
| ^ ::: ^ Number of tokens (excl. punctuation) used in factor analysis | 9,039,137| |
^ ::: ^ Number of word forms | 509,764 | | ^ ::: ^ Number of word forms | 509,764 | |
^ ::: ^ Number of lemmas | 205,592 | | ^ ::: ^ Number of lemmas | 205,592 | |
</WRAP> | </WRAP> |
| |
  * written language (//wri//), | When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication), and to feature rich annotation: the texts were [[en:pojmy:lemma|lemmatized]], [[en:pojmy:tag|morphologically tagged]] using two different systems, and annotated for phrasemes and so-called [[http://ufal.mff.cuni.cz/nametag|named entities]]. With respect to the written/spoken distinction, Koditex is a mixed corpus. |
* spoken language (//spo//) and | |
* web-based communication (//web//). | The name //Koditex// is both an acronym of the Czech version of the phrase //**co**rpus of **di**versified **tex**ts// and a tribute to Vilém Kodýtek, author of a pioneering attempt to apply MDA to Czech based on the work of D. Biber. |
| |
The name //Koditex// is both an acronym of the Czech version of the phrase //**co**rpus of **di**versified **tex**ts// and a tribute to Vilém Kodýtek, author of a pioneering attempt to apply MDA to Czech. | |
| |
===== Corpus design ===== | ===== Corpus design ===== |
| |
Unlike other CNC corpora (e.g. [[en:cnk:syn2015|SYN2015]]), the corpus consists of text samples (called chunks). Before sampling the assembled data for material to include in the final corpus, we decided to split texts longer than 5,000 words into contiguous chunks of 2,000–5,000 words (while respecting sentence boundaries). This decision was driven by several perceived advantages, primarily that of ensuring a higher overall diversity of the corpus in terms of registers as well as genres / text types. | Unlike the CNC's other synchronic corpora (e.g. [[en:cnk:syn2015|SYN2015]]), Koditex is made up not of entire texts but of samples taken from the original texts, which are marked as ''<chunk>'' in the corpus structure. |
| |
| Before sampling the assembled data for material to include in the final corpus, we decided to split texts longer than 5,000 words into contiguous chunks of 2,000–5,000 words (while respecting sentence boundaries). This decision was driven by several perceived advantages, primarily that of ensuring a higher overall diversity of the corpus in terms of registers as well as genres / text types. |
| |
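The splitting step described above can be illustrated with a minimal sketch. It is not the procedure actually used to build Koditex, which is not documented here; the function name ''split_into_chunks'', the representation of a text as a list of tokenized sentences, and the balancing strategy (aiming for roughly equal chunks) are all assumptions made for illustration. Chunk sizes are approximate, because a cut is only ever made at a sentence boundary.

```python
import math


def split_into_chunks(sentences, max_words=5000):
    """Split a text into contiguous chunks, cutting only between sentences.

    `sentences` is a list of sentences, each a list of words (hypothetical
    input format). Texts of at most `max_words` words are returned whole.
    """
    total = sum(len(sentence) for sentence in sentences)
    if total <= max_words:
        return [sentences]

    # Smallest number of chunks that keeps the average size <= max_words.
    n_chunks = math.ceil(total / max_words)
    target = total / n_chunks  # aim for roughly equal chunk sizes

    chunks, current, seen = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        seen += len(sentence)
        # Close the chunk once the running word count reaches the next
        # multiple of the target size; the last chunk takes the remainder.
        if len(chunks) < n_chunks - 1 and seen >= target * (len(chunks) + 1):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


# A 12,000-word text made of 10-word sentences is split into three chunks.
long_text = [["w"] * 10 for _ in range(1200)]
chunk_sizes = [sum(len(s) for s in chunk) for chunk in split_into_chunks(long_text)]
```

Because the target size is derived from the total length, a text only slightly longer than 5,000 words is split into two halves of about 2,500 words each rather than one full chunk and a very short remainder.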
| At the topmost level, texts are classified into three modes of communication: |
| * written language (//wri//), |
| * spoken language (//spo//) and |
| * web-based communication (//web//). |
| |
Chunks are divided into modes (see above). Each mode is further subdivided into two or more divisions (e.g. the written mode subdivides into fiction, non-fiction, journalism and private correspondence), divisions then branch into classes of texts, aiming at roughly 200,000 words per class (subject to data availability). For the written mode, we introduced an intermediate superclass level which groups several related text classes together. | Each of the three modes is further subdivided into two or more divisions (e.g. the written mode is subdivided into fiction, non-fiction, journalism and private correspondence). Divisions then branch into classes of texts (e.g. crime novel), aiming at roughly 200,000 words per class (subject to data availability). For the written mode, we introduced an intermediate superclass level which groups several related text classes together. |
| |
Some texts had to be removed from the data set prior to performing the MDA for technical reasons. These texts are identified in the corpus by the attribute ''include="no"'' in their metadata. The table below summarizes only texts which were actually included in the MDA: | Some texts had to be removed from the data set prior to performing the MDA for technical reasons. These texts are identified in the corpus by the attribute ''include="no"'' in their metadata. The table below summarizes the composition of the Koditex corpus, taking into account only those texts which were actually included in the MDA (i.e. bearing the attribute ''include="yes"''):  |
| |
^ MODE ^ DIVISION ^ SUPERCLASS ^ CLASS ^ Tokens ^ Text chunks ^ | ^ MODE ^ DIVISION ^ SUPERCLASS ^ CLASS ^ Tokens ^ Text chunks ^ |
The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). | The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals). |
| |
^ Class ^ Translations (words) ^ Originals (words) ^ % Translations ^ | ^ Class ^ Translations (words) ^ Originals (words) ^ % translations ^ |
| LOV | 210,250 | 30,981 | 87.2% | | | LOV | 210,250 | 30,981 | 87.2% | |
| CRM | 202,921 | 37,677 | 84.3% | | | CRM | 202,921 | 37,677 | 84.3% | |
| |
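The percentages in the table above follow directly from the word counts: the share of translations is the number of translated words divided by the total for the class. A short check, using only the figures given in the table (the function name ''translation_share'' is ours, chosen for illustration):

```python
def translation_share(translated_words, original_words):
    """Return the percentage of a class's words that come from translations."""
    return 100 * translated_words / (translated_words + original_words)


# (translated words, original words) per class, as listed in the table
classes = {
    "LOV": (210_250, 30_981),
    "CRM": (202_921, 37_677),
}

shares = {
    name: round(translation_share(translated, original), 1)
    for name, (translated, original) in classes.items()
}
```

Rounded to one decimal place, this reproduces the values reported in the last column of the table.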
Several layers of annotation were added to the corpus in order to facilitate operationalization of features: | Several layers of annotation were added to the corpus in order to facilitate operationalization of features: |
| |
  * lemmatization and morphological tagging; two systems were used: the [[http://ufal.mff.cuni.cz/morphodita|MorphoDiTa]] stochastic tagger((Straková Jana, Milan Straka & Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In //Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations//, 13–18. Baltimore, MD: ACL.)) and a hybrid tagger combining stochastic and rule-based disambiguation((Spoustová, Drahomíra, Jan Hajič, Jan Votrubec, Pavel Krbec & Pavel Květoň. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In //Proceedings of the Workshop on Balto-Slavonic Natural Language Processing//, ACL 2007. 67–74; Jelínek, Tomáš. 2008. Nové značkování v Českém národním korpusu [New tagging in the Czech National Corpus]. //Naše řeč// 91(1). 13–20; Petkevič, Vladimír. 2014. Problémy automatické morfologické disambiguace češtiny [Problems of automatic morphological disambiguation of Czech]. //Naše řeč// 97(4). 194–207.)) | * lemmatization and morphological tagging; two systems were used: the [[http://ufal.mff.cuni.cz/morphodita|MorphoDiTa]] stochastic tagger((Straková Jana, Milan Straka & Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In //Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations//, 13–18. Baltimore, MD: ACL.)) and a hybrid tagger combining stochastic and rule-based disambiguation((Spoustová, Drahomíra, Jan Hajič, Jan Votrubec, Pavel Krbec & Pavel Květoň. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In //Proceedings of the Workshop on Balto-Slavonic Natural Language Processing//, ACL 2007. 67–74; Jelínek, Tomáš. 2008. Nové značkování v Českém národním korpusu [New tagging in the Czech National Corpus]. //Naše řeč// 91(1). 13–20; Petkevič, Vladimír. 2014. Problémy automatické morfologické disambiguace češtiny [Problems of automatic morphological disambiguation of Czech]. //Naše řeč// 97(4). 194–207.)) |
* phraseme annotation by the FRANTA system((Hnátková, Milena. 2002. Značkování frazémů a idiomů v Českém národním korpusu s pomocí Slovníku české frazeologie a idiomatiky [The tagging of phraseological units and idioms in the Czech National Corpus with the aid of the Dictionary of Czech phraseology and idiomatics]. //Slovo a slovesnost// 63(2). 117–126.)) | * phraseme annotation by the FRANTA system((Hnátková, Milena. 2002. Značkování frazémů a idiomů v Českém národním korpusu s pomocí Slovníku české frazeologie a idiomatiky [The tagging of phraseological units and idioms in the Czech National Corpus with the aid of the Dictionary of Czech phraseology and idiomatics]. //Slovo a slovesnost// 63(2). 117–126.)) |
* named-entity recognition using the [[http://ufal.mff.cuni.cz/nametag|NameTag tool]]((Straková Jana, Milan Straka & Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In Ivan Habernal & Václav Matoušek (eds.), //Text, Speech and Dialogue//, 68–75. Berlin & Heidelberg: Springer Verlag.)) | * named-entity recognition using the [[http://ufal.mff.cuni.cz/nametag|NameTag tool]]((Straková Jana, Milan Straka & Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In Ivan Habernal & Václav Matoušek (eds.), //Text, Speech and Dialogue//, 68–75. Berlin & Heidelberg: Springer Verlag.)) |
| |
| The following statistical models were used with MorphoDiTa and NameTag: |
| |
| * Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-1836 |
| * Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11858/00-097C-0000-0023-7D42-8 |
===== Sources of data ===== | ===== Sources of data ===== |
| |
The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https://nlp.fi.muni.cz/en/NLPCentre|NLPC at Masaryk University]], and Josef Šlerka and his team at Socialinsider, for providing raw data for the //wik// class and //mul// division, respectively. | The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http://www.ujc.cas.cz/en|Czech Language Institute]] of the Czech Academy of Sciences for providing data from the [[http://ujc.dialogy.cz/?q=en/node/80|DIALOG]] corpus, Karel Pala and Vít Baisa from the [[https://nlp.fi.muni.cz/en/NLPCentre|NLPC at Masaryk University]], and Josef Šlerka and his team at Socialinsider, for providing raw data for the //wik// class and //mul// division, respectively. |
| |
* Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[cnk:oral2013|ORAL2013]]. | The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here: |
* Benko, Vladimír. 2015. [[cnk:aranea|Araneum]] Bohemicum Maius, version 15.04. Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague. | |
* Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[cnk:speeches|SPEECHES]]. | * Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[en:cnk:oral2013|ORAL2013]]. |
* Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[cnk:pmk|PMK]]. | * Benko, Vladimír. 2015. [[en:cnk:aranea|Araneum]] Bohemicum Maius, version 15.04. ÚČNK FF UK. |
* Hladká, Zdeňka. 2002. [[cnk:bmk|BMK]]. | * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[en:cnk:speeches|SPEECHES]]. |
* Hladká, Zdeňka. 2006. [[cnk:ksk-dopisy|KSK]]. | * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[en:cnk:pmk|PMK]]. |
* Křen, Michal et al. 2015. [[cnk:syn2015|SYN2015]]. | * Hladká, Zdeňka. 2002. [[en:cnk:bmk|BMK]]. |
* The DIALOG Corpus, version 1.2. 2015. Czech Language Institute of the Czech Academy of Sciences, Prague. http://ujc.dialogy.cz | * Hladká, Zdeňka. 2006. [[en:cnk:ksk-dopisy|KSK]]. |
| * Křen, Michal et al. 2015. [[en:cnk:syn2015|SYN2015]]. |
| * Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/CLARIN ÚFAL MFF UK. [[http://hdl.handle.net/11858/00-097C-0000-0023-7D42-8]] |
| * Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/CLARIN ÚFAL MFF UK. http://hdl.handle.net/11234/1-1836 |
| * The DIALOG Corpus, version 1.2. 2015. ÚJČ AV ČR. Praha. http://ujc.dialogy.cz |
* The EUROPARL Corpus (the Proceedings of the European Parliament). http://www.europarl.eu.int/ | * The EUROPARL Corpus (the Proceedings of the European Parliament). http://www.europarl.eu.int/ |
| |
| |
<WRAP round tip 70%> | <WRAP round tip 70%> |
Zasina, Adrian J., David Lukeš, Zuzana Komrsková, Petra Poukarová & Anna Řehořková. 2018. Koditex (A corpus of diversified texts). Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague. | Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: //Koditex: A corpus of diversified texts//. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz |
</WRAP> | </WRAP> |
| |