Differences

This shows you the differences between two versions of the page.

Previous revision: en:cnk:koditex [2018/06/04 19:24] – [The Koditex Corpus] veronikapojarova
Current revision: en:cnk:koditex [2018/11/01 16:15] (current) – [How to cite Koditex] vaclavcvrcek
Line 2: Line 2:
 ====== The Koditex Corpus ======
 
-Koditex is a 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.
+Koditex is a synchronic, representative and reference 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.
  
  
Line 8: Line 8:
 ^ <fs medium>Name</fs> ^^ <fs medium>Koditex</fs> ^
 ^ Positions ^ Number of positions (tokens) |  10,880,550 |
-^ ::: ^ Number of positions (excl. punctuation) |  9,139,930 |
+^ ::: ^ Number of positions (excl. punctuation) |  9,139,930 |
+^ ::: ^ Number of tokens (excl. punctuation) used in factor analysis |  9,039,137|
 ^ ::: ^ Number of word forms |  509,764 |
 ^ ::: ^ Number of lemmas |  205,592 |
Line 18: Line 19:
 </WRAP>
 
-When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were [[en:pojmy:lemma|lemmatized]], [[en:pojmy:tag|morphologically tagged]] using two different systems, and furthermore they were annotated for phrasemes and so-called [[http://ufal.mff.cuni.cz/nametag|named entities]]). As far as writtenness is concerned, the Koditex is a mixed corpus, with some of its other attributes including: synchronic, representative and reference, i.e. unchanging.
+When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were [[en:pojmy:lemma|lemmatized]], [[en:pojmy:tag|morphologically tagged]] using two different systems, and furthermore they were annotated for phrasemes and so-called [[http://ufal.mff.cuni.cz/nametag|named entities]]). As far as writtenness and spokenness are concerned, the Koditex is a mixed corpus.
 
 The name //Koditex// is both an acronym of the Czech version of the phrase //**co**rpus of **di**versified **tex**ts// and a tribute to Vilém Kodýtek, author of a pioneering attempt to apply MDA to Czech based on the work of D. Biber.
Line 104: Line 105:
 The majority of texts (accounting for 76% of tokens) included in the corpus are Czech originals (not translations from other languages). The only exceptions are text classes where translated material is common in Czech in general, listed in the table below (the rest of the classes are 100% Czech originals).
 
-^ Class ^ Translations (words) ^ Originals (words) ^ % Translations ^
+^ Class ^ Translations (words) ^ Originals (words) ^ % translations ^
 | LOV |  210,250 |  30,981 |  87.2% |
 | CRM |  202,921 |  37,677 |  84.3% |
Line 133: Line 134:
 ===== Sources of data =====
 
-The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Karel Pala and Vít Baisa from the [[https://nlp.fi.muni.cz/en/NLPCentre|NLPC at Masaryk University]], and Josef Šlerka and his team at Socialinsider, for providing raw data for the //wik// class and //mul// division, respectively.
+The vast majority of the material in the Koditex corpus draws on the resources of the Czech National Corpus (CNC); types of language data which are not collected by the CNC were acquired from other research centers. We would also like to thank Martin Prošek and Petr Kaderka from the [[http://www.ujc.cas.cz/en|Czech Language Institute]] of the Czech Academy of Sciences for providing data from the [[http://ujc.dialogy.cz/?q=en/node/80|DIALOG]] corpus, Karel Pala and Vít Baisa from the [[https://nlp.fi.muni.cz/en/NLPCentre|NLPC at Masaryk University]], and Josef Šlerka and his team at Socialinsider, for providing raw data for the //wik// class and //mul// division, respectively.
 
 The Koditex corpus was created by sampling various sources and using a number of tools, all of which are cited here:
 
-  * Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[cnk:oral2013|ORAL2013]].
+  * Benešová, Lucie, Michal Křen & Martina Waclawičová. 2013. [[en:cnk:oral2013|ORAL2013]].
-  * Benko, Vladimír. 2015. [[cnk:aranea|Araneum]] Bohemicum Maius, version 15.04. ÚČNK FF UK.
+  * Benko, Vladimír. 2015. [[en:cnk:aranea|Araneum]] Bohemicum Maius, version 15.04. ÚČNK FF UK.
-  * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[cnk:speeches|SPEECHES]].
+  * Cvrček, Václav, Petr Truneček & Václav Horký. 2015. [[en:cnk:speeches|SPEECHES]].
-  * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[cnk:pmk|PMK]].
+  * Čermák, František, Ana Adamovičová & Jiří Pešička. 2001. [[en:cnk:pmk|PMK]].
-  * Hladká, Zdeňka. 2002. [[cnk:bmk|BMK]].
+  * Hladká, Zdeňka. 2002. [[en:cnk:bmk|BMK]].
-  * Hladká, Zdeňka. 2006. [[cnk:ksk-dopisy|KSK]].
+  * Hladká, Zdeňka. 2006. [[en:cnk:ksk-dopisy|KSK]].
-  * Křen, Michal et al. 2015.  [[cnk:syn2015|SYN2015]].
+  * Křen, Michal et al. 2015.  [[en:cnk:syn2015|SYN2015]].
   * Straka, Milan & Jana Straková. 2014. Czech Models (CNEC) for NameTag. LINDAT/CLARIN ÚFAL MFF UK. [[http://hdl.handle.net/11858/00-097C-0000-0023-7D42-8]]
   * Straka, Milan & Jana Straková. 2016. Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115. LINDAT/CLARIN ÚFAL MFF UK. http://hdl.handle.net/11234/1-1836
Line 152: Line 153:
  
 <WRAP round tip 70%>
-Zasina, Adrian J., David Lukeš, Zuzana Komrsková, Petra Poukarová  & Anna Řehořková. 2018. Koditex (A corpus of diversified texts). Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague.
+Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: //Koditex: A corpus of diversified texts//. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz
 </WRAP>
+