

This shows you the differences between two versions of the page.

en:cnk:lemtag_mluv [2017/07/07 11:10] – [Modifications to the morphological dictionary] veronikapojarova
en:cnk:lemtag_mluv [2017/07/18 15:12] (current) – [Lemmatization and tagging in spoken corpora] michalkren
Line 14:
 **Tagging method**
-[[en:seznamy:tagy#pozice_1_-_slovni_druh|The morphological tagging system]] is the same as for written corpora; however, some tags for associated categories are retained (e.g. X for any gender, Y for masculine animate or inanimate, etc.), just as they are contained in the morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013). This dictionary was manually and semi-automatically supplemented with frequently unrecognised forms (e.g. dialectal suffixes, forms with varying quantity, prothetic v). The stochastic tagging system MorphoDiTa (Straka et al., 2014) was used for the tagging itself.
+[[seznamy:tagy#pozice_1_-_slovni_druh|The morphological tagging system]] (the description is in Czech only) is the same as for written corpora; however, some tags for associated categories are retained (e.g. X for any gender, Y for masculine animate or inanimate, etc.), just as they are contained in the morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013). This dictionary was manually and semi-automatically supplemented with frequently unrecognised forms (e.g. dialectal suffixes, forms with varying quantity, prothetic v). The stochastic tagging system MorphoDiTa (Straka et al., 2014) was used for the tagging itself.
 ===== Modifications to the morphological dictionary =====
Line 39:
   * addition of the particle category: //jen// (originally only adverb)
   * new interpretation: //puč// is no longer a noun, but the imperative of the verb //půjčit// with reduced pronunciation
-===== Podoba lemmatu =====
+===== Lemma forms =====
-  * většina slov má lemma v podobě **spisovného lemmatu**, tedy stejnou jako v psaném jazyce, a to v případech, kdy regionální podoba frekvenčně převažuje (např. pod lemma **//týden//** spadají všechny tvary regionálních variant //tejden, tyden, tydeň, tédeň//)
+  * most words have a lemma in the form of a **standard lemma**, i.e. the same as in written language, even in cases where the regional form has a higher frequency (e.g. the lemma **//týden//** subsumes all forms of the regional variants //tejden, tyden, tydeň, tédeň//)
-  * slova s **dvojí spisovnou podobou** mají vícenásobné lemma (//polívka/polévka//)
+  * words with a **dual standard form** have a multiple lemma (//polívka/polévka//)
-  * slova, u nichž **nelze jednoznačně přiřadit jednotlivé tvary**, mají také vícenásobné lemma (//myslet/myslit, muset/musit//)
+  * words whose **individual forms cannot be unambiguously assigned** to one variant also have a multiple lemma (//myslet/myslit, muset/musit//)
-  * **zkratky** mají vícenásobné lemma: //SMS/esemeska, endéer/NDR//
+  * **abbreviations** have a multiple lemma: //SMS/esemeska, endéer/NDR//
-Vícenásobné lemma funguje jako multihodnota, to znamená, že při zadání jedné z možností vždy dostaneme všechny tvary přiřazené k vícenásobnému lemmatu.
+The multiple lemma functions as a multi-value, which means that if we enter any one of the variants, the search returns all of the forms assigned to the multiple lemma.
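The multi-value behaviour of a multiple lemma can be illustrated with a minimal Python sketch. It is only a toy model of the lookup described here, not the corpus manager's actual implementation: the token list, the ``/`` separator handling and the helper names ``build_index`` and ``forms_for`` are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy data, invented for illustration: (word form, lemma as stored in the corpus).
TOKENS = [
    ("polívka", "polívka/polévka"),
    ("polévku", "polívka/polévka"),
    ("polívky", "polívka/polévka"),
    ("myslel", "myslet/myslit"),
    ("myslil", "myslet/myslit"),
    ("tejden", "týden"),
    ("týdne", "týden"),
]


def build_index(tokens):
    """Map every variant of a (possibly multiple) lemma to all forms stored under it."""
    index = defaultdict(set)
    for form, lemma in tokens:
        # Assumption: the variants of a multiple lemma are separated by "/".
        for variant in lemma.split("/"):
            index[variant].add(form)
    return index


def forms_for(query, index):
    """Return all forms assigned to the lemma that has `query` as one of its variants."""
    return sorted(index.get(query, set()))


INDEX = build_index(TOKENS)

# Querying either variant yields the same set of forms.
print(forms_for("polévka", INDEX))
print(forms_for("polívka", INDEX))
```

In this sketch a regional form such as //tejden// is reachable only through its standard lemma //týden//, which mirrors the first bullet above.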
 ===== Tag forms =====
-The form of the tags corresponds to that of the [[en:seznamy:tagy#pozice_1_-_slovni_druh|morphological tags]] used in the [[en:cnk:syn|SYN]] series written corpora before the simplification of the tagging system and does not include aspect in the 16th position.
+The form of the tags corresponds to that of the [[seznamy:tagy#pozice_1_-_slovni_druh|morphological tags]] (Czech only) used in the [[en:cnk:syn|SYN]] series written corpora before the simplification of the tagging system and does not include aspect in the 16th position.
 Apart from these tags, the first position for the word class and the POS attribute can have the following values:
Line 59:
 ===== Acknowledgements =====
-We would like to thank doc. Klára Osolsobě and Mgr. Dana Hlaváčková, Ph.D. for providing valuable consultations.
+We would like to thank doc. Klára Osolsobě and Dr. Dana Hlaváčková for providing valuable consultations.
 ===== Sources =====
Line 73:
 <WRAP round box 72%>
-[[en:cnk:oral|ORAL]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:dialekt|Dialekt]] • [[en:pojmy:mluveny|Spoken language corpus]] • [[en:pojmy:atributy_strukturni#strukturni_atributy_korpusu_rady_oral|Structure of the ORAL corpora]] • [[en:kurz:hledani_v_mluvenych_korpusech|Searching in spoken corpora]] • [[en:kurz:hledani_ORTOFON|Searching in the ORTOFON corpus]] • [[en:cnk:dialekt:prace|Searching in the DIALEKT corpus]]
+[[en:cnk:oral|ORAL]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:dialekt|DIALEKT]]
 </WRAP>