| en:cnk:lemtag_mluv [2017/07/07 11:02] – [Alterations to the morphological dictionary] veronikapojarova | en:cnk:lemtag_mluv [2025/06/06 13:37] (current) – [Lemmatization and tagging in spoken corpora] martinawaclawicova |
|---|
| Given the size of the corpus, lemmatization in particular is indispensable: without it, it would be virtually impossible to find all forms of a given lemma (for example, reduced and dialectal forms). Because of the mode of transcription (namely the use of pause punctuation in place of syntactic punctuation), tools and procedures commonly used for written language cannot be applied. | Given the size of the corpus, lemmatization in particular is indispensable: without it, it would be virtually impossible to find all forms of a given lemma (for example, reduced and dialectal forms). Because of the mode of transcription (namely the use of pause punctuation in place of syntactic punctuation), tools and procedures commonly used for written language cannot be applied. |
| |
| The following method for lemmatization and morphological tagging has been used for the [[en:cnk:oral|ORAL]], [[en:cnk:ortofon|ORTOFON]] and [[en:cnk:dialekt|DIALEKT]] corpora. Its primary contribution lies in the lemmatization and tagging of word classes. However, users are advised to double-check the annotation when searching, especially in the case of morphological categories. | The following method for lemmatization and morphological tagging has been used for the [[en:cnk:oral|ORAL]], [[en:cnk:ortofon|ORTOFON]], [[en:cnk:orator|ORATOR]] and [[en:cnk:dialekt|DIALEKT]] corpora. Its primary contribution lies in the lemmatization and tagging of word classes. However, users are advised to double-check the annotation when searching, especially in the case of morphological categories. |
| |
| The lemmatization and tagging method used, although laborious, is only the first attempt to facilitate working with the extensive data of the ORAL, ORTOFON and DIALEKT corpora, and as such contains errors and inaccuracies, which in turn lead to more general questions regarding the notion of tagging spoken corpora. It is hoped that these will be solved in the upcoming versions by the creation of a new tagging scheme (NovaMorf) and tools for its implementation. Despite these shortcomings, this corpus provides vastly improved conditions for working with spoken data in comparison to previous corpora. | The lemmatization and tagging method used, although laborious, is only the first attempt to facilitate working with the extensive data of the ORAL, ORTOFON, ORATOR and DIALEKT corpora, and as such contains errors and inaccuracies, which in turn lead to more general questions regarding the notion of tagging spoken corpora. It is hoped that these will be solved in the upcoming versions by the creation of a new tagging scheme (NovaMorf) and tools for its implementation. Despite these shortcomings, this corpus provides vastly improved conditions for working with spoken data in comparison to previous corpora. |
| |
| **Concept of lemma** | **Concept of lemma** |
| **Tagging method** | **Tagging method** |
| |
| [[en:seznamy:tagy#pozice_1_-_slovni_druh|The morphological tagging system]] is the same as for written corpora; however, some tags for associated categories are retained (e.g. X for any gender, Y for masculine animate or inanimate, etc.), just as they are contained in the morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013). This dictionary was manually and semi-automatically supplemented with frequently unrecognised forms (e.g. dialectal suffixes, forms with varying vowel quantity, prothetic v). The stochastic tagging system MorphoDiTa (Straka et al., 2014) was used for the tagging itself. | [[seznamy:tagy#pozice_1_-_slovni_druh|The morphological tagging system]] (the description is in Czech only) is the same as for written corpora; however, some tags for associated categories are retained (e.g. X for any gender, Y for masculine animate or inanimate, etc.), just as they are contained in the morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013). This dictionary was manually and semi-automatically supplemented with frequently unrecognised forms (e.g. dialectal suffixes, forms with varying vowel quantity, prothetic v). The stochastic tagging system MorphoDiTa (Straka et al., 2014) was used for the tagging itself. |
| |
| ===== Modifications to the morphological dictionary ===== | ===== Modifications to the morphological dictionary ===== |
| |
| **Manual additions**: | **Manual additions**: |
| * assigning and merging **pronunciation variants** (e.g. //třeba, čovek, depák; dokavád, dovaď, dovad//) under one lemma | * assigning and merging **pronunciation variants** (e.g. //třeba, čovek, depák; dokavád, dovaď, dovad//) under a single lemma |
| * assigning **dialectal forms** (//dňama, Davidoj, ňou//) to the standard lemma | * assigning **dialectal forms** (//dňama, Davidoj, ňou//) to a standard lemma |
| |
| |
| **Removal of some interpretations**: | **Removal of selected interpretations**: |
| * removal of the interpretation of the expression as an adverb: //prostě// | * removal of the expression's adverbial interpretation: //prostě// |
| * removal of the interpretation of the expression as an imperative: //viď// | * removal of the expression's imperative interpretation: //viď// |
| * removal of the interpretation of the expression as a vocative: //pote// (reduced pronunciation of //pojďte//) | * removal of the expression's vocative interpretation: //pote// (reduced pronunciation of //pojďte//) |
| |
| **Addition of some interpretations** | **Addition of selected interpretations** |
| * addition of the particle category: //jen// (originally adverb only) | * addition of the particle category: //jen// (originally only an adverb) |
| * change of interpretation: //puč// is not tagged as a noun; it is a reduced pronunciation of the imperative of the verb //půjčit// | * new interpretation: //puč// is no longer tagged as a noun but as the imperative of the verb //půjčit// with reduced pronunciation |
| ===== Form of the lemma ===== | ===== Lemma forms ===== |
| |
| * most words have a lemma in the form of the **standard lemma**, i.e. the same as in written language, even in cases where the regional form is more frequent (e.g. the lemma **//týden//** covers all forms of the regional variants //tejden, tyden, tydeň, tédeň//) | * most words have a lemma in the form of a **standard lemma**, i.e. the same as in written language, even in cases where the regional form has a higher frequency (e.g. the lemma **//týden//** subsumes all forms of the regional variants //tejden, tyden, tydeň, tédeň//) |
| * words with **two standard forms** have a multiple lemma (//polívka/polévka//) | * words with a **dual standard form** have a multiple lemma (//polívka/polévka//) |
| * words for which **individual forms cannot be unambiguously assigned** also have a multiple lemma (//myslet/myslit, muset/musit//) | * words whose **forms cannot be unambiguously assigned to one lemma** also have a multiple lemma (//myslet/myslit, muset/musit//) |
| * **abbreviations** have a multiple lemma: //SMS/esemeska, endéer/NDR// | * **abbreviations** have a multiple lemma: //SMS/esemeska, endéer/NDR// |
| |
| The multiple lemma works as a multi-value, which means that entering any one of the alternatives always returns all forms assigned to the multiple lemma. | The multiple lemma functions as a multi-value, which means that if we enter any one of the forms, the search returns all of the forms assigned to the multiple lemma. |
| |
| |
| ===== Tag forms ===== | ===== Tag forms ===== |
| |
| The form of the tags corresponds to that of the [[en:seznamy:tagy#pozice_1_-_slovni_druh|morphological tags]] used in the written corpora of the [[en:cnk:syn|SYN]] series before the simplification of the tagging system and does not include aspect in the 16th position. | The form of the tags corresponds to that of the [[seznamy:tagy#pozice_1_-_slovni_druh|morphological tags]] (Czech only) used in the written corpora of the [[en:cnk:syn|SYN]] series before the simplification of the tagging system and does not include aspect in the 16th position. |
| Apart from these tags, the first position for the word class and the POS attribute can have the following values: | Apart from these tags, the first position for the word class and the POS attribute can have the following values: |
| |
| |
| ===== Acknowledgements ===== | ===== Acknowledgements ===== |
| We would like to thank doc. Klára Osolsobě and Mgr. Dana Hlaváčková, Ph.D. for providing valuable consultation. | We would like to thank doc. Klára Osolsobě and Dr. Dana Hlaváčková for providing valuable consultations. |
| |
| ===== Sources ===== | ===== Sources ===== |
| |
| <WRAP round box 72%> | <WRAP round box 72%> |
| [[en:cnk:oral|ORAL]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:dialekt|Dialekt]] • [[en:pojmy:mluveny|Spoken language corpus]] • [[en:pojmy:atributy_strukturni#strukturni_atributy_korpusu_rady_oral|Structure of the ORAL corpora]] • [[en:kurz:hledani_v_mluvenych_korpusech|Searching in spoken corpora]] • [[en:kurz:hledani_ORTOFON|Searching in the ORTOFON corpus]] • [[en:cnk:dialekt:prace|Searching in the DIALEKT corpus]] | [[en:cnk:oral|ORAL]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:dialekt|DIALEKT]] |
| </WRAP> | </WRAP> |
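
The multi-value behaviour of the multiple lemma described under "Lemma forms" can be illustrated with a minimal sketch. It is an illustration only, not the corpus manager's actual implementation: the function names, the toy token list and the use of `/` as the stored separator between alternatives are all assumptions made for this example.

```python
# Hypothetical illustration of a multi-value lemma attribute.
# Assumption: alternatives of a multiple lemma are stored in one string,
# separated by "/" (the real storage format is not described in the text).

def lemma_matches(query: str, lemma_attr: str, sep: str = "/") -> bool:
    """Return True if `query` equals any alternative of the lemma attribute."""
    return query in lemma_attr.split(sep)


# Toy data: (word form, lemma attribute). Lemmas are taken from the examples
# in the text; the token list itself is invented for this sketch.
TOKENS = [
    ("polívka", "polívka/polévka"),
    ("polévku", "polívka/polévka"),
    ("tejden", "týden"),
    ("tydeň", "týden"),
    ("esemesku", "SMS/esemeska"),
]


def search_by_lemma(query, tokens=TOKENS):
    """Collect all word forms whose lemma attribute contains `query`."""
    return [form for form, lemma in tokens if lemma_matches(query, lemma)]


# Either alternative of a multiple lemma retrieves the same set of forms.
print(search_by_lemma("polévka"))
print(search_by_lemma("polívka"))
# A plain standard lemma gathers its regional variants.
print(search_by_lemma("týden"))
```

In this sketch, querying either `polívka` or `polévka` yields the same forms, which mirrors the statement that entering any one alternative returns all forms assigned to the multiple lemma.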