| en:cnk:lemtag_mluv [2017/07/18 15:12] – [Lemmatization and tagging in spoken corpora] michalkren | en:cnk:lemtag_mluv [2025/06/06 13:37] (current) – [Lemmatization and tagging in spoken corpora] martinawaclawicova |
|---|
| Given the size of the corpus, lemmatization in particular is indispensable: without it, it would be virtually impossible to find all forms of a given lemma (for example, reduced and dialectal forms). Because of the mode of transcription (namely the use of pause punctuation in place of syntactic punctuation), tools and procedures commonly used for written language cannot be applied. | Given the size of the corpus, lemmatization in particular is indispensable: without it, it would be virtually impossible to find all forms of a given lemma (for example, reduced and dialectal forms). Because of the mode of transcription (namely the use of pause punctuation in place of syntactic punctuation), tools and procedures commonly used for written language cannot be applied. |
| |
| The following method for lemmatization and morphological tagging has been used for the [[en:cnk:oral|ORAL]], [[en:cnk:ortofon|ORTOFON]] and [[en:cnk:dialekt|DIALEKT]] corpora. Its primary contribution lies in the lemmatization and tagging of word classes. However, users are advised to double-check the annotation carefully when searching, especially where morphological categories are concerned. | The following method for lemmatization and morphological tagging has been used for the [[en:cnk:oral|ORAL]], [[en:cnk:ortofon|ORTOFON]], [[en:cnk:orator|ORATOR]] and [[en:cnk:dialekt|DIALEKT]] corpora. Its primary contribution lies in the lemmatization and tagging of word classes. However, users are advised to double-check the annotation carefully when searching, especially where morphological categories are concerned. |
| |
| The lemmatization and tagging method used, although laborious, is only a first attempt to facilitate work with the extensive data of the ORAL, ORTOFON and DIALEKT corpora, and as such contains errors and inaccuracies, which in turn raise more general questions about how spoken corpora should be tagged. It is hoped that these will be resolved in upcoming versions through the creation of a new tagging scheme (NovaMorf) and tools for its implementation. Despite these shortcomings, this corpus provides vastly improved conditions for working with spoken data in comparison to previous corpora. | The lemmatization and tagging method used, although laborious, is only a first attempt to facilitate work with the extensive data of the ORAL, ORTOFON, ORATOR and DIALEKT corpora, and as such contains errors and inaccuracies, which in turn raise more general questions about how spoken corpora should be tagged. It is hoped that these will be resolved in upcoming versions through the creation of a new tagging scheme (NovaMorf) and tools for its implementation. Despite these shortcomings, this corpus provides vastly improved conditions for working with spoken data in comparison to previous corpora. |
| |
| **Concept of lemma** | **Concept of lemma** |