Lemmatization and tagging in spoken corpora

Lemmatizing and tagging a transcription of spoken language is much more demanding than doing so for written language. There is a larger number of unknown forms (reduced pronunciations, dialectal forms, neologisms), and these can be homonymous with forms contained in the morphological dictionary for written language (e.g. pudu, a recorded pronunciation of the verbal form půjdu, is homonymous with the dative and locative forms of the noun pud and with the recorded shortened pronunciation of the genitive plural of the noun půda). The distinct structure of informal dialogue, which is characterized by many unfinished, interrupted and modified utterances, repeated words, filler sounds, references to the extralinguistic context etc., makes the identification of morphological categories difficult even for linguists. Given the size of the corpus, lemmatization in particular is indispensable: without it, it would be virtually impossible to find all forms of a given lemma (for example reduced and dialectal forms). Because of the mode of transcription (namely the use of pause punctuation in place of syntactic punctuation), tools and procedures commonly used for written language cannot be used.

The following method for lemmatization and morphological tagging has been used for the ORAL, ORTOFON, ORATOR and DIALEKT corpora. Its primary contribution lies in the lemmatization and the tagging of word classes. Users are nevertheless advised to double-check the annotation when searching, especially where morphological categories are concerned.

The lemmatization and tagging method used, although laborious, is only a first attempt to facilitate working with the extensive data of the ORAL, ORTOFON, ORATOR and DIALEKT corpora. As such it contains errors and inaccuracies, which in turn raise more general questions about the notion of tagging spoken corpora. It is hoped that these will be resolved in upcoming versions by the creation of a new tagging scheme (NovaMorf) and tools for its implementation. Despite these shortcomings, this corpus provides vastly improved conditions for working with spoken data in comparison to previous corpora.

Concept of lemma

The concept of lemma is broader than in written language. The main priority is to be able to find all forms of a given word, including forms recorded with reduced pronunciation as well as dialectal forms that could otherwise have a separate lemma (e.g. týden – tejden – tédeň – tydeň). Large-scale variation is especially typical of demonstrative pronouns: the lemma tenhleten, for example, covers 105 word forms (the nom. sg. neut. alone can be written in the following ten ways: tohleto, todnecto, todleto, todlecto, todlencto, tohlencto, tohlento, toleto, tohlensto, todlensto).
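The practical effect of this broad lemma can be sketched in a few lines of Python. The table below is purely illustrative: it contains only a handful of forms quoted in the text, and the names VARIANT_TO_LEMMA, forms_of and count_by_lemma are hypothetical, not part of any corpus tool.

```python
from collections import defaultdict

# Illustrative variant-to-lemma table built only from examples in the text.
# The real lemma tenhleten covers 105 forms; three are listed here.
VARIANT_TO_LEMMA = {
    "týden": "týden",
    "tejden": "týden",
    "tédeň": "týden",
    "tydeň": "týden",
    "tohleto": "tenhleten",
    "todleto": "tenhleten",
    "tohlencto": "tenhleten",
}


def forms_of(lemma, table=VARIANT_TO_LEMMA):
    """Return every recorded word form assigned to the given lemma."""
    return sorted(form for form, lem in table.items() if lem == lemma)


def count_by_lemma(tokens, table=VARIANT_TO_LEMMA):
    """Count tokens per lemma; forms missing from the table are skipped."""
    counts = defaultdict(int)
    for tok in tokens:
        lemma = table.get(tok.lower())
        if lemma is not None:
            counts[lemma] += 1
    return dict(counts)
```

With such a table, a query for the lemma týden would also retrieve tejden, tédeň and tydeň, which a search for the written form alone would miss.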

Tagging method

The morphological tagging system (the description is in Czech only) is the same as for written corpora; however, some tags for associated categories are retained (e.g. X for any gender, Y for masculine animate or inanimate etc.) just as they are contained in the morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013). This dictionary was manually and semiautomatically supplemented with frequently unrecognised forms (e.g. dialectal suffixes, forms with varying quantity, prothetic v). The stochastic tagging system MorphoDiTa (Straka et al., 2014) was used for the tagging itself.

Modifications to the morphological dictionary

The original morphological dictionary MorfFlex CZ (Hajič–Hlaváčová, 2013) was supplemented and edited both manually and semiautomatically, and selected interpretations of grammatical categories were omitted with regard to the target register (e.g. in spoken data the form bej represents only a reduced variant of the verb být, not a noun). Unrecognised forms were added if they occurred at least 5 times. Examples of some modifications:

Semiautomatic additions:

dialectal suffixes such as the acc. sg. fem. ending -u (nedělu, chvilu) and past active participle forms (dělale, chodile)
variants differing in vowel quantity (myslim, vim, makem, polivka), palatalization (tydeň), the presence of a prothetic v- (vokýnko, vobrazovka)
“mapping” unknown forms to familiar forms (with all of their morphological interpretations)
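A minimal sketch of how such a semiautomatic mapping might work is given below. It is not the procedure actually used for the corpora. It assumes a toy dictionary KNOWN with placeholder tags ("NN", "VB"), handles only prothetic v- and single-vowel quantity differences (palatalization and suffix variation are left out), and applies the threshold of 5 occurrences mentioned above. All names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the written-language dictionary:
# form -> set of (lemma, tag) readings. Tags are placeholders.
KNOWN = {
    "okýnko": {("okýnko", "NN")},
    "myslím": {("myslet", "VB")},
    "vím": {("vědět", "VB")},
    "polívka": {("polívka", "NN")},
}

# Short vowel -> possible long counterparts (u can lengthen to ú or ů).
LENGTHEN = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "úů", "y": "ý"}
MIN_FREQ = 5  # threshold stated in the text


def candidates(form):
    """Yield written-language spellings the spoken form might stand for."""
    # Prothetic v-: vokýnko -> okýnko
    if form.startswith("vo"):
        yield form[1:]
    # Vowel quantity: lengthen one short vowel at a time (myslim -> myslím)
    for i, ch in enumerate(form):
        for long_vowel in LENGTHEN.get(ch, ""):
            yield form[:i] + long_vowel + form[i + 1:]


def map_unknown(tokens, known=KNOWN, min_freq=MIN_FREQ):
    """Map frequent unknown forms to all readings of matching known forms."""
    additions = {}
    for form, n in Counter(tokens).items():
        if n < min_freq or form in known:
            continue
        readings = set()
        for cand in candidates(form):
            readings |= known.get(cand, set())
        if readings:
            additions[form] = readings
    return additions
```

The unknown form inherits every reading of the known form it is mapped to, which mirrors the "with all of their morphological interpretations" behaviour described in the list; forms below the frequency threshold are left untouched.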

Manual additions:

assigning and merging pronunciation variants (e.g. třeba, čovek, depák; dokavád, dovaď, dovad) into one single lemma
assigning dialectal forms (dňama, Davidoj, ňou) to a standard lemma

Removal of selected interpretations:

removal of the expression's adverbial interpretation: prostě
removal of the expression's imperative interpretation: viď
removal of the expression's vocative interpretation: pote (reduced pronunciation of pojďte)

Addition of selected interpretations:

addition of the particle category: jen (originally only adverb)
new interpretation: puč is no longer a noun, but the imperative of the verb půjčit with reduced pronunciation
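The removal and addition of interpretations can be pictured as edits to a mapping from word forms to sets of readings. The sketch below is illustrative only: the lexicon entries are simplified from the examples in the text (they are not actual MorfFlex CZ records), the word-class labels are plain strings rather than positional tags, and the helper names are hypothetical.

```python
def remove_reading(lexicon, form, word_class):
    """Drop every reading of `form` that has the given word class."""
    lexicon[form] = {r for r in lexicon.get(form, set()) if r[1] != word_class}


def add_reading(lexicon, form, lemma, word_class):
    """Add a (lemma, word class) reading for `form`."""
    lexicon.setdefault(form, set()).add((lemma, word_class))


# Simplified, illustrative starting state: form -> {(lemma, word class)}.
lexicon = {
    "jen": {("jen", "adverb")},
    "puč": {("puč", "noun")},
    "bej": {("bej", "noun"), ("být", "verb")},
}

# jen: the particle category is added alongside the adverb reading.
add_reading(lexicon, "jen", "jen", "particle")

# puč: no longer a noun, but a reduced imperative of půjčit.
remove_reading(lexicon, "puč", "noun")
add_reading(lexicon, "puč", "půjčit", "verb")

# bej: only the reduced variant of být is kept for spoken data.
remove_reading(lexicon, "bej", "noun")
```

Restricting the readings before tagging means the stochastic tagger never has to choose an interpretation that is implausible in the spoken register.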

Lemma forms

most words have the standard lemma, i.e. the same lemma as in written language, even in cases where a regional form has a higher frequency (e.g. the lemma týden subsumes all regional variant forms tejden, tyden, tydeň, tédeň)
words with a dual standard form have a multiple lemma (polívka/polévka)
words which cannot be unambiguously assigned one specific form also have a multiple lemma (myslet/myslit, muset/musit)
abbreviations have a multiple lemma: SMS/esemeska, endéer/NDR

The multiple lemma functions as a multi-value, which means that if we enter any one of the forms, the search returns all of the forms assigned to the multiple lemma.
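The multi-value behaviour can be illustrated with a small Python sketch. It assumes that the values of a multiple lemma are separated by a slash, as they are displayed in the text; how the corpus manager stores them internally is not specified here, and the token list and the function name search_lemma are hypothetical.

```python
# Illustrative (word form, lemma) pairs; the word forms are made up
# for the example, the lemmas are taken from the text.
TOKENS = [
    ("polívku", "polívka/polévka"),
    ("polévka", "polívka/polévka"),
    ("musim", "muset/musit"),
    ("týden", "týden"),
]


def search_lemma(query, tokens):
    """Return word forms whose lemma has `query` as one of its values."""
    return [word for word, lemma in tokens if query in lemma.split("/")]
```

Here search_lemma("polévka", TOKENS) and search_lemma("polívka", TOKENS) return the same tokens, since each value of the multiple lemma matches on its own.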

Tag forms

The form of the tags corresponds to that of the morphological tags (Czech only) used in the SYN series written corpora before the simplification of the tagging system and does not include aspect in the 16th position. Apart from these tags, the first position for the word class and the POS attribute can have the following values:

F for unfinished words (e.g. nepoda*)
H for non-verbal sounds (hesitations, marked @; responsive sounds such as hmm, emm)
M for comments (always in round brackets)
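As a rough illustration, the three special values could be assigned from the surface shape of a transcribed token. This is a sketch under assumptions, not the corpus pipeline: the set of responsive sounds contains only the two examples given above, the function name special_pos is hypothetical, and the comment token used in the usage note is invented.

```python
# Only the responsive sounds named in the text; the real inventory
# is presumably larger.
RESPONSIVE_SOUNDS = {"hmm", "emm"}


def special_pos(token):
    """Return 'F', 'H' or 'M' for the special token types, else None."""
    # Comments are always enclosed in round brackets.
    if len(token) >= 2 and token.startswith("(") and token.endswith(")"):
        return "M"
    # Hesitations are marked @; responsive sounds are listed explicitly.
    if token == "@" or token in RESPONSIVE_SOUNDS:
        return "H"
    # Unfinished words end with an asterisk, e.g. nepoda*
    if len(token) > 1 and token.endswith("*"):
        return "F"
    return None
```

For example, special_pos("nepoda*") yields F, special_pos("@") yields H, and a bracketed comment such as the invented "(smích)" yields M; ordinary words return None and keep the word class assigned by the tagger.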

Acknowledgements

We would like to thank doc. Klára Osolsobě and Dr. Dana Hlaváčková for providing valuable consultations.

Sources

Kopřivová, M. - Lukeš, D. - Komrsková, Z. - Poukarová, P.: Korpus ORAL: sestavení, lemmatizace a morfologické značkování. In Korpus - Gramatika - Axiologie 2017 (in print).

Lukeš, D. - Klimešová, P. - Komrsková, Z. - Kopřivová, M. (2015): Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger. In: TSD 2015, Ed. P. Král and V. Matoušek. Springer International Publishing, 342-350.