Lemma

A lemma is the representative dictionary form of a word. In lemmatization during automatic language processing, it is the form assigned to every form of the given word in the corpus.

Approaches to lemmatization can differ in specific details, but it is generally the case that:

the lemma of every Czech noun is its nom. sg. (the forms lesům, lesy, lesích have the lemma les)
for adjectives it is nom. sg. masc. (the forms chytrého, chytrou, chytrejma have the lemma chytrý)
for verbs it is the infinitive (the forms chodil, chodíš, chodíme have the lemma chodit)
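
The conventions above can be illustrated with a minimal sketch. It assumes a hand-written lookup table containing only the example forms listed here; the names FORM_TO_LEMMA, lemmatize and group_by_lemma are hypothetical, and a real lemmatizer would rely on a full morphological dictionary rather than such a table.

```python
from collections import defaultdict

# Toy lookup table built only from the examples in the list above.
# The table and function names are illustrative assumptions.
FORM_TO_LEMMA = {
    # nouns -> nom. sg.
    "lesům": "les", "lesy": "les", "lesích": "les",
    # adjectives -> nom. sg. masc.
    "chytrého": "chytrý", "chytrou": "chytrý", "chytrejma": "chytrý",
    # verbs -> infinitive
    "chodil": "chodit", "chodíš": "chodit", "chodíme": "chodit",
}


def lemmatize(form):
    """Return the lemma of a known form; fall back to the form itself."""
    return FORM_TO_LEMMA.get(form.lower(), form)


def group_by_lemma(forms):
    """Collect the given forms under their lemmas."""
    groups = defaultdict(list)
    for form in forms:
        groups[lemmatize(form)].append(form)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_lemma(["lesy", "chodil", "lesům", "chytrou"]))
```

Falling back to the form itself for unknown words is only one possible design choice; a tool could equally well mark such forms as unrecognized.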

The lemma as a unit originates from an abstraction of a word form's morphological characteristics, and represents a set of forms which have the same root and differ only in their respective morphological affixes or orthographic form. In some approaches, the selected morphological variants are also associated with the lemma.

Sublemma

Starting with the SYN2020 corpus, Czech corpora feature two-level lemmatization: each form is given a sublemma attribute in addition to the lemma attribute. While a lemma may include multiple variants of a single word (e.g. the lemma filozof represents all forms with both filozof and filosof stems), sublemmas delimit subgroups of forms according to this alternation (the sublemma filozof represents only forms with the stem filozof, while the sublemma filosof represents only forms with the stem filosof). If the word has no variants, the sublemma is identical to the lemma (e.g. the lemma kniha represents the same set of forms as the sublemma kniha).

Different types of variants are handled as sublemmas (e.g. mýdlo/mejdlo, okno/vokno, citron/citrón, email/e-mail, myslet/myslit, mýt/mejt, péci/péct/píct, kuchyně/kuchyň, antivirus/antivir, sedm/sedum, tenhle/tendle/tenle, ačkoli/ačkoliv, proper names Robert/Róbert/Roberto, Atény/Athény). Sublemmas are also used to distinguish some specific groups of forms that are subsumed under one lemma (e.g. negated forms of adjectives and adverbs černý/nečerný, hezky/nehezky, short forms of adjectives mladý/mlád, suppletive forms dobře/lépe/líp, člověk/lidé).
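
The two-level scheme described in this section can be sketched as a token record carrying both attributes. This is only an illustration: the Token class, the forms_by helper and the sample word forms are assumptions made for this example and do not reproduce the actual corpus format or query syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str      # word form as it occurs in the text
    lemma: str     # covers all variants of the word
    sublemma: str  # covers only one variant


# Hypothetical sample; the forms were chosen for illustration only.
TOKENS = [
    Token("filozofa", "filozof", "filozof"),
    Token("filosofa", "filozof", "filosof"),
    Token("filozofové", "filozof", "filozof"),
    Token("knihy", "kniha", "kniha"),  # no variants: sublemma == lemma
]


def forms_by(tokens, attr, value):
    """Return the forms whose attribute `attr` equals `value`."""
    return [t.form for t in tokens if getattr(t, attr) == value]


if __name__ == "__main__":
    print(forms_by(TOKENS, "lemma", "filozof"))     # both variants
    print(forms_by(TOKENS, "sublemma", "filosof"))  # one variant only
```

Searching by lemma returns forms of both stems, while searching by sublemma narrows the result to a single variant.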

The link between a lemma and lexeme

In certain respects a lemma is the same as a lexeme, differing only in the fact that the lemma is always a single-word unit and that in most corpora each word form occurrence is assigned exactly one lemma. Since a single lemma typically covers several word forms, the number of distinct lemmas in every corpus is smaller than the number of distinct word forms (e.g. in the 100-million-word corpus SYN2010 we find 1.7 million different word forms, but only 786,000 different lemmas).
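
The comparison of distinct word forms and distinct lemmas can be sketched as follows. The sketch assumes the corpus is available as a sequence of (form, lemma) pairs; the function name count_types and the tiny sample are hypothetical and unrelated to the SYN2010 figures quoted above.

```python
def count_types(tokens):
    """Count distinct word forms and distinct lemmas.

    `tokens` is an iterable of (form, lemma) pairs, one per occurrence.
    """
    forms = set()
    lemmas = set()
    for form, lemma in tokens:
        forms.add(form)
        lemmas.add(lemma)
    return len(forms), len(lemmas)


# Hypothetical six-token sample (one form occurs twice).
SAMPLE = [
    ("lesy", "les"), ("lesům", "les"), ("lesích", "les"),
    ("chodil", "chodit"), ("chodíš", "chodit"), ("lesy", "les"),
]

if __name__ == "__main__":
    n_forms, n_lemmas = count_types(SAMPLE)
    print(n_forms, n_lemmas)
```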

The relation between a lemma and meaning

A lemma should be the basic bearer of a unit's lexical meaning. This is why corpus-based dictionaries are compiled on the basis of lemmas. At the same time, increasing emphasis is placed on an approach which points out that meaning is closely linked to a morphologically defined form, and that the lemma, with its excessive abstraction, neglects some important semantic distinctions between forms.

Hyperlemmas and lemmatization of diachronic texts

Approaches to lemmatization can differ in selected cases. One such case is the processing of diachronic, dialectological or spoken corpora, where the need to assemble word forms under one unit can be influenced by criteria other than simply falling under one morphological paradigm. However, it is always the case that a lemma is only a tool for more accessible searching, and not for the description or interpretation of language data.

For the diachronic corpus DIAKORP, lemmatization with the help of so-called hyperlemmas is planned. This will enable users to find all occurrences of a given lexeme regardless of its various forms (historical and orthographic variants).

Lemmatization

Lemmatization is a part of the process of morphological (incl. word class) annotation. The principle of lemmatization is the assignment of a lemma to one word form (or a group of word forms) in the corpus.

Lemmatization is typically part of the context-based disambiguation of word forms in a text. It is simple and independent of context if the lemmatized word form belongs to the paradigm of a single lexeme: e.g. the verb form believed will be assigned the lemma believe as a representative form of the verbal lexeme regardless of context.

On the other hand, automatic lemmatization is problematic when the lemmatized word form is homonymous, i.e. it belongs to the paradigms of more than one lexeme: e.g. the form saw belongs both to the paradigm of the verbal lexeme see and to the paradigm of the noun lexeme saw. In this case the assigned lemma is decided based on context.
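
The difference between unambiguous and homonymous forms can be sketched with a deliberately naive rule. The sketch assumes that a form preceded by a determiner is a noun and otherwise a verb; this heuristic, the word lists and all names (CANDIDATES, DETERMINERS, guess_pos, lemmatize_in_context) are illustrative assumptions, whereas real taggers use much richer context.

```python
# Candidate readings per form: word class -> lemma (toy data).
CANDIDATES = {
    "saw": {"VERB": "see", "NOUN": "saw"},  # homonymous form
    "believed": {"VERB": "believe"},        # single lexeme
}

# Small, incomplete determiner list used by the toy heuristic.
DETERMINERS = {"a", "the", "this", "that", "my", "his", "her"}


def guess_pos(tokens, i):
    """Naive assumption: a determiner before the form signals a noun."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return "NOUN" if prev in DETERMINERS else "VERB"


def lemmatize_in_context(tokens, i):
    """Return the lemma of tokens[i], consulting context only if needed."""
    form = tokens[i].lower()
    readings = CANDIDATES.get(form)
    if readings is None:
        return form                           # unknown form: keep as is
    if len(readings) == 1:
        return next(iter(readings.values()))  # context-independent case
    return readings[guess_pos(tokens, i)]     # homonymous case


if __name__ == "__main__":
    print(lemmatize_in_context("She saw the film".split(), 1))
    print(lemmatize_in_context("He bought a saw".split(), 3))
```

The context is consulted only when a form has more than one candidate reading, which mirrors the distinction drawn above between context-independent and context-dependent lemmatization.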

Problems with lemmatization

One of the biggest linguistic and computational problems is the lemmatization of multiword expressions. Another unsolved problem of automatic lemmatization is that all forms are assigned to one lemma even in cases where this is not appropriate. For example, Cheers! is lemmatized as cheer, although no registered meaning of the word cheer corresponds to its pragmatic meaning; such a distinction does not fall under strictly morphological lemmatization.

The lemmatization process

Automatic lemmatization is done by a computer program called a lemmatizer, which is often part of a morphological tagger carrying out the disambiguation of the text. The purpose of lemmatization is firstly to identify the appropriate lexeme for a homonymous word form in a given context, and secondly to enable the user to work not only with word forms but also with lemmas as representations of the given lexemes and their paradigms, all of which facilitates work with the corpus.