Lemma

A lemma is the representative dictionary form of a word. In the process of lemmatization during automatic language processing, it is the form assigned to every form of the given word in the corpus.

Approaches to lemmatization can differ in specific details, but it is generally the case that:

the lemma of every Czech noun is its nom. sg. (the forms lesům, lesy, lesích have the lemma les)
for adjectives it is nom. sg. masc. (the forms chytrého, chytrou, chytrejma have the lemma chytrý)
for verbs it is the infinitive (the forms chodil, chodíš, chodíme have the lemma chodit)
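
The conventions listed above can be illustrated with a minimal sketch. It assumes a hand-written lookup table containing only the example forms from the list; the names FORM_TO_LEMMA and lemmatize are hypothetical, and a real lemmatizer would rely on a full morphological dictionary rather than a table like this.

```python
# Toy form-to-lemma table built only from the examples in the list above.
# FORM_TO_LEMMA and lemmatize are illustrative names, not part of any real tool.
FORM_TO_LEMMA = {
    # nouns: the lemma is the nominative singular
    "lesům": "les", "lesy": "les", "lesích": "les",
    # adjectives: the lemma is the nominative singular masculine
    "chytrého": "chytrý", "chytrou": "chytrý", "chytrejma": "chytrý",
    # verbs: the lemma is the infinitive
    "chodil": "chodit", "chodíš": "chodit", "chodíme": "chodit",
}


def lemmatize(form):
    """Return the lemma of a known form; unknown forms are returned unchanged."""
    return FORM_TO_LEMMA.get(form.lower(), form)


print(lemmatize("lesích"))     # les
print(lemmatize("chytrejma"))  # chytrý
```

Returning the form itself for unknown words is only one possible fallback, chosen here to keep the sketch short.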

The lemma as a unit arises by abstracting away from a word form's morphological characteristics. It represents a set of forms which share the same root and differ only in their morphological affixes or orthographic form. In some approaches, selected morphological variants are also associated with the lemma.

The link between a lemma and a lexeme

In certain respects a lemma is the same as a lexeme. The difference is that a lemma is always a single-word unit and that, in most corpora, each word form occurrence is assigned exactly one lemma (a 1:1 ratio). Therefore, in every corpus the number of different lemmas is smaller than the number of different word forms (e.g. in the 100-million-word corpus SYN2010 we find 1.7 million different word forms, but only 786,000 different lemmas).
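
The comparison of distinct word forms and distinct lemmas can be sketched as follows. The snippet assumes a tiny, made-up list of (word form, lemma) pairs standing in for an annotated corpus; the name count_types is hypothetical and the numbers have nothing to do with SYN2010.

```python
# Each token is a (word form, lemma) pair: one lemma per occurrence (1:1).
# The data below is a made-up miniature sample, not taken from a real corpus.
tokens = [
    ("lesy", "les"),
    ("lesích", "les"),
    ("chodil", "chodit"),
    ("chodíme", "chodit"),
    ("lesy", "les"),
]


def count_types(tokens):
    """Return (number of distinct word forms, number of distinct lemmas)."""
    forms = {form for form, _ in tokens}
    lemmas = {lemma for _, lemma in tokens}
    return len(forms), len(lemmas)


print(count_types(tokens))  # (4, 2)
```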

The relation between a lemma and meaning

A lemma should be the basic bearer of a unit's lexical meaning. This is why corpus-based dictionaries are compiled on the basis of lemmas. At the same time, increasing emphasis is placed on an approach which points out that meaning is closely linked to a morphologically defined form, and that the lemma, with its excessive abstraction, neglects some important semantic distinctions between forms.

Hyperlemmas and lemmatization of diachronic texts

Approaches to lemmatization can differ in selected cases. One such case is the processing of diachronic, dialectological or spoken corpora, where the need to group word forms under one unit can be influenced by criteria other than membership in a single morphological paradigm. However, it is always the case that a lemma is only a tool for more accessible searching, and not for the description or interpretation of language data.

For the diachronic corpus DIAKORP, lemmatization with the help of so-called hyperlemmas is planned. This will enable users to find all occurrences of a given lexeme regardless of its various forms (historical and orthographic variants).

Lemmatization

Lemmatization is part of automatic morphological (including part-of-speech) annotation. Its principle is the assignment of a lemma to a single word form (or to a group of word forms) in the corpus.

Lemmatization is typically part of the process of disambiguating word forms in a text on the basis of context. It is simple and context-independent if the lemmatized word form belongs to the paradigm of a single lexeme: for example, the verb form vytvoříme is assigned the lemma vytvořit as the representative form of the corresponding verbal lexeme, regardless of context.

Automatic lemmatization is difficult, by contrast, if the lemmatized word form is homonymous, i.e. if it belongs to the paradigms of more than one lexeme: for example, the form zvířenou belongs both to the paradigm of the adjectival lexeme zvířený and to the paradigm of the nominal lexeme zvířena. In this case, the lemmatization process decides on the basis of context which of the potential lemmas is assigned to the given form. For lexical homonyms whose morphological paradigms are identical, lemmatization sometimes distinguishes between the individual lexical meanings of the homonym, e.g. travička_1 vs. travička_2.
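
The contrast between context-independent and context-dependent lemmatization can be sketched roughly as below. This is only an illustration under strong assumptions: the candidate table, the set ASSUMED_NOUNS and the "look at the next word" rule are all hypothetical and hand-written for the zvířenou example; actual taggers decide between candidate lemmas with far richer contextual models.

```python
# Candidate lemmas per word form; a form with several candidates is homonymous.
CANDIDATES = {
    "vytvoříme": ["vytvořit"],            # unambiguous: context is not needed
    "zvířenou": ["zvířený", "zvířena"],   # adjective vs. noun reading
}

# Hypothetical context cue: words assumed to be nouns the adjective can modify.
ASSUMED_NOUNS = {"hladinu", "vodu"}


def lemmatize_in_context(tokens, i):
    """Pick a lemma for tokens[i], consulting the next word only when needed."""
    form = tokens[i].lower()
    candidates = CANDIDATES.get(form, [form])
    if len(candidates) == 1:
        return candidates[0]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else None
    if form == "zvířenou":
        # Toy rule: directly before an assumed noun, prefer the adjective.
        return "zvířený" if next_word in ASSUMED_NOUNS else "zvířena"
    return candidates[0]


print(lemmatize_in_context(["zvířenou", "hladinu"], 0))  # zvířený
print(lemmatize_in_context(["bohatou", "zvířenou"], 1))  # zvířena
```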

Problems with lemmatization

The lemmatization of multi-word expressions is a major linguistic and computational problem. Another unsolved problem in automatic lemmatization is that all forms are lemmatized under a single lemma even where this is not appropriate: for example, the polite request for permission to pass, Dovolíte?, is not reflected in any of the registered meanings of the word dovolit, because it is not part of purely morphological lemmatization. The situation is similar with idioms, where the form holičkách (from the idiom nechat na holičkách) cannot be lemmatized as holičky (a form which, moreover, does not exist at all).

The lemmatization process

Automatic lemmatization is carried out by a computer program called a lemmatizer, which is usually part of a morphological tagger that performs the morphological disambiguation of the text. The purpose of lemmatization is, on the one hand, to identify the appropriate lexeme for homonymous word forms in the given context and, on the other, to allow the user to work not only with word forms but also with lemmas as representatives of the corresponding lexemes and their paradigms, which makes working with the corpus considerably easier.
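
How lemmas make searching easier can be shown with a small sketch. It assumes a corpus already annotated as a list of (word form, lemma) pairs; the sample data and the function name find_by_lemma are hypothetical and do not reflect the query interface of any particular corpus tool.

```python
# A made-up annotated sample: each token is a (word form, lemma) pair.
ANNOTATED = [
    ("chodil", "chodit"),
    ("lesy", "les"),
    ("chodíme", "chodit"),
    ("lesích", "les"),
]


def find_by_lemma(corpus, lemma):
    """Return (position, word form) for every token annotated with the lemma."""
    return [
        (position, form)
        for position, (form, token_lemma) in enumerate(corpus)
        if token_lemma == lemma
    ]


# One query on the lemma retrieves all of its inflected forms.
print(find_by_lemma(ANNOTATED, "chodit"))  # [(0, 'chodil'), (2, 'chodíme')]
```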