Differences

This shows you the differences between two versions of the page.

--- en:pojmy:lemma [2016/12/09 21:44] – [Problems with lemmatization] veronikapojarova
+++ en:pojmy:lemma [2022/04/13 14:51] – jankrivan
@@ Line 9: / Line 9: @@
 The lemma as a unit originates from an abstraction of a [[en:pojmy:word|word form's]] morphological characteristics, and represents a set of forms which have the same root and differ only in their respective morphological affixes or orthographic form. In some approaches, selected morphological variants are also associated with the lemma.
+====== Sublemma ======
+Starting with the SYN2020 corpus, lemmatization in Czech corpora is two-tiered: each form is given a sublemma attribute in addition to the lemma attribute. While a lemma can group multiple variants of a single word (e.g. the lemma //filozof// represents all forms with both //filozof// and //filosof// stems), sublemmata delimit subgroups of forms according to this alternation (the sublemma //filozof// represents only forms with the stem //filozof//, the sublemma //filosof// represents only forms with the stem //filosof//). If the word has no variants, the sublemma is identical to the lemma (e.g. the lemma //kniha// represents the same set of forms as the sublemma //kniha//).
+Various types of variants are treated as sublemmata (e.g. //mýdlo/mejdlo//, //okno/vokno//, //citron/citrón//, //email/e-mail//, //myslet/myslit//, //mýt/mejt//, //péci/péct/píct//, //kuchyně/kuchyň//, //antivirus/antivir//, //sedm/sedum//, //tenhle/tendle/tenle//, //ačkoli/ačkoliv//, proper names //Robert/Róbert/Roberto//, //Atény/Athény//). Sublemmata are also used to differentiate some specific groups of forms that are included under one lemma (e.g. negated forms of adjectives and adverbs //černý/nečerný//, //hezky/nehezky//, nominal forms of adjectives //mladý/mlád//, suppletion //dobře/lépe/líp//, //člověk/lidé//).
 ===== The link between a lemma and lexeme =====
@@ Line 38: / Line 44: @@
 ==== The lemmatization process ====
-Automatickou lemmatizaci provádí počítačový program zvaný //lemmatizátor//, který bývá součástí morfologického [[pojmy:tag|taggeru]], provádějícího morfologickou [[pojmy:desambiguace|desambiguaci]] textu. Smyslem lemmatizace je jednak identifikovat v daném kontextu náležitý lexém u homonymních slovních tvarů, jednak umožnit uživateli pracovat nikoli jen se slovními tvary, nýbrž i s lemmaty jakožto reprezentanty příslušných lexémů a jejich paradigmat, což mu podstatně usnadňuje práci s korpusem.
+Automatic lemmatization is done by a computer program called a //lemmatizer//, which is often part of a morphological [[en:pojmy:tag|tagger]] carrying out the [[en:pojmy:desambiguace|disambiguation]] of the text. The purpose of lemmatization is firstly to identify the appropriate lexeme for homonymous word forms in a given context, and secondly to enable the user to work not only with word forms but also with lemmas as representatives of the given lexemes and their paradigms, which substantially facilitates work with the corpus.
 ==== Related links ====
 <WRAP round box 49%>
-[[en:pojmy:anotace|Annotation]] • [[en:pojmy:desambiguace|Disambiguation]] • [[en:pojmy:tag|Tags and tagging]] • [[en:pojmy:word|Word form
+[[en:pojmy:anotace|Annotation]] • [[en:pojmy:desambiguace|Disambiguation]] • [[en:pojmy:tag|Tags and tagging]] • [[en:pojmy:word|Word form]]
 </WRAP>
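
The two-tier lemma/sublemma annotation described in the added "Sublemma" section can be illustrated with a minimal Python sketch. The token list, the tuple layout `(form, lemma, sublemma)` and the function name `group_forms` are assumptions made for illustration only; they do not reflect the actual corpus format or any corpus tool's API.

```python
from collections import defaultdict

# Hypothetical annotated tokens: (word form, lemma, sublemma).
# The values follow the filozof/filosof and kniha examples in the text.
TOKENS = [
    ("filozof", "filozof", "filozof"),
    ("filozofa", "filozof", "filozof"),
    ("filosof", "filozof", "filosof"),
    ("filosofem", "filozof", "filosof"),
    ("kniha", "kniha", "kniha"),
    ("knihy", "kniha", "kniha"),
]


def group_forms(tokens):
    """Group word forms by lemma and by (lemma, sublemma) pair."""
    by_lemma = defaultdict(set)
    by_sublemma = defaultdict(set)
    for form, lemma, sublemma in tokens:
        # The lemma collects every variant of the word.
        by_lemma[lemma].add(form)
        # The sublemma narrows the set to one variant of the stem.
        by_sublemma[(lemma, sublemma)].add(form)
    return by_lemma, by_sublemma


by_lemma, by_sublemma = group_forms(TOKENS)
print(sorted(by_lemma["filozof"]))
print(sorted(by_sublemma[("filozof", "filosof")]))
```

Under these assumptions, a query on the lemma //filozof// returns forms of both variants, a query on the sublemma //filosof// returns only the forms with that stem, and for the non-variant word //kniha// both groupings contain the same forms.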
