Starting with the SYN2020 corpus, there is two-level lemmatization in Czech corpora: each form is given the sublemma attribute in addition to the lemma attribute. While a lemma may include multiple variants of a single word (e.g. the lemma //filozof// represents all forms with both //filozof// and //filosof// stems), sublemmata delimit subgroups of forms according to this alternation (the sublemma //filozof// represents only forms with the stem //filozof//, while the sublemma //filosof// represents only forms with the stem //filosof//). If the word has no variants, the sublemma is identical to the lemma (e.g. the lemma //kniha// represents the same set of forms as the sublemma //kniha//). | Starting with the SYN2020 corpus, Czech corpora feature two-level lemmatization: each form is given a sublemma attribute in addition to the lemma attribute. While a lemma may include multiple variants of a single word (e.g. the lemma //filozof// represents all forms with both //filozof// and //filosof// stems), sublemmas delimit subgroups of forms according to this alternation (the sublemma //filozof// represents only forms with the stem //filozof//, while the sublemma //filosof// represents only forms with the stem //filosof//). If the word has no variants, the sublemma is identical to the lemma (e.g. the lemma //kniha// represents the same set of forms as the sublemma //kniha//). |
Different types of variants are handled as sublemmata (e.g. //mýdlo/mejdlo//, //okno/vokno//, //citron/citrón//, //email/e-mail//, //myslet/myslit//, //mýt/mejt//, //péci/péct/píct//, //kuchyně/kuchyň//, //antivirus/antivir//, //sedm/sedum//, //tenhle/tendle/tenle//, //ačkoli/ačkoliv//, proper names //Robert/Róbert/Roberto//, //Atény/Athény//). Sublemmata are aldo used to distinguish some specific groups of forms that are included under one lemma (e.g. negated forms of adjectives and adverbs //černý/nečerný//, //hezky/nehezky//, short forms of adjectives //mladý/mlád//, suppletive forms //dobře/lépe/líp//, //člověk/lidé//). | Different types of variants are handled as sublemmas (e.g. //mýdlo/mejdlo//, //okno/vokno//, //citron/citrón//, //email/e-mail//, //myslet/myslit//, //mýt/mejt//, //péci/péct/píct//, //kuchyně/kuchyň//, //antivirus/antivir//, //sedm/sedum//, //tenhle/tendle/tenle//, //ačkoli/ačkoliv//, proper names //Robert/Róbert/Roberto//, //Atény/Athény//). Sublemmas are also used to distinguish some specific groups of forms that are subsumed under one lemma (e.g. negated forms of adjectives and adverbs //černý/nečerný//, //hezky/nehezky//, short forms of adjectives //mladý/mlád//, suppletive forms //dobře/lépe/líp//, //člověk/lidé//). |