Differences

This shows you the differences between two versions of the page.

en:cnk:anotacni_standard_cnk [2026/01/15 11:22] tomasjelinek
en:cnk:anotacni_standard_cnk [2026/01/16 12:22] (current) krivan
Line 5: Line 5:
  
 ==== Tokenization ====
-Numeric and punctuation characters are systematically separated as individual tokens (at the split point, the structure ''<g/>'' is annotated, preserving information about the flow of the original text). However, some combinations of characters remain together according to predefined rules and word lists (e.g., words like //česko-německý// Czech–German, //wi-fi// wi-fi, //r’n’b// R’n’B, //Jang-c’-ťiang// Yangtze, //CO2// CO2, //12letý// 12‑year‑old” are tokenized together). These principles are presented on the page [[cnk:syn2020:tokenizace|tokenization]] (in Czech).
+Numeric and punctuation characters are systematically separated as individual tokens (at the split point, the structure ''<g/>'' is annotated, preserving information about the flow of the original text). However, some combinations of characters remain together according to predefined rules and word lists (e.g., words like //česko-německý// Czech–German, //wi-fi// wi-fi, //r’n’b// R’n’B, //Jang-c’-ťiang// Yangtze, //CO2// CO2, //12letý// 12‑year‑old’ are tokenized together). These principles are presented on the page [[cnk:syn2020:tokenizace|tokenization]] (in Czech).
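The splitting behaviour described in this paragraph can be illustrated with a minimal Python sketch. It is not the CNC tokenizer: the exception list ''KEEP_TOGETHER'', the regular expression, and the function names are assumptions made for illustration, and the real rules and word lists are documented on the linked tokenization page.

```python
import re

# Hypothetical exception list (assumption): the real CNC word lists and rules
# are far more extensive than these few examples taken from the text.
KEEP_TOGETHER = ["česko-německý", "wi-fi", "CO2", "12letý"]

# One piece is a run of letters, a run of digits, or a single other character.
PIECE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def split_chunk(chunk):
    """Split one whitespace-free chunk into pieces, honouring the exceptions."""
    pieces, i = [], 0
    while i < len(chunk):
        for word in KEEP_TOGETHER:
            if chunk.startswith(word, i):
                pieces.append(word)
                i += len(word)
                break
        else:
            match = PIECE.match(chunk, i)
            pieces.append(match.group())
            i = match.end()
    return pieces


def tokenize(text):
    """Return tokens; "<g/>" marks a split where the original had no space."""
    tokens = []
    for chunk in text.split():
        for n, piece in enumerate(split_chunk(chunk)):
            if n > 0:
                tokens.append("<g/>")
            tokens.append(piece)
    return tokens


print(tokenize("Mám wi-fi, ne 12letý modem."))
```

In this sketch an exception matches as a plain prefix, so it does not check word boundaries. That is a simplification, not a statement about the real tokenizer.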
  
 ==== Tokenization, Lemmatization and Tagging of Multiword Tokens ====
-In the CNC annotation scheme, special treatment is given to groups of words such as //nač// for what, //pročs// why did you, or //kdybychom// if we would, which are written as one word but from the viewpoint of syntax or grammatical categories behave as two (rarely three) words. These words are tokenized as a single token, but for the purposes of morphological tagging, lemmatization (and in some corpora also syntactic tagging), they are treated as two or three words. Thus, these tokens receive two (or three) lemmas, sublemmas, tags, and verbtags according to their component parts.
+In the CNC annotation scheme, special treatment is given to groups of words such as //nač// for what, //pročs// why did you, or //kdybychom// if we would, which are written as one word but from the viewpoint of syntax or grammatical categories behave as two (rarely three) words. These words are tokenized as a single token, but for the purposes of morphological tagging, lemmatization (and in some corpora also syntactic tagging), they are treated as two or three words. Thus, these tokens receive two (or three) lemmas, sublemmas, tags, and verbtags according to their component parts.
-This concerns conditional conjunctions (//aby// so that, //kdyby// if), combinations with the clitic auxiliary //s// you are” (//dělalas// you (fem.) were doing, //viděls// you (masc.) saw, //komus// to whom you, //vždyťs// but you), combinations of prepositions with certain pronouns (//nač// for what, //očpak// what about, //zaň// for him), and some combinations of the above (//načs// for what you). Each such word is assigned two (or three) lemmas, sublemmas, tags and verbtags corresponding to their individual parts. More details on multiword tokens can be found at [[cnk:syn2020:agregat|multiword tokens]] (in Czech). A similar approach to such tokens is used in the [[https://universaldependencies.org/|Universal Dependencies]] standard.
+This concerns conditional conjunctions (//aby// so that, //kdyby// if), combinations with the clitic auxiliary //s// you are’ (//dělalas// you (fem.) were doing, //viděls// you (masc.) saw, //komus// to whom you, //vždyťs// but you), combinations of prepositions with certain pronouns (//nač// for what, //očpak// what about, //zaň// for him), and some combinations of the above (//načs// for what you). Each such word is assigned two (or three) lemmas, sublemmas, tags and verbtags corresponding to their individual parts. More details on multiword tokens can be found at [[cnk:syn2020:agregat|multiword tokens]] (in Czech). A similar approach to such tokens is used in the [[https://universaldependencies.org/|Universal Dependencies]] standard.
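One way to picture a multiword token is as a single word form carrying a list of component parts, each with its own annotation. The Python sketch below assumes hypothetical class and field names, and its lemma values are illustrative decompositions rather than values taken from a CNC corpus. Tags and verbtags are left out and would be further per-part fields.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Annotation of one component of a multiword token (hypothetical layout)."""
    lemma: str
    sublemma: str


@dataclass
class Token:
    """A single token as written, with one Part per syntactic word."""
    word: str
    parts: list = field(default_factory=list)

    def lemmas(self):
        return [part.lemma for part in self.parts]


# Illustrative values (assumption): preposition "na" plus pronoun "co",
# and conjunction "kdyby" plus the auxiliary "být".
nac = Token("nač", [Part("na", "na"), Part("co", "co")])
kdybychom = Token("kdybychom", [Part("kdyby", "kdyby"), Part("být", "být")])

print(nac.word, nac.lemmas())
print(kdybychom.word, kdybychom.lemmas())
```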
  
 ==== Lemmatization ====
-The annotation scheme uses two‑level lemmatization: each form has, in addition to the **lemma** attribute, an attribute **sublemma**. While a lemma groups multiple variants of a single word (e.g., lemma //filozofie// "philosophy" represents all forms with the stems //filozof// and //filosof//), sublemmas distinguish subsets of forms according to this variation (sublemma //filozofie// represents only forms with stem //filozof//, sublemma //filosofie// only forms with stem //filosof//). If a word is non‑variant, the sublemma is identical to the lemma (e.g., lemma //kniha// "book" represents the same set of forms as sublemma //kniha//).
+The annotation scheme uses two‑level lemmatization: each form has, in addition to the **[[en:pojmy:lemma|lemma]]** attribute, an attribute **sublemma**. While a lemma groups multiple variants of a single word (e.g., lemma //filozofie// "philosophy" represents all forms with the stems //filozof// and //filosof//), sublemmas distinguish subsets of forms according to this variation (sublemma //filozofie// represents only forms with stem //filozof//, sublemma //filosofie// only forms with stem //filosof//). If a word is non‑variant, the sublemma is identical to the lemma (e.g., lemma //kniha// "book" represents the same set of forms as sublemma //kniha//).
-As sublemmas, various types of variants are distinguished (e.g., //mýdlo/mejdlo// soap, //okno/vokno// window, //citron/citrón// lemon, //email/e-mail// email, //myslet/myslit// to think, //mýt/mejt// to wash, //péci/péct/píct// to bake, //kuchyně/kuchyň// kitchen, //antivirus/antivir// antivirus, //sedm/sedum// seven, //tenhle/tendle/tenle// this one, //ačkoli/ačkoliv// although, proper names //Robert/Róbert/Roberto//, //Atény/Athény// Athens) and they are also used to differentiate some specific groups of forms that are traditionally grouped under a single lemma (e.g., negated forms of adjectives and adverbs //černý/nečerný// black/not black, //hezky/nehezky// nicely/not nicely, nominal adjective forms //mladý/mlád// young, suppletive forms //dobře/lépe/líp// well/better/best, //člověk/lidé// person/people).
+As sublemmas, various types of variants are distinguished (e.g., //mýdlo/mejdlo// soap, //okno/vokno// window, //citron/citrón// lemon, //email/e-mail// email, //myslet/myslit// to think, //mýt/mejt// to wash, //péci/péct/píct// to bake, //kuchyně/kuchyň// kitchen, //antivirus/antivir// antivirus, //sedm/sedum// seven, //tenhle/tendle/tenle// this one, //ačkoli/ačkoliv// although, proper names //Robert/Róbert/Roberto//, //Atény/Athény// Athens) and they are also used to differentiate some specific groups of forms that are traditionally grouped under a single lemma (e.g., negated forms of adjectives and adverbs //černý/nečerný// black/not black, //hezky/nehezky// nicely/not nicely, nominal adjective forms //mladý/mlád// young, suppletive forms //dobře/lépe/líp// well/better/best, //člověk/lidé// person/people).
-A detailed description is provided on the page [[cnk:syn2020:lemmatizace]] (in Czech).
+A detailed description is provided on the page [[cnk:syn2020:lemmatizace|lemmatization]] (in Czech).
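The relation between the two levels can be sketched as a many-to-one mapping from sublemma to lemma. The Python example below uses only the //filozofie/filosofie// and //kniha// examples from the text. The dictionary and function names are hypothetical and do not reflect how the corpus data is actually stored.

```python
from collections import defaultdict

# Sublemma -> lemma, restricted to the examples given in the text.
SUBLEMMA_TO_LEMMA = {
    "filozofie": "filozofie",
    "filosofie": "filozofie",
    "kniha": "kniha",
}


def group_by_lemma(mapping):
    """Collect the set of sublemmas that each lemma covers."""
    groups = defaultdict(set)
    for sublemma, lemma in mapping.items():
        groups[lemma].add(sublemma)
    return dict(groups)


def is_variant(lemma, mapping):
    """A lemma is variant when it covers more than one sublemma."""
    return len(group_by_lemma(mapping).get(lemma, set())) > 1


print(group_by_lemma(SUBLEMMA_TO_LEMMA))
print(is_variant("filozofie", SUBLEMMA_TO_LEMMA), is_variant("kniha", SUBLEMMA_TO_LEMMA))
```

Querying by lemma then retrieves all spelling variants at once, while querying by sublemma narrows the result to one variant.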
  
 ==== Morphological Tagging (tag) ====
-The morphological **tag** in the uniform CNC annotation scheme has 15 positions. The tags are based on the tagging used in the Prague Dependency Treebank [[https://ufal.mff.cuni.cz/pdt-c/publications/TR_PDT_C_morph_manual.pdf|PDT‑C]], with several differences resulting from a different approach to several phenomena in CNC, different tokenization, etc. In particular, the part‑of‑speech categorization of some words and forms has been re‑evaluated (especially in numerals, predicatives, and nominal adjective forms), and further differences occur on the 2nd position (detailed part‑of‑speech specification). A detailed overview of morphological tagging is provided on the page [[seznamy:tagy#popis_jednotlivych_pozic_aktualni_morfologicke_znacky|morphological tags and their values]] (in Czech).
+The morphological **tag** in the uniform CNC annotation scheme has 15 positions. The tags are based on the tagging used in the Prague Dependency Treebank [[https://ufal.mff.cuni.cz/pdt-c/publications/TR_PDT_C_morph_manual.pdf|PDT‑C]], with several differences resulting from a different approach to several phenomena in CNC, different tokenization, etc. In particular, the part‑of‑speech categorization of some words and forms has been re‑evaluated (especially in numerals, predicatives, and nominal adjective forms), and further differences occur on the 2nd position (detailed part‑of‑speech specification). A detailed overview of morphological tagging is provided on the page [[seznamy:tagy#popis_jednotlivych_pozic_aktualni_morfologicke_znacky|morphological tags and their values]] (in Czech). A brief overview can be found [[en:pojmy:tag|here]].
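Because the tag is positional, a single category can be read by indexing into the 15-character string. The Python helper below is a minimal sketch under assumptions: the function name is hypothetical, and the example string is a PDT-style tag chosen only to show the layout. Actual CNC values at each position should be checked against the linked overview of tags.

```python
TAG_LENGTH = 15  # number of positions stated in the text


def tag_position(tag, position):
    """Return the character at a 1-based position of a positional tag."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    if not 1 <= position <= TAG_LENGTH:
        raise ValueError(f"position must be between 1 and {TAG_LENGTH}")
    return tag[position - 1]


# Illustrative PDT-style tag (assumption), not taken from a CNC corpus.
example = "NNFS1-----A----"
print(tag_position(example, 1))  # position 1: part of speech (per the PDT layout)
print(tag_position(example, 2))  # position 2: detailed part-of-speech specification
```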
  
 ==== Tagging of Verb Forms: The Verbtag Attribute ====
-A special attribute **verbtag** contains morphological information about the entire verbal form, regardless of whether the form is compound (//viděl jsem// I saw) or simple (//vidím// I see). In the verbtag, the auxiliary verb is distinguished from the main verb, and for each main verb form the categories of mood, voice, person, number, and tense (valid for the entire verbal form) are also included. The verbtag is provided for every token in the corpus, but it receives values only for verbs (and in one exceptional case for deverbal adjectives). A complete introduction is available on the page [[seznamy:verbtagy|verb category tags (verbtags) and their values]] (in Czech).
+A special attribute **verbtag** contains morphological information about the entire verbal form, regardless of whether the form is compound (//viděl jsem// I saw) or simple (//vidím// I see). In the verbtag, the auxiliary verb is distinguished from the main verb, and for each main verb form the categories of mood, voice, person, number, and tense (valid for the entire verbal form) are also included. The verbtag is provided for every token in the corpus, but it receives values only for verbs (and in one exceptional case for deverbal adjectives). A complete introduction is available on the page [[seznamy:verbtagy|verb category tags (verbtags) and their values]] (in Czech).