
Unified CNC Annotation Scheme (Tokenization, Lemmatization, Morphology)

The Czech National Corpus (CNC) uses a unified annotation scheme for morphological tagging and lemmatization in its synchronic written corpora (starting with SYN2020 and SYN_v9, followed by, e.g., the NET and ONLINE corpora) as well as in its spoken corpora (Ortofon_v3). The annotation standard covers tokenization (delimiting tokens in text), lemmatization (assigning basic dictionary forms to tokens), and morphological tagging, including special tags for verb forms.

Tokenization

Numeric and punctuation characters are systematically separated as individual tokens (at the split point, the structure <g/> is annotated, preserving information about the flow of the original text). However, some combinations of characters remain together according to predefined rules and word lists (e.g., words like česko-německý “Czech–German”, wi-fi “wi-fi”, r’n’b “R’n’B”, Jang-c’-ťiang “Yangtze”, CO2 “CO2”, 12letý “12‑year‑old” are tokenized together). These principles are presented on the page tokenization (in Czech).
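
The splitting behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the CNC tokenizer: the exception list KEEP_TOGETHER, the regular expression and the function names are assumptions made for this example, and the real rules and word lists are those described on the tokenization page.

```python
import re

# Hypothetical exception list: strings that stay together as one token.
# The items are examples from the text above; the real CNC lists are larger.
KEEP_TOGETHER = {"česko-německý", "wi-fi", "CO2", "12letý"}
SORTED_KEEP = sorted(KEEP_TOGETHER, key=len, reverse=True)  # longest first

# A run of letters, a run of digits, or a single other non-space character.
PIECE = re.compile(r"[^\W\d_]+|\d+|\S")

GLUE = "<g/>"  # marks a split point with no whitespace in the original text


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into token strings."""
    pieces = []
    pos = 0
    while pos < len(chunk):
        kept = next((w for w in SORTED_KEEP if chunk.startswith(w, pos)), None)
        if kept is not None:
            pieces.append(kept)
            pos += len(kept)
        else:
            match = PIECE.match(chunk, pos)  # always matches: no spaces here
            pieces.append(match.group())
            pos = match.end()
    return pieces


def tokenize(text):
    """Return tokens, with GLUE between tokens that were adjacent in the text."""
    out = []
    for chunk in text.split():
        for i, piece in enumerate(split_chunk(chunk)):
            if i > 0:
                out.append(GLUE)
            out.append(piece)
    return out


print(tokenize("Mám wi-fi, ne CO2."))
```

In this sketch the comma and the full stop become separate tokens preceded by <g/>, while wi-fi and CO2 stay whole because they appear in the exception list.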

Tokenization, Lemmatization and Tagging of Multiword Tokens

In the CNC annotation scheme, special treatment is given to words such as nač “for what”, pročs “why did you”, or kdybychom “if we would”, which are written as one word but behave as two (rarely three) words from the viewpoint of syntax or grammatical categories. These words are tokenized as a single token, but for the purposes of morphological tagging and lemmatization (and in some corpora also syntactic tagging), they are treated as two or three words. Each such token therefore receives two (or three) lemmas, sublemmas, tags, and verbtags, one for each of its component parts. This concerns conditional conjunctions (aby “so that”, kdyby “if”), combinations with the clitic auxiliary s “you are” (dělalas “you (fem.) were doing”, viděls “you (masc.) saw”, komus “to whom you”, vždyťs “but you”), combinations of prepositions with certain pronouns (nač “for what”, očpak “what about”, zaň “for him”), and some combinations of the above (načs “for what you”). More details on multiword tokens can be found at multiword tokens (in Czech). A similar approach to such tokens is used in the Universal Dependencies standard.
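
One way to picture a token that carries several lemmas is the following Python sketch. The Token class, its field names and the helper has_lemma are hypothetical and do not reflect the corpus storage format; the component lemmas shown for nač (na + co) follow the usual linguistic decomposition and are likewise an assumption here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    """One token with one lemma per component part (hypothetical model)."""
    word: str
    lemmas: Tuple[str, ...]

    @property
    def is_multiword(self) -> bool:
        return len(self.lemmas) > 1


def has_lemma(token: Token, lemma: str) -> bool:
    """True if any component part of the token has the given lemma."""
    return lemma in token.lemmas


tokens = [
    Token("nač", ("na", "co")),   # assumed decomposition: preposition + pronoun
    Token("kniha", ("kniha",)),   # ordinary single-word token
]

# A search for the lemma "co" also finds the multiword token "nač".
print([t.word for t in tokens if has_lemma(t, "co")])
```

The point of the sketch is that a query on one component still retrieves the whole token, even though the token itself is never split in the text.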

Lemmatization

The annotation scheme uses two‑level lemmatization: in addition to the lemma attribute, each form has a sublemma attribute. While a lemma groups multiple variants of a single word (e.g., lemma filozofie “philosophy” represents all forms with the stems filozof and filosof), sublemmas distinguish subsets of forms according to this variation (sublemma filozofie represents only forms with the stem filozof, sublemma filosofie only forms with the stem filosof). If a word has no variants, the sublemma is identical to the lemma (e.g., lemma kniha “book” represents the same set of forms as sublemma kniha). Sublemmas distinguish various types of variants (e.g., mýdlo/mejdlo “soap”, okno/vokno “window”, citron/citrón “lemon”, email/e-mail “email”, myslet/myslit “to think”, mýt/mejt “to wash”, péci/péct/píct “to bake”, kuchyně/kuchyň “kitchen”, antivirus/antivir “antivirus”, sedm/sedum “seven”, tenhle/tendle/tenle “this one”, ačkoli/ačkoliv “although”, proper names Robert/Róbert/Roberto, Atény/Athény “Athens”). They are also used to differentiate some specific groups of forms that are traditionally grouped under a single lemma (e.g., negated forms of adjectives and adverbs černý/nečerný “black/not black”, hezky/nehezky “nicely/not nicely”, nominal adjective forms mladý/mlád “young”, suppletive forms dobře/lépe/líp “well/better”, člověk/lidé “person/people”). A detailed description is provided on the page lemmatization (in Czech).
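
The relationship between the two levels can be illustrated with a small Python sketch that groups word forms first by lemma and then by sublemma. The token list and the function group_forms are made up for this illustration, using the filozofie/filosofie and kniha examples from the paragraph above.

```python
from collections import defaultdict

# Illustrative annotated tokens: (word form, lemma, sublemma).
TOKENS = [
    ("filozofie", "filozofie", "filozofie"),
    ("filozofii", "filozofie", "filozofie"),
    ("filosofii", "filozofie", "filosofie"),
    ("knihy", "kniha", "kniha"),
]


def group_forms(tokens, level):
    """Group word forms by the chosen attribute: 'lemma' or 'sublemma'."""
    index = {"lemma": 1, "sublemma": 2}[level]
    groups = defaultdict(set)
    for token in tokens:
        groups[token[index]].add(token[0])
    return dict(groups)


by_lemma = group_forms(TOKENS, "lemma")
by_sublemma = group_forms(TOKENS, "sublemma")

print(sorted(by_lemma["filozofie"]))     # forms of both spelling variants
print(sorted(by_sublemma["filosofie"]))  # only forms with the stem filosof
```

Searching by lemma gives the broader set, searching by sublemma narrows it to one variant, and for a word without variants such as kniha both levels return the same forms.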

Morphological Tagging (tag)

The morphological tag in the unified CNC annotation scheme has 15 positions. The tags are based on the tagging used in the Prague Dependency Treebank PDT‑C, with some differences that result from the CNC's different treatment of several phenomena, different tokenization, etc. In particular, the part‑of‑speech categorization of some words and forms has been re‑evaluated (especially in numerals, predicatives, and nominal adjective forms), and further differences occur in the 2nd position (detailed part‑of‑speech specification). A detailed overview of morphological tagging is provided on the page morphological tags and their values (in Czech). A brief overview in English can be found here.
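
Because the tag is positional, it can be read by index. The Python sketch below only relies on what is stated above: the tag has 15 positions, and the 2nd position holds the detailed part-of-speech specification (the 1st position is assumed to hold the main part of speech, as in PDT-style tags). The function names and the sample tag are illustrative assumptions; the meaning of the remaining positions and values should be taken from the tag documentation.

```python
TAG_LENGTH = 15  # number of positions stated for the unified CNC tag


def parse_tag(tag):
    """Map 1-based position numbers to the character at that position."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    return {position: value for position, value in enumerate(tag, start=1)}


def pos_and_detail(tag):
    """Return the values of position 1 and position 2 of the tag."""
    positions = parse_tag(tag)
    return positions[1], positions[2]


# A PDT-style tag used purely as an illustration of the layout.
sample = "NNFS1-----A----"
print(pos_and_detail(sample))
```

Validating the length first catches truncated or malformed tags before any position is interpreted.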

Tagging of Verb Forms: The Verbtag Attribute

A special attribute verbtag contains morphological information about the entire verbal form, regardless of whether the form is compound (viděl jsem “I saw”) or simple (vidím “I see”). In the verbtag, the auxiliary verb is distinguished from the main verb, and for each main verb form the categories of mood, voice, person, number, and tense (valid for the entire verbal form) are also included. The verbtag is provided for every token in the corpus, but it receives values only for verbs (and in one exceptional case for deverbal adjectives). A complete introduction is available on the page verb category tags (verbtags) and their values (in Czech).