| en:cnk:syn2025 [2026/01/16 11:54] – [Concept of synchronicity] michalkren | en:cnk:syn2025 [2026/01/19 11:01] (current) – [Annotation of SYN2025] tomasjelinek |
|---|
| * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2025, it is the period between 2020 and 2024). | * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2025, it is the period between 2020 and 2024). |
| |
| ===== Annotation of SYN2020: changes compared to other corpora of the SYN series ===== | ===== Annotation of SYN2025 ===== |
| |
| ==== Tokenization ==== | Morphological tagging, lemmatization, and tokenization of the SYN2025 corpus are performed fully automatically according to the [[en:cnk:anotacni_standard_cnk|unified CNC annotation scheme]], which was already applied to the SYN2020 corpus.\\ |
| | The corpus is also provided with [[en:pojmy:syntakticka_analyza|syntactic annotation]] containing a number of attributes that express syntactic relations between tokens (e.g. parent, p_tag) in a sentence and the syntactic functions of the tokens (afun). |
| |
| In the existing corpora of the SYN series, almost every combination of alphabetic characters, numeric characters and punctuation marks written without a space in the original text has so far been treated as one token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, such as the hyphen before the enclitic form //li// (//mohu - li//), have been split off as separate tokens. | |
| |
| In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified as separate tokens, but some combinations of characters remain unseparated according to predefined rules and word lists (e.g. words such as //česko-německý//, //wi-fi//, //r’n’b//, //Jang-c’-ťiang//, //CO2//, //12letý//). These principles are/will be presented on the //tokenization// page. | ====== How to cite SYN2025 ====== |
| |
| ==== Lemmatization ==== | <WRAP round tip 70%> |
| | Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Marklová, A. – Petkevič, V. – Skoumalová, H. – Škrabal, M.: //SYN2025: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2025. Dostupný z WWW: http://www.korpus.cz |
| |
| A fundamental change in the annotation of the SYN2020 corpus is the introduction of so-called **two-level lemmatization**: each word form is now assigned a **sublemma** attribute in addition to the **lemma** attribute. The lemma groups several variants of one word, as in the earlier corpora of the SYN series (e.g. the lemma //filozofie// represents all forms with the root //filozof// as well as //filosof//), whereas the sublemmas define subgroups of word forms with respect to this variability (the sublemma //filozofie// represents only word forms with the root //filozof//, the sublemma //filosofie// only those with the root //filosof//). If a word has no variants, the sublemma is identical to the lemma (e.g. the lemma //kniha// represents the same set of forms as the sublemma //kniha//). | Cvrček, V. – Čermáková, A. – Křen, M. (2016): Nová koncepce synchronních korpusů psané češtiny. //Slovo a slovesnost//, 77 (2), 83–101. |
| | |
| Different types of variants are treated as sublemmas (e.g. //mýdlo/mejdlo//, //okno/vokno//, //citron/citrón//, //email/e-mail//, //myslet/myslit//, //mýt/mejt//, //péci/péct/píct//, //kuchyně/kuchyň//, //antivirus/antivir//, //sedm/sedum//, //tenhle/tendle/tenle//, //ačkoli/ačkoliv//, proper names //Robert/Róbert/Roberto//, //Atény/Athény//). Sublemmas also distinguish some specific groups of forms that are traditionally covered by one lemma (e.g. negated forms of adjectives and adverbs //černý/nečerný//, //hezky/nehezky//, nominal forms of adjectives //mladý/mlád// and suppletive forms //dobře/lépe/líp//, //člověk/lidé//). | |
| | |
| In connection with these changes, the lemmatization was significantly refined compared to the previous corpora of the SYN series: many lemmas were corrected, and tens of thousands of additional lemmas are now recognized in the SYN2020 corpus. A detailed description of the changes is presented on the lemmatization page. | |
| | |
| | |
| ==== Morphological tagging ==== | |
| | |
| From the SYN2020 corpus onwards, each morphological tag has **15 positions** (instead of the previous 16). The annotation of verbal aspect has moved from the abolished 16th position to the previously unused 13th one; otherwise the tag structure is identical to that of the existing corpora of the SYN series. | |
| | |
| The annotation changes themselves concern the following three positions in the tag. In the **1st position** (part of speech), the values **F** (foreign word), **B** (abbreviation) and **S** (segment) are now distinguished. At the same time, the part-of-speech classification of some words and forms was re-evaluated (especially in the category of numerals, predicatives and nominal forms of adjectives). In the 2nd position (detailed part-of-speech specification), new values were introduced in connection with the new parts of speech and some others were removed. The subdivision of numerals has been substantially modified (e.g. the **z** value is now used for the numerals //sto//, //tisíc//, //milion//, originally tagged as nouns) and the value **0** was added to identify non-sentence-final punctuation. One change concerns the 15th position (variant): the value 8 (so far reserved for abbreviations) now codes another variant of colloquial Czech. | |
| | |
| The reliability of automatic lemmatization and morphological tagging of the SYN2020 corpus is significantly higher than was the case with previous corpora of the SYN series. | |
| | |
| A detailed overview of the changes is presented on the morphological tagging page. | |
| | |
| ==== Verb tagging (verbtag) ==== | |
| | |
| A newly introduced verbal tag (verbtag) contains morphological information about the whole verb form, regardless of whether it is a compound form (//viděl jsem//) or a simple one (//vidím//). The verbtag distinguishes auxiliary verbs from autosemantic ones and, for each autosemantic verb form, specifies the categories of mood, voice, person, number and tense (valid for the whole verb form). The verbtag is assigned to each token in the corpus, but it takes meaningful values only for verbs (and, with one exception, for deverbal adjectives). For the full presentation of the verbtag, see the verbtag page. | |
| | |
| | |
| ==== Multiple lemmatization and tagging (aggregate) ==== | |
| | |
| In the SYN2020 corpus, **multiple lemmas and tags** are newly introduced for a special group of words, so-called **aggregates** ("multiword tokens" in the [[https://universaldependencies.org/|Universal Dependencies]] terminology). Aggregates are words that are written as one orthographic word in Czech but behave as two (exceptionally three) words from the point of view of syntax or the specification of grammatical categories. They include conditional conjunctions (//aby//, //kdyby//), combinations of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), combinations of prepositions with some pronouns (//nač//, //očpak//, //zaň//), and combinations of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time, one for each of its parts. For detailed information on aggregates, see the aggregate page. | |
| | |
| ==== Automatic corpus annotation ==== | |
| For SYN2020, the entire annotation process is automatic. Its detailed description including the annotation accuracy and a rich bibliography to both the tools and data can be found on a [[cnk:syn2020:automaticka_anotace|dedicated page]] (Czech only). | |
| | |
| ====== How to cite SYN2020 ====== | |
| <WRAP round tip 70%> | |
| Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //SYN2020: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020. Dostupný z WWW: http://www.korpus.cz | |
| |
| Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): [[https://doi.org/10.1007/978-3-030-83527-9_4|SYN2020: A new corpus of Czech with an innovated annotation]]. In: K. Ekštein – F. Pártl – M. Konopík (eds.), //Text, Speech, and Dialogue.// TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59. | Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): [[https://doi.org/10.1007/978-3-030-83527-9_4|SYN2020: A new corpus of Czech with an innovated annotation]]. In: K. Ekštein – F. Pártl – M. Konopík (eds.), //Text, Speech, and Dialogue.// TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59. |
| |
| Křivan, J. – Šindlerová, J. (2022): [[http://sas.ujc.cas.cz/archiv.php?lang=en&art=4508|Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu]]. //Slovo a slovesnost//, 83, 2/2022, 122–145. | Křivan, J. – Šindlerová, J. (2022): [[https://asjournals.lib.cas.cz/slovoaslovesnost/article/uuid:286197ce-8b36-43ac-9563-eba2abf8ca0e|Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu]]. //Slovo a slovesnost//, 83 (2), 122–145. |
| |
| </WRAP> | </WRAP> |
| |
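
The positional tag layout described in the previous revision (15 positions, part of speech in the 1st, detailed part of speech in the 2nd, verbal aspect in the 13th, variant in the 15th) can be illustrated with a minimal Python sketch. The field names, the helper ''read_tag'' and the sample tag string are assumptions made for illustration only; they are not part of any CNC tool, and the sample string is a placeholder rather than a real corpus tag.

```python
# Position numbers (1-based) come from the text above; the field names
# are illustrative and do not come from any official CNC tool.
TAG_LENGTH = 15
POSITIONS = {"pos": 1, "subpos": 2, "aspect": 13, "variant": 15}


def read_tag(tag: str) -> dict[str, str]:
    """Return the characters found at the documented tag positions."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    # Convert the 1-based position numbers to 0-based string indices.
    return {name: tag[index - 1] for name, index in POSITIONS.items()}


# Placeholder tag (hypothetical, not a real corpus value): the letters only
# mark which position each field is read from.
fields = read_tag("ab----------c-d")
print(fields)
```

A 16-character tag from the older corpora of the SYN series would be rejected by the length check, which mirrors the change from 16 to 15 positions.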