AplikaceAplikace
Nastavení

This is an old revision of the document!


SYN2020 Corpus

The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel of the representative corpora of the SYN series (SYN2000, SYN2005, SYN2010, SYN2015), issued at five-year intervals, and covers the time period since 1989. Each of the SYN series corpora primarily covers the language of the last five years preceding its publication; thus, SYN2020 focuses on the 2015–2019 period. None of the texts in SYN2020 were included in another corpus of this series (the corpora are mutually disjoint). The SYN2020 corpus is lemmatized and morphologically tagged, and similarly to SYN2015, it is also syntactically annotated. However, there are a number of significant changes in the annotation that are described in a separate section below.

The design of SYN2020, its composition, text classification, and concept of synchronicity are fully compatible with SYN2015.

Name SYN2015
Positions Number of positions (tokens) 121 826 797
Number of positions (excl. punctuation) 100 031 037
Number of word forms 1 701 465
Number of lemmas 726 822
Structures Number of documents <doc> 3 910
Number of texts <text> 114 211
Number of paragraphs <p> 2 855 289
Number of sentences <s> 7 997 312
Further information Reference corpus YES
Representative corpus YES
Publication year 2020

Composition of SYN2020

Representativeness

SYN2020 contains a large spectrum of different types of texts in order to cover vast majority of varieties the corpus aims to represent. This corresponds to Biber’s notion of representativeness in terms of texts as products. The corpus is designed as representative, but not claimed to be balanced. Starting with SYN2015, the concept of writing was narrowed down only to the language printed and publicly published. Thus, SYN2020 does not contain, for example, inscriptions in public space, private letters, posters or other ephemerals, and it also does not include texts published only on the Internet (for these there are special corpora of Internet Czech, e.g. NET or ONLINE.

Text classification

The classification of texts in SYN2020 is based on external, non-text criteria and is hierarchical. The highest level is determined by the three already mentioned text macrotypes (txtype_groups): fiction, non-fiction and newspapers and magazines, each of which is represented by an equal amount of data (i.e. one-third) Another level of division is a txtype, which divides, for example, prose (novels alongside short stories), poetry and drama within fiction. The most fine-grained level of text classification is a genre, to which the general category genre_group is superior to texts of non-fiction (NFC) - this is how individual disciplines mathematics (MAT), technology (TEC) and information technology (ICT) are merged into the general group of formal and technical sciences (FTS).

Txtype_group Portion
FIC: fiction 33,33 %
NFC: non-fiction 33,33 %
NMG: newspapers and magazines 33,33 %

In line with its predecessors, SYN2020 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2020 are set arbitrarily, yet close to the original figures.

Next to the text type and genre, metadata related to the text classification and available for every document also include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/youth). Standard division of the newspapers into the individual articles is also supplemented by their separate classification into 13 sections (politics, economics, sports, culture, leisure, commentaries etc.) and information about the author that is available for all prominent newspaper titles.

A more detailed description of the text types contained within the macrogroups:

txtype genre / genre_group category proportion
Fiction (FIC) 33,33 %
NOV novels 26 %
COL short stories 5 %
VER poetry 1 %
SCR drama, screenplays 1 %
X other 0,33 %
Non-fiction (NFC) 33,33 %
SCI (scientific)

PRO (professional)

POP (popular)
HUM humanities 7 %
SSC social sciences 7 %
NAT natural sciences 7 %
FTS technical sciences 7 %
ITD interdisciplinary 1 %
MEM memoirs, autobiographies 4 %
ADM administrative texts 0,33 %
Newspapers and magazines (NMG) 33,33 %
NEW NTW nationawide newspapers – selected titles (MF, LN, HN, Právo) 10 %
NTW nationawide newspapers – other 5 %
REG regional newspapers 5 %
LEI leisure magazines 13,33 %

A detailed information about the text classification scheme is available here.

Concept of synchronicity

We are working under the assumption that a synchronic text is one that is still being read (or published), which is indicated by the year of publication. The boundaries of synchrony differ for each of the three macro groups:

  • for fiction it is 25 + 75, i.e. the time elapsed since the first publication is less than 75 years (approximately three living generations) and the given issue of the text being added to the corpus is no older than 25 years (ensuring reception in the present),
  • for non-fiction texts the first issue must be no older than 25 years,
  • the boundaries for the synchrony of newspapers and magazines remains unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2020 it is the period between 2015 and 2019).

The resulting makeup of the corpus in no. of words over the years is summarized by the following graph.

Proportion of fiction, non-fiction, newspapers and magazines in each year

FIXME

Changes with respect to other corpora of the SYN series

Tokenization

In the existing corpora of the SYN series, almost all combinations of alphabetic, numeric characters and punctuation marks that were written in the original texts without a space have so far been considered one token. Only punctuation marks at word boundaries (řekl , že) and some other combinations, such as the hyphen before the enclitic form li (mohu - li), have been tokenized in a separate way.

In SYN2020, the approach is opposite: numeric characters and punctuation marks are systematically identified as separate tokens, but some combinations of characters remain unseparated according to predefined rules and word lists (eg words such as česko-německý, wi-fi, r’n’b, Jang-c’-ťiang, CO2, 12letý). These principles are/will be presented on the tokenization page.

Lemmatization

A fundamental change in the annotation of the SYN2020 corpus is the introduction of the so-called two-level lemmatization: now each word form is assigned a sublemma attribute in addition to the lemma one. While the lemma associates several variants of one word in accordance with the earlier corpora of the SYN series (eg the filozofie lemma represents all the forms with the filozof and also filosof root), the sublemmas define subgroups of word forms with respect to this variability (sublemma filozofie represents only word forms with the filozof root, the filosofie sublemma only the forms associated with the filosof root). In case the word has no variants, the sublemma is identical to the lemma (eg the kniha lemma represents the same set of forms as the kniha sublemma).

Different types of variants are accounted for as sublemmas (eg mýdlo/mejdlo, okno/vokno, citron/citrón, email/e-mail, myslet/myslit, mýt/mejt, péci/péct/píct, kuchyně/kuchyň, antivirus/antivir, sedm/sedum, tenhle/tendle/tenle, ačkoli/ačkoliv, proper names Robert/Róbert/Roberto, Atény/Athény) and by means of these sublemmas some specific groups of forms are distinguished that are traditionally covered under one lemma (eg negated forms of adjectives and adverbs černý/nečerný, hezky/nehezky, nominal forms of adjectives mladý/mlád and suppletive forms dobře/lépe/líp, člověk/lidé).

In connection with these changes, the lemmatization was significantly refined compared to the previous corpora of the SYN series, many lemmas were corrected and other tens of thousands of lemmas are now recognizable in the SYN2020 corpus. A detailed description of the changes is presented on the lemmatization page.

Morphological tagging

From the SYN2020 corpus onwards, each morphological tag has 15 positions (instead of the previous 16 ones). The annotation of verbal aspect is transferred from the canceled 16th position to the originally unused 13th one, otherwise the tag structure is identical to the structure present in existing corpora of the SYN series.

The annotation changes themselves concern the following three positions in the tag. In the 1st position (part of speech), the values ​​F (foreign word), B (abbreviation) and S (segment) are now distinguished. At the same time, the part-of-speech classification of some words and forms was re-evaluated (especially in the category of ​​numerals, predicatives and nominal forms of adjectives). In the 2nd position (detailed part-of-speech specification), new values were introduced in connection with the new parts of speech and some other were removed. The subdivision of numerals has been substantially modified (eg. the z value is now used for the numerals sto, tisíc, milion originally tagged as nouns) and the value 0 was added in order to identify non sentence-final punctuation. One change concerns the 15th position (variant): number 8 (so far reserved for abbreviations) is now used as a value coding another variant of colloquial Czech.

The reliability of automatic lemmatization and morphological tagging of the SYN2020 corpus is significantly higher than was the case with previous corpora of the SYN series.

A detailed overview of the changes is presented on the morphological marking page.

Verb tagging (verbtag)

A newly introduced verbal tag (verbtag) contains morphological information about the whole verb form, regardless of whether it is a compound form (viděl jsem) or a simple one (vidím). In the verbtag, on the one hand, the auxiliary verb differs from the autosemantic one, and on the other hand, for each autosemantic verb form, the following categories of manner, voice, person, number and tense are specified (valid for the whole verb form). The verb tag is assigned to each token in the corpus, but it takes appropriate values ​​only for verbs (and with one exception for deverbal adjectives). For the full presentation of the verbtag, see the verbtag page.

Multiple lemmatization and tagging (aggregate)

In the SYN2020 corpus, multiple lemmas and tags for a special group of words, so-called aggregates, are newly introduced. Aggregates are words that are written as one orthographic word in Czech, but from the point of view of syntax or specification of grammatical categories they behave as two orthographic words (exceptionally three). The aggregates concern conditional conjunctions (aby, kdyby), the connection of words with the the enclitical form s (dělalas, viděls, komus, vždyťs), the connection of prepositions with some pronouns (nač, očpak, zaň), or a combination of words of the last two types (načs). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time according to their respective parts. For detailed information on aggregates, see the aggregate page.

How to cite SYN2020

FIXME

Related links