SYN2025 Corpus
The SYN2025 corpus is a synchronic, representative reference corpus of contemporary written Czech containing 100 million text words, or about 122 million tokens when punctuation is counted. It continues the series of representative SYN corpora (SYN2000, SYN2005, SYN2010, SYN2015, SYN2020), which are issued at five-year intervals and cover the period since 1989. Each corpus of the SYN series primarily covers the language of the five years preceding its publication; thus, SYN2025 focuses on the 2020–2024 period. No text in SYN2025 appears in any other corpus of the series (the corpora are mutually disjoint). The SYN2025 corpus is lemmatized and morphologically tagged and, like SYN2020, it is also syntactically annotated. However, the annotation has undergone a number of significant changes, which are described in a separate section below.
The SYN2025 corpus is based on the SYN2015 and SYN2020 corpora in terms of composition, text classification, and concept of synchronicity. There are only minor differences in a few parameters of the corpus composition which are indicated in this table.
| Category | Parameter | SYN2025 |
|---|---|---|
| Positions | Number of positions (tokens) | 122 072 831 |
| | Number of positions (excl. punctuation) | 100 006 172 |
| | Number of word forms | 1 678 186 |
| | Number of lemmas | 708 674 |
| Structures | Number of documents <doc> | 3 943 |
| | Number of texts <text> | 103 937 |
| | Number of paragraphs <p> | 2 776 291 |
| | Number of sentences <s> | 7 725 939 |
| Further information | Reference corpus | YES |
| | Representative corpus | YES |
| | Publication year | 2025 |
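The relation between the two position counts can be made concrete with a little arithmetic. The following is a minimal sketch that only uses the figures from the table above; the variable names are illustrative, and the derived averages are rough indicators rather than official corpus statistics.

```python
# Figures copied from the table above.
TOKENS = 122_072_831     # all positions, punctuation included
WORDS = 100_006_172      # positions excluding punctuation
SENTENCES = 7_725_939    # number of <s> structures
TEXTS = 103_937          # number of <text> structures

# Punctuation is the difference between the two position counts.
punctuation = TOKENS - WORDS
punctuation_share = punctuation / TOKENS

# Simple averages derived from the structure counts.
words_per_sentence = WORDS / SENTENCES
words_per_text = WORDS / TEXTS

print(f"punctuation tokens: {punctuation:,}")
print(f"punctuation share:  {punctuation_share:.1%}")
print(f"words per sentence: {words_per_sentence:.1f}")
print(f"words per text:     {words_per_text:.0f}")
```

This shows why the corpus is described as having 100 million words even though the token count is noticeably higher: roughly every fifth or sixth position is a punctuation mark.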
Composition of SYN2025
Representativeness
SYN2025 contains a broad spectrum of text types in order to cover the vast majority of the varieties the corpus aims to represent. This corresponds to Biber’s notion of representativeness in terms of texts as products. The corpus is designed to be representative, but it is not claimed to be balanced. Starting with SYN2015, the concept of written language was narrowed down to language that is printed and publicly published. Thus, SYN2025 does not contain, for example, inscriptions in public space, private letters, posters or other ephemera, and it also does not include texts published only on the Internet (for these there are special corpora of Internet Czech, e.g. NET or ONLINE).
Text classification
The classification of texts in SYN2025 is hierarchical and based on external, non-textual criteria. The highest level consists of three text macrotypes (txtype_group): fiction, non-fiction, and newspapers and magazines, each of which is represented by an equal amount of data (i.e. one-third). The next level is the txtype, which, for example, divides fiction into prose (novels alongside short stories), poetry and drama. The most fine-grained level of text classification is the genre. For non-fiction (NFC) texts, genres are additionally grouped under a more general category, genre_group; this is how the individual disciplines mathematics (MAT), technology (TEC) and information technology (ICT) are merged into the general group of formal and technical sciences (FTS).
| Txtype_group | Portion |
|---|---|
| FIC: fiction | 33.33 % |
| NFC: non-fiction | 33.33 % |
| NMG: newspapers and magazines | 33.33 % |
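The hierarchy described above can be pictured as a path from the most general to the most specific level. The sketch below is purely illustrative: the function `classification_path` and the dictionary `GENRE_GROUP` are hypothetical names, the mapping lists only the three genres named in the text, and the ordering of levels in the returned path is an assumption, not the corpus's actual data model.

```python
from typing import List, Optional

# Genre -> genre_group, limited to the codes named in the text.
# The full classification scheme contains more genres and groups.
GENRE_GROUP = {"MAT": "FTS", "TEC": "FTS", "ICT": "FTS"}


def classification_path(txtype_group: str, txtype: str,
                        genre: Optional[str] = None) -> List[str]:
    """Return the classification levels from most general to most specific."""
    path = [txtype_group, txtype]
    if genre is not None:
        # genre_group is defined only for non-fiction (NFC) texts.
        if txtype_group == "NFC" and genre in GENRE_GROUP:
            path.append(GENRE_GROUP[genre])
        path.append(genre)
    return path


print(classification_path("NFC", "SCI", "MAT"))  # ['NFC', 'SCI', 'FTS', 'MAT']
print(classification_path("FIC", "NOV"))         # ['FIC', 'NOV']
```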
In line with its predecessors, SYN2025 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2025 are in the table below.
In addition to text type and genre, the classification metadata available for every document include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/youth). Newspapers are divided into individual articles as usual; each article is additionally classified into one of 13 sections (politics, economics, sports, culture, leisure, commentaries etc.), and author information is available for all prominent newspaper titles.
A more detailed description of the text types contained within the macrogroups:
| txtype | genre / genre_group | category | proportion |
|---|---|---|---|
| Fiction (FIC) | | | 33.33 % |
| NOV | | novels | 28 % |
| COL | | short stories | 3 % |
| VER | | poetry | 1 % |
| SCR | | drama, screenplays | 1 % |
| X | | other | 0.33 % |
| Non-fiction (NFC) | | | 33.33 % |
| SCI (scientific) / PRO (professional) / POP (popular) | HUM | humanities | 7 % |
| SCI / PRO / POP | SSC | social sciences | 7 % |
| SCI / PRO / POP | NAT | natural sciences | 7 % |
| SCI / PRO / POP | FTS | technical sciences | 7 % |
| SCI / PRO / POP | ITD | interdisciplinary | 0.33 % |
| MEM | | memoirs, autobiographies | 4 % |
| ADM | | administrative texts | 1 % |
| Newspapers and magazines (NMG) | | | 33.33 % |
| NEW | NTW | nationwide newspapers – selected titles (MF, LN, HN, Právo) | 10 % |
| NEW | NTW | nationwide newspapers – other | 5 % |
| NEW | REG | regional newspapers | 5 % |
| LEI | | leisure magazines | 13.33 % |
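As a quick sanity check, the category proportions can be summed per macrotype and converted into nominal word counts. This is a minimal sketch: the percentages are copied from the table above, while the labels `NTW_selected` and `NTW_other`, the helper `target_words`, and the round figure of 100 million words are assumptions made for illustration. The resulting numbers are nominal targets, not the actual sizes of the categories in the corpus.

```python
from decimal import Decimal as D

# Percentages copied from the table; Decimal avoids float rounding noise.
PROPORTIONS = {
    "FIC": {"NOV": D("28"), "COL": D("3"), "VER": D("1"),
            "SCR": D("1"), "X": D("0.33")},
    "NFC": {"HUM": D("7"), "SSC": D("7"), "NAT": D("7"), "FTS": D("7"),
            "ITD": D("0.33"), "MEM": D("4"), "ADM": D("1")},
    "NMG": {"NTW_selected": D("10"), "NTW_other": D("5"),
            "REG": D("5"), "LEI": D("13.33")},
}

CORPUS_WORDS = 100_000_000  # nominal size, punctuation excluded (assumed round figure)


def target_words(percent: D) -> int:
    """Nominal number of words corresponding to a percentage of the corpus."""
    return int(CORPUS_WORDS * percent / 100)


# Sum of category percentages within each macrotype.
group_totals = {group: sum(cats.values()) for group, cats in PROPORTIONS.items()}

for group, cats in PROPORTIONS.items():
    print(f"{group}: {group_totals[group]} %")
    for category, percent in cats.items():
        print(f"  {category:<13}{percent:>6} %  ~{target_words(percent):,} words")
```

Each macrotype adds up to 33.33 %, which matches the one-third share stated in the table of macrotypes.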
Detailed information about the text classification scheme is available here.
Concept of synchronicity
We work under the assumption that a synchronic text is one that is still being read (or published), which is indicated by the year of publication. The boundaries of synchrony differ for each of the three macrotypes:
- for fiction it is 25 + 75, i.e. less than 75 years have elapsed since the first publication (approximately three living generations) and the particular edition of the text added to the corpus is no older than 25 years (ensuring reception in the present),
- for non-fiction texts, the first edition must be no older than 25 years,
- the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period mapped by the corpus (in the case of SYN2025, the period between 2020 and 2024).
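The three rules above can be expressed as a small predicate. The sketch below is an illustration under stated assumptions, not the procedure actually used to build the corpus: the function name `is_synchronic`, its parameters, and in particular the choice of 2024 as the reference year from which ages are measured are all hypothetical.

```python
def is_synchronic(txtype_group: str, first_published: int, edition_year: int,
                  reference_year: int = 2024,
                  period: tuple = (2020, 2024)) -> bool:
    """Check the synchronicity rule for one text.

    reference_year is an assumption (the last year of the covered period);
    the source text does not say from which year ages are measured.
    """
    if txtype_group == "FIC":
        # 25 + 75: first publication less than 75 years ago,
        # and the edition used no older than 25 years.
        return (reference_year - first_published < 75
                and reference_year - edition_year <= 25)
    if txtype_group == "NFC":
        # The first edition must be no older than 25 years.
        return reference_year - first_published <= 25
    if txtype_group == "NMG":
        # Published within the period mapped by the corpus.
        return period[0] <= edition_year <= period[1]
    raise ValueError(f"unknown txtype_group: {txtype_group!r}")


# A novel first published in 1960 and reissued in 2010 qualifies ...
print(is_synchronic("FIC", first_published=1960, edition_year=2010))
# ... while a non-fiction book first published in 1990 does not.
print(is_synchronic("NFC", first_published=1990, edition_year=2015))
```

The fiction rule is the only one that looks at both years, which is what allows older works that are still being reissued to enter the corpus.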
Annotation of SYN2025
Morphological tagging, lemmatization, and tokenization of the SYN2025 corpus are performed fully automatically according to the unified CNC annotation scheme, which was already applied to the SYN2020 corpus.
How to cite SYN2025
Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Marklová, A. – Petkevič, V. – Skoumalová, H. – Škrabal, M.: SYN2025: reprezentativní korpus psané češtiny. Ústav Českého národního korpusu FF UK, Praha 2025. Dostupný z WWW: http://www.korpus.cz
Cvrček, V. – Čermáková, A. – Křen, M. (2016): Nová koncepce synchronních korpusů psané češtiny. Slovo a slovesnost, 77 (2), 83–101.
Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): SYN2020: A new corpus of Czech with an innovated annotation. In: K. Ekštein – F. Pártl – M. Konopík (eds.), Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59.
Křivan, J. – Šindlerová, J. (2022): Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu. Slovo a slovesnost, 83 (2), 122–145.