en:cnk:syn2020 [2020/12/27 12:07] – [SYN2020 Corpus] michalkren | en:cnk:syn2020 [2022/06/09 13:36] (current) – [How to cite SYN2020] jankrivan |
---|
====== SYN2020 Corpus ====== | ====== SYN2020 Corpus ====== |
| |
The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel to the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2015|SYN2015]]), issued at five-year intervals, and covers the time period since 1989. Each of the SYN series corpora primarily covers the language of the last five years preceding its publication; thus, SYN2020 focuses on the 2015–2019 period. None of the texts in SYN2020 were included in another corpus of this series (the corpora are mutually disjoint). The SYN2020 corpus is lemmatized and morphologically tagged; like the SYN2015 corpus, it also contains syntactic annotation, but in comparison with the other corpora there are a number of changes: | The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel to the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2015|SYN2015]]), issued at five-year intervals, and covers the time period since 1989. Each of the SYN series corpora primarily covers the language of the last five years preceding its publication; thus, SYN2020 focuses on the 2015–2019 period. None of the texts in SYN2020 were included in another corpus of this series (the corpora are mutually disjoint). The SYN2020 corpus is lemmatized and morphologically tagged, and similarly to SYN2015, it is also syntactically annotated. However, there are a number of significant changes in the annotation that are described in a separate section below. |
| |
<WRAP round tip 70%> | <WRAP round tip 70%> |
The design of SYN2020, its composition, text classification, and synchrony are fully compatible with SYN2015. | The design of SYN2020, its composition, text classification, and concept of synchronicity are fully compatible with SYN2015. |
</WRAP> | </WRAP> |
| |
<WRAP right 35%> | <WRAP right 45%> |
^ <fs medium>Name</fs> ^^ <fs medium>SYN2015</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>SYN2015</fs> ^ |
^ Positions ^ Number of positions (tokens) | 121 826 797 | | ^ Positions ^ Number of positions (tokens) | 121 826 797 | |
| NFC: non-fiction | 33.33 % | | | NFC: non-fiction | 33.33 % | |
| NMG: newspapers and magazines | 33.33 % | | | NMG: newspapers and magazines | 33.33 % | |
| |
[{{:en:cnk:nfc-en.png?direct&400|Composition of non-fiction (NFC) part of the SYN2015}}] | |
[{{:en:cnk:roky-nmg-en.png?direct&400|Proportion of traditional and leisure journalism within the newspapers and magazines in each year}}] | |
FIXME | |
| |
| |
In line with its predecessors, SYN2020 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2020 are set arbitrarily, yet close to the original figures. | In line with its predecessors, SYN2020 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2020 are set arbitrarily, yet close to the original figures. |
* the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2020 it is the period between 2015 and 2019). | * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2020 it is the period between 2015 and 2019). |
| |
The resulting composition of the corpus by number of words over the years is summarized in the following graph. | ===== Annotation of SYN2020: changes compared to other corpora of the SYN series ===== |
| |
[{{:en:cnk:roky-en.png?direct&600|Proportion of fiction, non-fiction, newspapers and magazines in each year}}] | |
FIXME | |
| |
===== Changes with respect to other corpora of the SYN series ===== | |
| |
==== Tokenization ==== | ==== Tokenization ==== |
==== Multiple lemmatization and tagging (aggregate) ==== | ==== Multiple lemmatization and tagging (aggregate) ==== |
| |
In the SYN2020 corpus, **multiple lemmas and tags** for a special group of words, so-called **aggregates**, are newly introduced. Aggregates are words that are written as one orthographic word in Czech, but from the point of view of syntax or specification of grammatical categories they behave as two orthographic words (exceptionally three). The aggregates concern conditional conjunctions (//aby//, //kdyby//), the connection of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), the connection of prepositions with some pronouns (//nač//, //očpak//, //zaň//), or a combination of words of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time according to their respective parts. For detailed information on aggregates, see the aggregate page. | In the SYN2020 corpus, **multiple lemmas and tags** for a special group of words, so-called **aggregates** ("multiword tokens" in the [[https://universaldependencies.org/|Universal Dependencies]] terminology), are newly introduced. Aggregates are words that are written as one orthographic word in Czech, but from the point of view of syntax or specification of grammatical categories they behave as two orthographic words (exceptionally three). The aggregates concern conditional conjunctions (//aby//, //kdyby//), the connection of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), the connection of prepositions with some pronouns (//nač//, //očpak//, //zaň//), or a combination of words of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time according to their respective parts. For detailed information on aggregates, see the aggregate page. |
| |
| ==== Automatic corpus annotation ==== |
| For SYN2020, the entire annotation process is automatic. Its detailed description including the annotation accuracy and a rich bibliography to both the tools and data can be found on a [[cnk:syn2020:automaticka_anotace|dedicated page]] (Czech only). |
| |
====== How to cite SYN2020 ====== | ====== How to cite SYN2020 ====== |
FIXME | <WRAP round tip 70%> |
====== Related links ====== | Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //SYN2020: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020. Dostupný z WWW: http://www.korpus.cz |
| |
| Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): [[https://doi.org/10.1007/978-3-030-83527-9_4|SYN2020: A new corpus of Czech with an innovated annotation]]. In: K. Ekštein – F. Pártl – M. Konopík (eds.), //Text, Speech, and Dialogue.// TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59. |
| |
| Křivan, J. – Šindlerová, J. (2022): [[http://sas.ujc.cas.cz/archiv.php?lang=en&art=4508|Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu]]. //Slovo a slovesnost//, 83, 2/2022, 122–145. |
| |
<WRAP round box 49%> | |
[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2013PUB|SYN2013PUB]] • [[en:cnk:syn2015|SYN2015]] | |
</WRAP> | </WRAP> |
| |