Differences

This shows you the differences between two versions of the page.

en:cnk:syn2020 [2020/12/22 09:28] michalskrabal
en:cnk:syn2020 [2022/06/09 13:36] (current) – [How to cite SYN2020] jankrivan
Line 1: Line 1:
 ====== SYN2020 Corpus ======
  
-The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel of the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]]), issued at five-year intervals, and covers the time period since 1989. Each of the SYN series corpora primarily covers the language of the last five years preceding its publication; thus, SYN2020 focuses on the 2015–2019 period. None of the texts in SYN2020 were included in another corpus of this series (the corpora are mutually disjoint). The SYN2020 corpus is lemmatized and morphologically tagged, just as the SYN2015 corpus it also contains syntactic annotation, but in comparison with the other corpora there are a number of changes:
+The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel of the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2015|SYN2015]]), issued at five-year intervals, and covers the time period since 1989. Each of the SYN series corpora primarily covers the language of the last five years preceding its publication; thus, SYN2020 focuses on the 2015–2019 period. None of the texts in SYN2020 were included in another corpus of this series (the corpora are mutually disjoint). The SYN2020 corpus is lemmatized and morphologically tagged, and similarly to SYN2015, it is also syntactically annotated. However, there are a number of significant changes in the annotation that are described in a separate section below.
  
-<WRAP right 35%>
+<WRAP round tip 70%>
 +The design of SYN2020, its composition, text classification, and concept of synchronicity are fully compatible with SYN2015. 
 +</WRAP> 
 + 
 +<WRAP right 45%>
 ^ <fs medium>Name</fs> ^^ <fs medium>SYN2015</fs> ^
 ^ Positions ^ Number of positions (tokens) |  121 826 797 |
 ^ ::: ^ Number of positions (excl. punctuation) |  100 031 037 |
-^ ::: ^ Number of word forms |  1 751 599 |
+^ ::: ^ Number of word forms |  1 701 465 |
-^ ::: ^ Number of lemmas |  777 011 |
+^ ::: ^ Number of lemmas |  726 822 |
 ^ Structures ^ Number of documents <doc> |  3 910 |
 ^ ::: ^ Number of texts <text> |  114 211 |
Line 18: Line 22:
 </WRAP>
  
-====== Composition of the SYN2020 corpus ======
+====== Composition of SYN2020 ======
  
 ==== Representativeness ====
  
-SYN2020 contains a large spectrum of different types of texts in order to cover vast majority of varieties the corpus aims to represent. This corresponds to Biber’s notion of representativeness in terms of texts as products. The corpus is designed as representative, but not claimed to be balanced.
-In terms of [[pojmy:reprezentativnost|representativeness]], the composition of texts in SYN2020 is arbitrary: the three main [[pojmy:txtype_group|text macrotypes]], fiction (FIC), non-fiction (NFC) and newspapers and magazines (NMG), are represented in equal shares (i.e. one third each). The aim was to include the widest possible range of different types of public written (printed) texts, which as a whole represent contemporary written Czech; however, the corpus does not reflect the language population in precisely given proportions, i.e. the real ratio in which texts occur or are received. Note that, starting with SYN2015, the concept of writing was narrowed down to printed and publicly published language only; SYN2020 therefore does not contain, for example, inscriptions in public space, private letters, posters or other ephemera, nor does it include texts published only on the Internet (for these there are special corpora of Internet Czech, e.g. [[cnk:net|NET]] or [[cnk:online|ONLINE]]).
+SYN2020 contains a broad spectrum of text types in order to cover the vast majority of the varieties the corpus aims to represent. This corresponds to Biber’s notion of representativeness in terms of texts as products. The corpus is designed to be representative, but it is not claimed to be balanced. Starting with SYN2015, the concept of writing was narrowed down to language that is printed and publicly published. Thus, SYN2020 does not contain, for example, inscriptions in public space, private letters, posters or other ephemera, and it also does not include texts published only on the Internet (for these there are special corpora of Internet Czech, e.g. [[en:cnk:net|NET]] or [[en:cnk:online|ONLINE]]).
  
-==== Classification of texts ====
+==== Text classification ====
  
-The classification of texts in SYN2020 is based on external, non-text criteria and is hierarchical. The highest level is determined by the three already mentioned text macrotypes (''txtype_group''s): fiction, non-fiction and newspapers and magazines, each of which is represented by an equal amount of data (i.e. one-third) Another level of division is a ''txtype'', which divides, for example, prose (novels alongside short stories), poetry and drama within fiction. The most fine-grained level of text classification is the a ''genre'', to which the general category ''genre_group'' is superior to texts of non-fiction (NFC) - this is how individual disciplines mathematics (MAT), technology (TEC) and information technology (ICT) are merged into the general group of formal and technical sciences (FTS)
+The classification of texts in SYN2020 is based on external, non-text criteria and is hierarchical. The highest level is determined by the three already mentioned text macrotypes (''txtype_group''s): fiction, non-fiction and newspapers and magazines, each of which is represented by an equal amount of data (i.e. one-third). The next level of division is the ''txtype'', which distinguishes, for example, prose (novels alongside short stories), poetry and drama within fiction. The most fine-grained level of text classification is the ''genre''; for non-fiction (NFC) texts, genres are further grouped under the more general category ''genre_group''. This is how the individual disciplines mathematics (MAT), technology (TEC) and information technology (ICT) are merged into the general group of formal and technical sciences (FTS).
- +
-Details on composition and classification can be found here: [[en:cnk:klasifikace_textu_syn2015|Overview of text classification in SYN2015]].+
  
 ^ Txtype_group ^ Portion ^
Line 35: Line 36:
 | NFC: non-fiction |  33,33 % |
 | NMG: newspapers and magazines |  33,33 % |
- 
-[{{:en:cnk:nfc-en.png?direct&400|Composition of non-fiction (NFC) part of the SYN2015}}] 
-[{{:en:cnk:roky-nmg-en.png?direct&400|Proportion of traditional and leisure journalism within the newspapers and magazines in each year}}] 
- 
- 
-<WRAP clear></WRAP> 
  
 In line with its predecessors, SYN2020 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2020 are set arbitrarily, yet close to the original figures.
Line 46: Line 41:
 Next to the text type and genre, metadata related to the text classification and available for every document also include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/youth). Standard division of the newspapers into the individual articles is also supplemented by their separate classification into 13 sections (politics, economics, sports, culture, leisure, commentaries etc.) and information about the author that is available for all prominent newspaper titles.
  
-A more detailed description of the text types contained within the macro groups:
+A more detailed description of the text types contained within the macrogroups:
  
 ^  txtype  ^  genre / genre_group  ^  category  ^  proportion  ^
Line 79: Line 74:
   * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2020 it is the period between 2015 and 2019).
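
The synchrony criteria given on this page (the 25 + 75 strategy for fiction, a first edition within the last 25 years for non-fiction, and publication within the mapped period for newspapers and magazines) can be illustrated with a minimal sketch. The function name, its parameters and in particular ''reference_year'' are assumptions made for this example; the page does not state from which year the 25 and 75 years are counted, and this is not the procedure actually used to build the corpus.

```python
def is_synchronous(txtype_group, first_edition_year, edition_year,
                   period=(2015, 2019), reference_year=2019):
    """Check the synchrony criterion for one text (illustrative only).

    txtype_group       -- "FIC", "NFC" or "NMG"
    first_edition_year -- year of the first edition of the work
    edition_year       -- year of the edition considered for the corpus
    period             -- period mapped by the corpus (2015-2019 for SYN2020)
    reference_year     -- ASSUMPTION: year from which ages are counted
    """
    if txtype_group == "FIC":
        # 25 + 75: first edition at most 75 years old,
        # the edition itself at most 25 years old.
        return (reference_year - first_edition_year <= 75
                and reference_year - edition_year <= 25)
    if txtype_group == "NFC":
        # First edition within the last 25 years.
        return reference_year - first_edition_year <= 25
    if txtype_group == "NMG":
        # Published within the period mapped by the corpus.
        return period[0] <= edition_year <= period[1]
    raise ValueError(f"unknown txtype_group: {txtype_group!r}")
```

Under these assumptions, a novel first published in 1950 and reissued in 2000 would pass, while a newspaper article from 2014 would not.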
  
-The resulting makeup of the corpus in no. of words over the years is summarized by the following graph.
-
- [{{:en:cnk:roky-en.png?direct&600|Proportion of fiction, non-fiction, newspapers and magazines in each year}}]
-
-==== Positional annotation and tagging ====
-
-Compared to previous versions there have been improvements in [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:morfologicka_analyza|morphological tagging]]; both are almost identical to the processes used for the corpus [[en:cnk:syn2013pub|SYN2013PUB]], nonetheless SYN2015 was processed using the newest versions of all the tools (the improvements relate both to the morphological dictionary and to the rule-based [[en:pojmy:desambiguace|disambiguation]]). Furthermore, the lemmatization of punctuation marks has changed, preserving the form of the characters as much as possible.
-
-Last but not least, SYN2015 is the first CNC corpus featuring a **[[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_analyza|syntactic annotation]]**.
-
-
-
-The proportions within the individual macrogroups are summarized in the following graphs.
-
-[{{:cnk:syn2015-fic.png?direct&300|Text types in fiction}}]
-[{{:cnk:syn2015-nfc.png?direct&330|Text types in non-fiction}}]
-[{{:cnk:syn2015-nmg.png?direct&350|Text types in newspapers and magazines}}]
-
-==== Concept of synchrony ====
-
-[{{ :cnk:syn2015-roky.png?direct&600|Number of words by year of publication (not necessarily the first edition).}}]
-
-FIXME
-
-We assume that a text can be considered [[pojmy:synchronni|synchronous]] if it is still being read (or published), which in practice is indicated by the year of publication. However, the boundaries of synchrony differ for the three main macrogroups:
-
-  * for fiction, the 25 + 75 strategy applies, i.e. no more than 75 years have passed since the first edition (approximately three living generations) and the particular edition included in the corpus is not older than 25 years (ensuring contemporary reception),
-  * for non-fiction, the first edition must have appeared within the last 25 years,
-  * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period mapped by the given corpus (for SYN2020, the period from 2015 to 2019).
-
-The resulting composition of the corpus by number of words in individual years is summarized in the bar chart.
-
-===== Changes with respect to other corpora of the SYN series =====
+===== Annotation of SYN2020: changes compared to other corpora of the SYN series =====
  
 ==== Tokenization ====
Line 117: Line 80:
 In the existing corpora of the SYN series, almost all combinations of alphabetic, numeric characters and punctuation marks that were written in the original texts without a space have so far been considered one token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, such as the hyphen before the enclitic form //li// (//mohu - li//), have been tokenized in a separate way.
  
-In SYN2020, the approach is opposite: numeric characters and punctuation marks are systematically identified  as separate tokens, but some combinations of characters remain unseparated according to predefined rules and word lists (eg words such as //česko-německý//, //wi-fi//, //r’n’b//, //Jang-c’-ťiang//, //CO2//, //12letý//). These principles are/will be presented on the //tokenization of numeric and punctuation marks// page.
+In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified as separate tokens, but some combinations of characters remain unseparated according to predefined rules and word lists (e.g. words such as //česko-německý//, //wi-fi//, //r’n’b//, //Jang-c’-ťiang//, //CO2//, //12letý//). These principles are/will be presented on the //tokenization// page.
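
The SYN2020 principle described here (split off numerals and punctuation by default, keep listed combinations whole) can be sketched roughly as follows. This is only an illustration: the ''KEEP_WHOLE'' set contains just the examples quoted above, the function and pattern names are hypothetical, and the actual SYN2020 rules and word lists are not reproduced here.

```python
import re

# ASSUMPTION: a tiny stand-in for the predefined word lists; it holds only
# the example forms mentioned in the text.
KEEP_WHOLE = {"česko-německý", "wi-fi", "r’n’b", "Jang-c’-ťiang", "CO2", "12letý"}

# A run of letters, a run of digits, or any other single non-space character.
PIECE = re.compile(r"[^\W\d_]+|\d+|\S")


def tokenize(text):
    """Split text into tokens, separating numerals and punctuation by default."""
    tokens = []
    for chunk in text.split():
        # Peel punctuation off both edges so that "wi-fi," still matches the list.
        start, end = 0, len(chunk)
        while start < end and not chunk[start].isalnum():
            start += 1
        while end > start and not chunk[end - 1].isalnum():
            end -= 1
        lead, core, trail = chunk[:start], chunk[start:end], chunk[end:]
        tokens.extend(lead)  # every leading mark becomes its own token
        if core in KEEP_WHOLE:
            tokens.append(core)  # listed exception: keep as one token
        else:
            tokens.extend(PIECE.findall(core))
        tokens.extend(trail)  # every trailing mark becomes its own token
    return tokens
```

With this sketch, `tokenize("mohu-li, wi-fi a 12letý CO2 v roce 2019.")` yields `mohu`, `-`, `li`, `,`, `wi-fi`, `a`, `12letý`, `CO2`, `v`, `roce`, `2019`, `.`, whereas an unlisted form such as `H2O` is split into `H`, `2`, `O`.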
  
 ==== Lemmatization ====
Line 145: Line 108:
 ==== Multiple lemmatization and tagging (aggregate) ====
  
-In the SYN2020 corpus, **multiple lemmas and tags** for a special group of words, so-called **aggregates**, are newly introduced. Aggregates are words that are written as one orthographic word in Czech, but from the point of view of syntax or specification of grammatical categories they behave as two orthographic words (exceptionally three). The aggregates concern conditional conjunctions (//aby//, //kdyby//), the connection of words with the the enclitical form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), the connection of prepositions with some pronouns (//nač//, //očpak//, //zaň//), or a combination of words of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time according to their respective parts. For detailed information on aggregates, see the aggregate page.
+In the SYN2020 corpus, **multiple lemmas and tags** are newly introduced for a special group of words, the so-called **aggregates** ("multiword tokens" in the [[https://universaldependencies.org/|Universal Dependencies]] terminology). Aggregates are words that are written as one orthographic word in Czech, but from the point of view of syntax or specification of grammatical categories they behave as two orthographic words (exceptionally three). The aggregates concern conditional conjunctions (//aby//, //kdyby//), the connection of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), the connection of prepositions with some pronouns (//nač//, //očpak//, //zaň//), or a combination of words of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time according to their respective parts. For detailed information on aggregates, see the aggregate page.
 + 
 +==== Automatic corpus annotation ==== 
 +For SYN2020, the entire annotation process is automatic. Its detailed description including the annotation accuracy and a rich bibliography to both the tools and data can be found on a [[cnk:syn2020:automaticka_anotace|dedicated page]] (Czech only).
  
 ====== How to cite SYN2020 ======
 +<WRAP round tip 70%>
 +Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //SYN2020: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020. Dostupný z WWW: http://www.korpus.cz
  
-====== Related links ======
+Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): [[https://doi.org/10.1007/978-3-030-83527-9_4|SYN2020: A new corpus of Czech with an innovated annotation]]. In: K. Ekštein – F. Pártl – M. Konopík (eds.), //Text, Speech, and Dialogue.// TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59.
 + 
 +Křivan, J. – Šindlerová, J. (2022): [[http://sas.ujc.cas.cz/archiv.php?lang=en&art=4508|Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu]]. //Slovo a slovesnost//, 83, 2/2022, 122–145.
  
-<WRAP round box 49%> 
-[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2013PUB|SYN2013PUB]] • [[en:cnk:syn2015|SYN2015]]  
 </WRAP>
 +