| en:cnk:syn:verze13 [2024/12/23 13:53] – [Journalism in SYN version 13] michalkren | en:cnk:syn:verze13 [2026/01/23 11:49] (current) – [Structure and annotation of SYN version 13] krivan |
|---|
| <WRAP right 35%> | <WRAP right 35%> |
| ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 13</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 13</fs> ^ |
| ^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens | 6 238 142 297 | | ^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens | 6 400 899 055 | |
| ^ ::: ^ Number of tokens without punctuation | 5 174 701 189 | | ^ ::: ^ Number of tokens without punctuation | 5 310 635 949 | |
| ^ ::: ^ Number of [[en:pojmy:word|word forms]] | 11 384 712 | | ^ ::: ^ Number of [[en:pojmy:word|word forms]] | 11 522 926 | |
| ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] | 7 604 956 | | ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] | 7 655 932 | |
| ^ Structures ^ Number of documents | 144 755 | | ^ Structures ^ Number of documents | 151 076 | |
| ^ ::: ^ Number of texts | 18 965 216 | | ^ ::: ^ Number of texts | 19 363 730 | |
| ^ ::: ^ Number of sentences | 398 423 123 | | ^ ::: ^ Number of sentences | 408 749 819 | |
| ^ Other information ^ Referential | YES | | ^ Other information ^ Referential | YES | |
| ^ ::: ^ Representative | NO (predominantly journalism) | | ^ ::: ^ Representative | NO (predominantly journalism) | |
| ^ ::: ^ Publication year | 2023 | | ^ ::: ^ Publication year | 2024 | |
| </WRAP> | </WRAP> |
| |
| Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 13 therefore contains the [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013pub|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]] and [[en:cnk:syn2020|SYN2020]] corpora; additionally, it contains a journalistic component predominantly from 2010–2022 (already included in the [[en:cnk:syn:verze4|SYN version 4]] -- [[en:cnk:syn:verze12|SYN version 12]] corpora) and previously **unpublished journalistic texts from 2023** with a yearly volume of almost 150 million words. | Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 13 therefore contains the [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013pub|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]] and [[en:cnk:syn2020|SYN2020]] corpora; additionally, it contains a journalistic component predominantly from 2010–2022 (already included in the [[en:cnk:syn:verze4|SYN version 4]] -- [[en:cnk:syn:verze12|SYN version 12]] corpora) and previously **unpublished journalistic texts from 2023** with a yearly volume of more than 100 million words. |
| |
| The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from 2010--2023. | The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from 2010--2023. |
| The composition of the journalistic part of SYN version 13 covers the production of most of the national daily newspapers (//Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk, Sport//), regional daily newspapers (chiefly //Deníky Bohemia// and //Moravia// published by Vltava Labe Media) and non-specialized magazines (//Reflex, Respekt, Týden//) from 1998--2023; the total number of journalistic titles is almost 200. The following graphs show the composition of the SYN corpus based on the [[en:pojmy:txtype_group|main text types]] over the years and offer a closer look at the composition of the journalistic section. | The composition of the journalistic part of SYN version 13 covers the production of most of the national daily newspapers (//Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk, Sport//), regional daily newspapers (chiefly //Deníky Bohemia// and //Moravia// published by Vltava Labe Media) and non-specialized magazines (//Reflex, Respekt, Týden//) from 1998--2023; the total number of journalistic titles is almost 200. The following graphs show the composition of the SYN corpus based on the [[en:pojmy:txtype_group|main text types]] over the years and offer a closer look at the composition of the journalistic section. |
| |
| [{{cnk:syn:slozeni_syn_v13.png?400|Composition of SYN version 13}}] | [{{:cnk:syn:slozeni_syn_v13.png?400|Composition of SYN version 13}}] |
| |
| [{{cnk:syn:slozeni_syn_v13_pub.png?400|Composition of the journalistic part of SYN version 13}}] | [{{:cnk:syn:slozeni_syn_v13_pub.png?400|Composition of the journalistic part of SYN version 13}}] |
| |
| ====== Structure and annotation of SYN version 13 ====== | ====== Structure and annotation of SYN version 13 ====== |
| |
| Generally speaking, the structure and annotation of SYN version 13 are based on those of the SYN2020 corpus. In particular, the hierarchy of structural tags for SYN version 13 has been taken over from SYN2020, as well as the [[en:cnk:syn2020#annotation_of_syn2020changes_compared_to_other_corpora_of_the_syn_series|lemmatization and morphological tagging]]. In this respect, SYN version 13 is the same as its predecessor, [[en:cnk:syn:verze12|SYN version 12]]. | Generally speaking, the structure and annotation of SYN version 13 are based on those of the SYN2020 corpus. The hierarchy of structural tags for SYN version 13 has been taken over from SYN2020. Morphological tagging, lemmatization, and tokenization of the corpus are performed fully automatically according to the [[en:cnk:anotacni_standard_cnk|unified CNC annotation scheme]]. In this respect, SYN version 13 is the same as its predecessor, [[en:cnk:syn:verze12|SYN version 12]]. |
| |
| The correspondence of structure and annotation between SYN version 13 and [[en:cnk:syn2020|SYN2020]] has only the following exceptions: | The correspondence of structure and annotation between SYN version 13 and [[en:cnk:syn2020|SYN2020]] has only the following exceptions: |
| |
| <WRAP round tip 70%> | <WRAP round tip 70%> |
| Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //Corpus SYN, version 13 from 29. 12. 2024//. Ústav Českého národního korpusu FF UK, Praha 2024. Available online: https://www.korpus.cz. | Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //Corpus SYN, version 13 from 27. 12. 2024//. Ústav Českého národního korpusu FF UK, Praha 2024. Available online: https://www.korpus.cz. |
| |
| |