Differences

This shows you the differences between two versions of the page.

en:cnk:syn:verze13 [2024/12/23 13:53] – [Journalism in SYN version 13] michalkren
en:cnk:syn:verze13 [2026/01/23 11:49] (current) – [Structure and annotation of SYN version 13] krivan
Line 4: Line 4:
 <WRAP right 35%> <WRAP right 35%>
 ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 13</fs> ^ ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 13</fs> ^
-^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens |  6 238 142 297 |   +^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens |  6 400 899 055 |   
-^ ::: ^ Number of tokens without punctuation  |  5 174 701 189 |   +^ ::: ^ Number of tokens without punctuation  |  5 310 635 949 |   
-^ ::: ^ Number of [[en:pojmy:word|word forms]]  |  11 384 712 |   +^ ::: ^ Number of [[en:pojmy:word|word forms]]  |  11 522 926 |   
-^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  7 604 956 +^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  7 655 932 
-^ Structures ^ Number of documents |  144 755 +^ Structures ^ Number of documents |  151 076 
-^ ::: ^ Number of texts |  18 965 216 +^ ::: ^ Number of texts |  19 363 730 
-^ ::: ^ Number of sentences |  398 423 123 |+^ ::: ^ Number of sentences |  408 749 819 |
 ^ Other information ^ Referential |  YES |   ^ Other information ^ Referential |  YES |  
 ^ ::: ^ Representative |  NO (predominantly journalism) |   ^ ::: ^ Representative |  NO (predominantly journalism) |  
-^ ::: ^ Publication year |  2023 |+^ ::: ^ Publication year |  2024 |
 </WRAP> </WRAP>
  
-Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 13 therefore contains the [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]] and [[en:cnk:syn2020|SYN2020]] corpora; additionally, it contains a journalistic component predominantly from 2010–2022 (already included into [[en:cnk:syn:verze4|SYN version 4]] -- [[en:cnk:syn:verze12|SYN version 12]]) corpora, and as yet **unpublished journalistic texts from 2023** in yearly volume almost 150 mil. words.+Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 13 therefore contains the [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]] and [[en:cnk:syn2020|SYN2020]] corpora; additionally, it contains a journalistic component predominantly from 2010–2022 (already included into [[en:cnk:syn:verze4|SYN version 4]] -- [[en:cnk:syn:verze12|SYN version 12]]) corpora, and as yet **unpublished journalistic texts from 2023** in yearly volume of more than 100 mil. words.
  
 The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from 2010--2023. The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from 2010--2023.
Line 45: Line 45:
 ====== Structure and annotation of SYN version 13 ====== ====== Structure and annotation of SYN version 13 ======
  
-Generally speaking, structure and annotation of SYN version 13 are based on that of the SYN2020 corpus. In particular, hierarchy of structural tags for SYN version 13 has been taken over from SYN2020, as well as the [[en:cnk:syn2020#annotation_of_syn2020changes_compared_to_other_corpora_of_the_syn_series|lemmatization and morphological tagging]]. In this respect, SYN version 13 is the same as its predecessor, [[en:cnk:syn:verze12|SYN version 12]].+Generally speaking, structure and annotation of SYN version 13 are based on that of the SYN2020 corpus. Hierarchy of structural tags for SYN version 13 has been taken over from SYN2020. Morphological tagging, lemmatization, and tokenization of the corpus are performed fully automatically according to the [[en:cnk:anotacni_standard_cnk|unified CNC annotation scheme]]. In this respect, SYN version 13 is the same as its predecessor, [[en:cnk:syn:verze12|SYN version 12]].
  
 The correspondence of structure and annotation between SYN version 13 and [[en:cnk:syn2020|SYN2020]] only has the following exceptions: The correspondence of structure and annotation between SYN version 13 and [[en:cnk:syn2020|SYN2020]] only has the following exceptions:
Line 54: Line 54:
  
 <WRAP round tip 70%> <WRAP round tip 70%>
-Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //Corpus SYN, version 13 from 29. 12. 2024//. Ústav Českého národního korpusu FF UK, Praha 2024. Available online: https://www.korpus.cz.+Křen, M. – Cvrček, V. – Čapka, T. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //Corpus SYN, version 13 from 27. 12. 2024//. Ústav Českého národního korpusu FF UK, Praha 2024. Available online: https://www.korpus.cz.