Differences

This shows you the differences between two versions of the page.

en:cnk:syn:verze7 [2018/12/20 12:37]
Michal Škrabal
en:cnk:syn:verze7 [2018/12/20 13:48]
Michal Křen [Corpus SYN version 7]
Line 5: Line 5:
 ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 7</fs> ^
 ^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens |  5 100 437 261 |
-^ ::: ^ Number of tokens without punctuation  |  4 033 268 842 |
+^ ::: ^ Number of tokens without punctuation  |  4 255 216 412 |
 ^ ::: ^ Number of [[en:pojmy:word|word forms]]  |  11 632 632 |
 ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  8 360 795 |
-^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350  |
+^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350 |
 ^ ::: ^ Number of [[en:pojmy:atributy_strukturni|texts]]|  16 377 839 |
 ^ ::: ^ Number of sentences |  325 540 933 |
Line 16: Line 16:
 </WRAP>
  
-Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume more than 265 mil. words.
+Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume almost 200 mil. words.
  
-Because all of these corpora are **disjunctive** (i.e. they do not contain the same texts), the total size of the SYN version 7 is given by their sum, which makes 4.033 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017.
+Because all of these corpora are **disjunctive** (i.e. they do not contain the same texts), the total size of the SYN version 7 is given by their sum, which makes 4.255 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017.
  
 The SYN version 7 corpus is [[en:pojmy:referencni|referential]], and will remain accessible to users even after newer versions have been published. It is however necessary to keep in mind that the linguistic information will become outdated, as a natural result of the referential nature of the corpus. Individual versions of the corpus SYN will continue to be published regularly every year with the addition of current journalistic data, and every new addition will be given the attribute value ''<doc syn>'' equal to the version of the SYN corpus in which the given text first appeared; for example a [[en:pojmy:subkorpus|subcorpus]] corresponding to the above mentioned (as yet unpublished) journalistic component can be [[en:manualy:kontext:subkorpus#vytvoreni_noveho_subkorpusu|created]] from the SYN version 7 with the help of the condition ''syn=<nowiki>"</nowiki>v7<nowiki>"</nowiki>''.
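
As a quick sanity check on the figures touched by this revision, the following minimal Python sketch recomputes what the corrected count implies. The numbers are copied from the table rows above; the helper name `parse_count` and the variable names are purely illustrative and are not part of the wiki or of any corpus tooling.

```python
# Illustrative arithmetic only: all figures are copied from the table in this diff.

def parse_count(text: str) -> int:
    """Convert a space-grouped figure such as '5 100 437 261' to an int."""
    return int("".join(text.split()))

tokens_total = parse_count("5 100 437 261")      # Number of tokens
words_old = parse_count("4 033 268 842")         # tokens without punctuation, old revision
words_new = parse_count("4 255 216 412")         # tokens without punctuation, new revision

# How much the corrected figure differs from the previous one.
correction = words_new - words_old

# Tokens that are punctuation according to the corrected figure.
punctuation_tokens = tokens_total - words_new
punctuation_share = punctuation_tokens / tokens_total

print(f"correction: {correction:,}")
print(f"punctuation tokens: {punctuation_tokens:,} ({punctuation_share:.1%} of all tokens)")
print(f"size in billions of words: {words_new / 1e9:.3f}")
```

The last line reproduces the rounded "4.255 billion words" quoted in the revised paragraph, so the table row and the prose agree after this change.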