Differences

This shows you the differences between two versions of the page.

en:cnk:syn:verze7 [2018/12/20 13:00] Michal Škrabal
en:cnk:syn:verze7 [2018/12/20 13:48] Michal Křen [Corpus SYN version 7]
Line 8:
 ^ ::: ^ Number of [[en:pojmy:word|word forms]]  |  11 632 632 |
 ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  8 360 795 |
-^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350  |
+^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350 |
 ^ ::: ^ Number of [[en:pojmy:atributy_strukturni|texts]]|  16 377 839 |
 ^ ::: ^ Number of sentences |  325 540 933 |
Line 16:
 </WRAP>
 
-Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume more than 265 mil. words.
+Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume almost 200 mil. words.
 
 Because all of these corpora are **disjunctive** (i.e. they do not contain the same texts), the total size of the SYN version 7 is given by their sum, which makes 4.255 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017.