Differences

This shows you the differences between two versions of the page.

en:cnk:syn:verze7 [2018/12/20 13:00]
Michal Škrabal
en:cnk:syn:verze7 [2018/12/20 13:50] (current)
Michal Křen [How to cite SYN version 7]
Line 8: Line 8:
 ^ ::: ^ Number of [[en:pojmy:word|word forms]] |  11 632 632 |
 ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  8 360 795 |
-^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350  |
+^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] |  106 350 |
 ^ ::: ^ Number of [[en:pojmy:atributy_strukturni|texts]]|  16 377 839 |
 ^ ::: ^ Number of sentences |  325 540 933 |
Line 16: Line 16:
 </​WRAP>​ </​WRAP>​
  
-Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume more than 265 mil. words.
+Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up until the time of the given version's publication. The corpus SYN version 7 therefore contains the corpora  [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]],[[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; additionally, it contains a journalistic component predominantly from the years 2010–2014 (already included into [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and as yet **unpublished journalistic texts from 2017** in yearly volume almost 200 mil. words.
  
 Because all of these corpora are **disjunctive** (i.e. they do not contain the same texts), the total size of the SYN version 7 is given by their sum, which makes 4.255 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; the dominant component is journalism, which is the result of the predominance of journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017.
Line 52: Line 52:
  
 <WRAP round tip 70%> <WRAP round tip 70%>
-Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A.: Corpus SYN, version 7 from 29. 11. 2018. Ústav Českého národního korpusu FF UK, Praha 2018. Available online: http://www.korpus.cz.
+Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A.: //Corpus SYN, version 7 from 29. 11. 2018//. Ústav Českého národního korpusu FF UK, Praha 2018. Available online: http://www.korpus.cz.