en:cnk:syn:verze7 [2018/12/20 12:37] – michalskrabal | en:cnk:syn:verze7 [2018/12/20 13:50] (current) – [How to cite SYN version 7] michalkren |
---|
^ <fs medium>Name</fs> ^^ <fs medium>SYN version 7</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>SYN version 7</fs> ^ |
^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens | 5 100 437 261 | | ^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens | 5 100 437 261 | |
^ ::: ^ Number of tokens without punctuation | 4 033 268 842 | | ^ ::: ^ Number of tokens without punctuation | 4 255 216 412 | |
^ ::: ^ Number of [[en:pojmy:word|word forms]] | 11 632 632 | | ^ ::: ^ Number of [[en:pojmy:word|word forms]] | 11 632 632 | |
^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] | 8 360 795 | | ^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] | 8 360 795 | |
^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] | 106 350 | | ^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:doc|documents]] | 106 350 | |
^ ::: ^ Number of [[en:pojmy:atributy_strukturni|texts]]| 16 377 839 | | ^ ::: ^ Number of [[en:pojmy:atributy_strukturni|texts]]| 16 377 839 | |
^ ::: ^ Number of sentences | 325 540 933 | | ^ ::: ^ Number of sentences | 325 540 933 | |
</WRAP> | </WRAP> |
| |
Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up to the release of the given version. SYN version 7 therefore contains the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; in addition, it contains a journalistic component predominantly from the years 2010–2014 (already included in [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and previously **unpublished journalistic texts from 2017** with a yearly volume of more than 265 million words. | Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series published up to the release of the given version. SYN version 7 therefore contains the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013pub|SYN2013PUB]] and [[cnk:syn2015|SYN2015]]; in addition, it contains a journalistic component predominantly from the years 2010–2014 (already included in [[en:cnk:syn:verze4|SYN version 4]], [[en:cnk:syn:verze5|SYN version 5]] and [[en:cnk:syn:verze6|SYN version 6]]) and previously **unpublished journalistic texts from 2017** with a yearly volume of almost 200 million words. |
| |
Because all of these corpora are **disjoint** (i.e. they do not share any texts), the total size of SYN version 7 is the sum of their sizes, which amounts to 4.033 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; journalism is the dominant component, owing to the predominance of the journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017. | Because all of these corpora are **disjoint** (i.e. they do not share any texts), the total size of SYN version 7 is the sum of their sizes, which amounts to 4.255 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]; journalism is the dominant component, owing to the predominance of the journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2013pub|SYN2013PUB]] and the journalistic component from the years 2010--2017. |
| |
The SYN version 7 corpus is [[en:pojmy:referencni|referential]], and will remain accessible to users even after newer versions have been published. It is, however, necessary to keep in mind that the linguistic information will become outdated, as a natural result of the referential nature of the corpus. New versions of the SYN corpus will continue to be published every year with the addition of current journalistic data, and every newly added text will be given a value of the attribute ''<doc syn>'' equal to the version of the SYN corpus in which the text first appeared; for example, a [[en:pojmy:subkorpus|subcorpus]] corresponding to the above-mentioned (previously unpublished) journalistic component can be [[en:manualy:kontext:subkorpus#vytvoreni_noveho_subkorpusu|created]] from SYN version 7 using the condition ''syn=<nowiki>"</nowiki>v7<nowiki>"</nowiki>''. | The SYN version 7 corpus is [[en:pojmy:referencni|referential]], and will remain accessible to users even after newer versions have been published. It is, however, necessary to keep in mind that the linguistic information will become outdated, as a natural result of the referential nature of the corpus. New versions of the SYN corpus will continue to be published every year with the addition of current journalistic data, and every newly added text will be given a value of the attribute ''<doc syn>'' equal to the version of the SYN corpus in which the text first appeared; for example, a [[en:pojmy:subkorpus|subcorpus]] corresponding to the above-mentioned (previously unpublished) journalistic component can be [[en:manualy:kontext:subkorpus#vytvoreni_noveho_subkorpusu|created]] from SYN version 7 using the condition ''syn=<nowiki>"</nowiki>v7<nowiki>"</nowiki>''. |
| |
<WRAP round tip 70%> | <WRAP round tip 70%> |
Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A.: Corpus SYN, version 7 from 29. 11. 2018. Ústav Českého národního korpusu FF UK, Praha 2018. Available online: http://www.korpus.cz. | Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A.: //Corpus SYN, version 7 from 29. 11. 2018//. Ústav Českého národního korpusu FF UK, Praha 2018. Available online: http://www.korpus.cz. |
| |
| |