~~NOTOC~~
====== Corpus SYN version 3 ======

<WRAP right 35%>
^ <fs medium>Name</fs> ^^ <fs medium>SYN version 3</fs> ^
^ [[pojmy:atributy_pozicni|Position]] ^ Number of tokens |  2 685 127 310 |  
^ ::: ^ Number of tokens without punctuation |  2 231 541 041 |  
^ ::: ^ Number of [[en:pojmy:word|word forms]] |  7 604 328 |  
^ ::: ^ Number of [[en:pojmy:lemma|lemmas]] |  5 170 696 |
^ [[en:pojmy:atributy_strukturni|Structures]] ^ Number of [[en:pojmy:opus|opuses]] |  49 882 |
^ ::: ^ Number of [[en:pojmy:atributy_strukturni|documents]] |  9 163 021 |
^ ::: ^ Number of sentences |  178 499 972 |
^ Other information ^ [[en:pojmy:referencni|Referential]] |  YES |  
^ ::: ^ [[en:pojmy:reprezentativnost|Representative]] |  NO (predominantly journalism) |  
^ ::: ^ Publication year |  2014 |
</WRAP>

Every **SYN corpus** contains all the [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the [[en:cnk:syn|SYN]] series that had been published by the time the given version was released. SYN version 3 therefore contains the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]].

Because all of these corpora are **disjoint** (i.e. no text appears in more than one of them), the total size of SYN version 3 is simply the sum of their sizes, which comes to 2.232 billion words ([[en:pojmy:token|tokens]] without punctuation). The SYN corpus is not [[en:pojmy:reprezentativnost|representative]]: journalism is its dominant component, because the journalistic corpora [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]] and [[en:cnk:syn2013pub|SYN2013PUB]] are far larger than the others.

The SYN version 3 corpus is [[en:pojmy:referencni|referential]], and it will remain accessible to users even after newer versions have been published. Keep in mind, however, that a referential corpus is never updated, so the linguistic information it contains will gradually become outdated.

====== The composition of the SYN version 3 corpus ======

^ <fs medium>Referential written language corpora (synchronic and general) ordered by date of creation</fs> ^^^^^^
^ corpus ^ size (words) ^ [[en:pojmy:lemma|lemmatization]] ^ [[en:pojmy:tag|morphological tags]] ^ publication year ^ corpus description ^
^ [[en:cnk:syn2013PUB|SYN2013PUB]] | 935 mil. |  ✓  |  ✓  |  2013  | corpus of journalistic texts from the years 2005–2009 |
^ [[en:cnk:syn2010|SYN2010]] | 100 mil. |  ✓  |  ✓  |  2010  | representative corpus, mainly texts from the years 2005–2009 |
^ [[en:cnk:syn2009PUB|SYN2009PUB]] | 700 mil. |  ✓  |  ✓  |  2010  | corpus of journalistic texts from the years 1995–2007 |
^ [[en:cnk:syn2006PUB|SYN2006PUB]] | 300 mil. |  ✓  |  ✓  |  2006  | corpus of journalistic texts from the years 1989–2004 |
^ [[en:cnk:syn2005|SYN2005]] | 100 mil. |  ✓  |  ✓  |  2005  | representative corpus, mainly texts from the years 2000–2004 |
^ [[en:cnk:syn2000|SYN2000]] | 100 mil. |  ✓  |  ✓  |  2000  | representative corpus, mainly texts from the years 1990–1999 |
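
The size arithmetic can be checked with a short script. The following is only an illustrative sketch: the variable names are made up for this example, the component sizes are the rounded values from the table above, and the exact counts are copied from the summary box at the top of the page. Because the table values are rounded, their sum is expected to differ slightly from the exact total. The share computed at the end covers only the three journalistic (PUB) corpora; journalism contained in the other corpora is not counted.

```python
# Rounded component sizes in millions of words, copied from the table above.
component_sizes_mil = {
    "SYN2013PUB": 935,
    "SYN2010": 100,
    "SYN2009PUB": 700,
    "SYN2006PUB": 300,
    "SYN2005": 100,
    "SYN2000": 100,
}

# Exact counts copied from the summary box of SYN version 3.
TOKENS_TOTAL = 2_685_127_310
TOKENS_WITHOUT_PUNCTUATION = 2_231_541_041

# The component corpora share no texts, so their sizes can simply be added.
rounded_sum_mil = sum(component_sizes_mil.values())
exact_mil = TOKENS_WITHOUT_PUNCTUATION / 1_000_000

# Punctuation tokens are the difference between the two token counts.
punctuation_tokens = TOKENS_TOTAL - TOKENS_WITHOUT_PUNCTUATION

# Share of the total contributed by the purely journalistic (PUB) corpora.
journalistic_mil = sum(
    size for name, size in component_sizes_mil.items() if name.endswith("PUB")
)
journalistic_share = journalistic_mil / rounded_sum_mil

print(f"Sum of rounded component sizes: {rounded_sum_mil} mil. words")
print(f"Exact size without punctuation: {exact_mil:.1f} mil. words")
print(f"Punctuation tokens: {punctuation_tokens:,}")
print(f"Share of the PUB corpora: {journalistic_share:.1%}")
```

The small gap between the rounded sum and the exact figure comes only from the rounding of the individual corpus sizes in the table.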

The journalistic part of SYN version 3 covers the production of most of the national daily newspapers (Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk) and non-specialized magazines (Reflex, Respekt, Týden) from the years 1998–2009. A table of the 15 titles most represented in the journalistic part of the corpus, broken down by year, can be downloaded below; the numbers are in millions of words, i.e. positions not counting punctuation. The graph that follows gives a preview of the composition of the journalistic part.

{{:cnk:slozeni_syn_v3.ods|Composition of the journalism part of SYN version 3}}
[{{:cnk:slozeni_syn_v3.png?400|Preview of the composition of the journalism part of SYN version 3}}]

====== How to cite SYN version 3 ======

<WRAP round tip 70%>
Křen, M. – Čermák, F. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Novotná, R. – Petkevič, V. – Procházka, P. – Schmiedtová, V. – Skoumalová, H. – Šulc, M.: //Corpus SYN, version 3 from 27. 1. 2014//. Ústav Českého národního korpusu FF UK, Praha 2014. Available online: http://www.korpus.cz

Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): [[http://www.lrec-conf.org/proceedings/lrec2014/pdf/294_Paper.pdf|The SYN-series corpora of written Czech]]. In //Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)//, 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4. 
</WRAP>


 --- //Michal Křen, Olga Richterová//

====== Related links ======
<WRAP round box 50%>
[[en:cnk:syn|SYN]] • [[en:cnk:syn:verze4|SYN version 4]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]] • [[en:cnk:syn2015|SYN2015]]
</WRAP>