Corpus SYN2010
SYN2010 is a synchronic representative corpus of written Czech comprising about 100 million word tokens (punctuation excluded). It follows the corpora SYN2000 and SYN2005, and together the three form a series of synchronic representative corpora covering three successive periods. The corpora contain different texts and are therefore disjoint. The basic characteristics of SYN2010 are identical to those of SYN2005, chiefly because both rest on the same conception of representativeness, based on the reception of written language, and on the corpus composition that follows from it. The SYN2010 corpus is lemmatized and morphologically tagged.
| Name | SYN2010 | |
|---|---|---|
| Positions | Number of positions (tokens) | 121 667 413 |
| | Number of positions (tokens) without punctuation | 101 219 603 |
| | Number of word forms (words) | 1 706 345 |
| | Number of lemmata | 785 580 |
| Structural attributes | Number of opera | 2 649 |
| | Number of documents | 152 634 |
| | Number of sentences | 8 172 649 |
| Further information | Reference | YES |
| | Representative | YES |
| | Publication date | 2010 |
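
The counts in the table can be combined into a few derived figures, such as the share of punctuation among all positions or the average sentence length. The following is a minimal sketch in Python that uses only the numbers listed above; the function name `derived_stats` and the choice of derived figures are illustrative assumptions, not part of the official corpus documentation.

```python
# Counts copied from the SYN2010 table above.
TOKENS_ALL = 121_667_413       # positions including punctuation
TOKENS_NO_PUNCT = 101_219_603  # positions excluding punctuation
SENTENCES = 8_172_649
DOCUMENTS = 152_634


def derived_stats() -> dict:
    """Return a few simple figures derived from the table (illustrative only)."""
    punctuation = TOKENS_ALL - TOKENS_NO_PUNCT
    return {
        # Number of positions that are punctuation marks.
        "punctuation_positions": punctuation,
        # Fraction of all positions that are punctuation.
        "punctuation_share": punctuation / TOKENS_ALL,
        # Average number of non-punctuation positions per sentence.
        "mean_sentence_length": TOKENS_NO_PUNCT / SENTENCES,
        # Average number of sentences per document.
        "mean_sentences_per_document": SENTENCES / DOCUMENTS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,.3f}")
```

By these figures, roughly one position in six is a punctuation mark, and an average sentence contains about twelve word tokens.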
Changes compared to the SYN2005 corpus
Compared with SYN2005, lemmatization and morphological tagging in SYN2010 have been significantly improved; both are essentially identical to the processing used for the SYN2009PUB corpus. Therefore, although SYN2005 and SYN2010 share the same understanding of representativeness, these processing differences should be taken into account when comparing lexical frequencies between the two corpora.
Composition of SYN2010
Some of the fiction texts may have been published earlier, but as a general rule the corpus consists mainly of newer texts, and the proportion of older texts is decreasing.
The general composition of SYN2010
More detailed information about the genre composition of the SYN2010 corpus is available in the CNC’s interactive graph.
Composition of the journalistic texts
The basic characteristics of the SYN2010 corpus are identical to those of SYN2005, especially the concept of representativeness based on the reception of written language and the resulting composition of the corpus. All newspaper and magazine texts included in SYN2010 were published in 2005–2009, with each year equally represented, just as in SYN2005. Naturally, the proportions of individual newspaper and magazine titles have changed. However, the criteria that define a synchronic text in both fiction and professional literature remain unchanged; the SYN2010 corpus thus includes only professional texts published after 1989.
Structure of the SYN2010 corpus
Among the structural units used in this corpus are `<opus>`, `<doc>` and `<s>` (the text, document and sentence), followed by each individual position. They can be displayed using the menu item View options.
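
To illustrate how these units nest, here is a minimal Python sketch that counts structures and positions in a small sample. It assumes a "vertical" text layout, with one position per line and each structure tag on its own line. That layout is an assumption made for illustration, not something this page states about how SYN2010 is stored. The sample text, the attribute values and the function name `count_units` are likewise hypothetical.

```python
import re
from collections import Counter

# Matches an opening structure tag that stands alone on a line, e.g. <doc id="x">.
OPEN_TAG = re.compile(r"^<([A-Za-z][\w-]*)(\s[^>]*)?>$")


def count_units(vertical_text: str) -> Counter:
    """Count opening structure tags and positions in vertical-format text."""
    counts = Counter()
    for line in vertical_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        match = OPEN_TAG.match(line)
        if match:
            counts[match.group(1)] += 1   # structure such as opus, doc or s
        elif not line.startswith("</"):
            counts["position"] += 1       # any other non-closing line is a position
    return counts


# Hypothetical sample: one opus containing one document with two sentences.
SAMPLE = """\
<opus id="hypothetical-opus">
<doc id="hypothetical-doc">
<s>
Toto
je
věta
.
</s>
<s>
Druhá
věta
.
</s>
</doc>
</opus>
"""

if __name__ == "__main__":
    for unit, n in count_units(SAMPLE).items():
        print(unit, n)
```

In this sketch punctuation marks count as positions, which mirrors the distinction the table draws between positions with and without punctuation.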
— Michal Křen, Olga Richterová
How to cite SYN2010
Křen, M. – Bartoň, T. – Cvrček, V. – Hnátková, M. – Jelínek, T. – Kocek, J. – Novotná, R. – Petkevič, V. – Procházka, P. – Schmiedtová, V. – Skoumalová, H.: SYN2010: žánrově vyvážený korpus psané češtiny. Ústav Českého národního korpusu FF UK, Praha 2010. Available on-line: http://www.korpus.cz
Related links
SYN • SYN2000 • SYN2005 • SYN2006PUB • SYN2009PUB