Corpus SYN version 3
Category | Property | Value |
---|---|---|
Name | | SYN version 3 |
Positions | Number of tokens | 2 685 127 310 |
| Number of tokens without punctuation | 2 231 541 041 |
| Number of word forms | 7 604 328 |
| Number of lemmas | 5 170 696 |
Structures | Number of opuses | 49 882 |
| Number of documents | 9 163 021 |
| Number of sentences | 178 499 972 |
Other information | Referential | YES |
| Representative | NO (predominantly journalism) |
| Publication year | 2014 |
Every SYN corpus contains all the synchronic written corpora of the SYN series published up until the time of the given version's publication. The corpus SYN version 3 therefore contains the corpora SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010 and SYN2013PUB.
Because all of these corpora are disjoint (i.e. no text appears in more than one of them), the total size of SYN version 3 is the sum of their sizes: 2.232 billion words (tokens without punctuation). The SYN corpus is not representative; journalism is its dominant component because the journalistic corpora SYN2006PUB, SYN2009PUB and SYN2013PUB predominate.
The SYN version 3 corpus is referential and will remain accessible to users even after newer versions have been published. Keep in mind, however, that its linguistic information will become outdated, which is a natural consequence of the referential nature of the corpus.
The composition of the SYN version 3 corpus
Referential written language corpora (synchronic and general), ordered by date of creation (newest first):

corpus | size (words) | lemmatization | morphological tags | publication year | corpus description |
---|---|---|---|---|---|
SYN2013PUB | 935 mil. | YES | YES | 2013 | corpus of journalistic texts from the years 2005–2009 |
SYN2010 | 100 mil. | YES | YES | 2010 | representative corpus, mainly texts from the years 2005–2009 |
SYN2009PUB | 700 mil. | YES | YES | 2010 | corpus of journalistic texts from the years 1995–2007 |
SYN2006PUB | 300 mil. | YES | YES | 2006 | corpus of journalistic texts from the years 1989–2004 |
SYN2005 | 100 mil. | YES | YES | 2005 | representative corpus, mainly texts from the years 2000–2004 |
SYN2000 | 100 mil. | YES | YES | 2000 | representative corpus, mainly texts from the years 1990–1999 |
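The relationship between the component sizes and the total can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the sizes are the rounded values from the table above, the exact total is the "tokens without punctuation" figure from the summary table, and the names `COMPONENT_SIZES_MIL`, `total_mil` and `journalistic_share` are made up for this example.

```python
# Rounded component sizes in millions of words, copied from the composition table.
COMPONENT_SIZES_MIL = {
    "SYN2013PUB": 935,
    "SYN2010": 100,
    "SYN2009PUB": 700,
    "SYN2006PUB": 300,
    "SYN2005": 100,
    "SYN2000": 100,
}

# Exact size of SYN version 3 (tokens without punctuation) from the summary table.
EXACT_WORDS = 2_231_541_041

# The three journalistic components named in the text.
JOURNALISTIC = {"SYN2006PUB", "SYN2009PUB", "SYN2013PUB"}


def total_mil(sizes):
    """Sum the component sizes; valid because the corpora share no texts."""
    return sum(sizes.values())


def journalistic_share(sizes):
    """Fraction of the (rounded) total contributed by the journalistic corpora."""
    pub = sum(size for name, size in sizes.items() if name in JOURNALISTIC)
    return pub / total_mil(sizes)


if __name__ == "__main__":
    approx = total_mil(COMPONENT_SIZES_MIL)
    print(f"sum of rounded sizes: {approx} mil. words")
    print(f"exact size:           {EXACT_WORDS / 1e6:.1f} mil. words")
    print(f"journalistic share:   {journalistic_share(COMPONENT_SIZES_MIL):.1%}")
```

The rounded sizes add up to 2 235 million words, within about 0.2 % of the exact figure, and the three journalistic corpora account for roughly 87 % of that sum, which is why the corpus as a whole is not representative.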
The journalistic part of the corpus SYN version 3 covers the output of most of the national daily newspapers (Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk) and non-specialized magazines (Reflex, Respekt, Týden) from the years 1998–2009. A table of the 15 titles most represented in the journalistic part of the corpus SYN version 3, broken down by individual years, can be downloaded below; the numbers are in millions of words, i.e. positions not counting punctuation. A preview of the composition of the journalistic part can be seen in the following graph.
How to cite SYN version 3
Křen, M. – Čermák, F. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Novotná, R. – Petkevič, V. – Procházka, P. – Schmiedtová, V. – Skoumalová, H. – Šulc, M.: Korpus SYN, verze 3 z 27. 1. 2014. Ústav Českého národního korpusu FF UK, Praha 2014. Available online: http://www.korpus.cz
Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): The SYN-series corpora of written Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.
— Michal Křen, Olga Richterová
Related links
SYN • SYN version 4 • SYN2000 • SYN2005 • SYN2006PUB • SYN2009PUB • SYN2010 • SYN2013PUB • SYN2015