Corpus SYN version 3
Group | Item | Value |
---|---|---|
Name | | SYN version 3 |
Positions | Number of tokens | 2 685 127 310 |
Positions | Number of tokens without punctuation | 2 231 541 041 |
Positions | Number of word forms | 7 604 328 |
Positions | Number of lemmas | 5 170 696 |
Structures | Number of opuses | 49 882 |
Structures | Number of documents | 9 163 021 |
Structures | Number of sentences | 178 499 972 |
Other information | Referential | YES |
Other information | Representative | NO (predominantly journalism) |
Other information | Publication year | 2014 |
Every SYN corpus contains all the synchronic written corpora of the SYN series published up until the time of the given version's publication. The corpus SYN version 3 therefore contains the corpora SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010 and SYN2013PUB.
Because all of these corpora are disjoint (i.e. they do not share any texts), the total size of SYN version 3 is the sum of their sizes, which comes to 2.232 billion words (tokens without punctuation). The SYN corpus is not representative; journalism is the dominant component because of the large journalistic corpora SYN2006PUB, SYN2009PUB and SYN2013PUB.
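The size arithmetic can be checked with a short sketch. It uses the rounded sizes from the composition table on this page and the exact count of tokens without punctuation from the statistics table. The function names are hypothetical, and treating every corpus whose name ends in "PUB" as journalistic is an assumption made for illustration; the representative corpora also contain some journalism, so the share computed here is only a rough lower bound.

```python
# Rounded sizes in millions of words, copied from the composition table.
SIZES_MIL = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
}

# Exact number of tokens without punctuation, from the statistics table.
EXACT_WORDS = 2_231_541_041


def total_mil(sizes):
    """Sum of the component sizes; valid because the corpora share no texts."""
    return sum(sizes.values())


def journalistic_share(sizes):
    """Share of the total held by corpora whose name ends in 'PUB' (assumption)."""
    pub = sum(size for name, size in sizes.items() if name.endswith("PUB"))
    return pub / total_mil(sizes)


if __name__ == "__main__":
    print(f"rounded total: {total_mil(SIZES_MIL)} mil. words")
    print(f"exact total: {EXACT_WORDS / 1e9:.3f} billion words")
    print(f"journalistic share: {journalistic_share(SIZES_MIL):.1%}")
```

The rounded sizes add up to slightly more than the exact figure, which is expected given that the table lists each component only to the nearest few million words.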
The SYN version 3 corpus is referential and will remain accessible to users even after newer versions have been published. Keep in mind, however, that its linguistic information will become outdated, which is a natural consequence of the referential (unchanging) nature of the corpus.
The composition of the SYN version 3 corpus
Referential written language corpora (synchronic and general) ordered by date of creation

corpus | size (words) | lemmatization | morphological tags | publication year | corpus description |
---|---|---|---|---|---|
SYN2013PUB | 935 mil. | YES | YES | 2013 | corpus of journalistic texts from the years 2005–2009 |
SYN2010 | 100 mil. | YES | YES | 2010 | representative corpus, mainly texts from the years 2005–2009 |
SYN2009PUB | 700 mil. | YES | YES | 2010 | corpus of journalistic texts from the years 1995–2007 |
SYN2006PUB | 300 mil. | YES | YES | 2006 | corpus of journalistic texts from the years 1989–2004 |
SYN2005 | 100 mil. | YES | YES | 2005 | representative corpus, mainly texts from the years 2000–2004 |
SYN2000 | 100 mil. | YES | YES | 2000 | representative corpus, mainly texts from the years 1990–1999 |
The journalistic part of the SYN version 3 corpus covers the output of the main national dailies (Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk) and non-specialized magazines (Reflex, Respekt, Týden) from the years 1998–2009. A table giving the sizes of the 15 titles most represented in the journalistic part of the SYN version 3 corpus (broken down by year; figures are in millions of words, i.e. tokens not counting punctuation) can be downloaded below; an overview of the composition of the journalistic part is shown in the following chart.
How to cite SYN version 3
Křen, M. – Čermák, F. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Novotná, R. – Petkevič, V. – Procházka, P. – Schmiedtová, V. – Skoumalová, H. – Šulc, M.: Korpus SYN, verze 3 z 27. 1. 2014. Ústav Českého národního korpusu FF UK, Praha 2014. Available online: http://www.korpus.cz
Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): The SYN-series corpora of written Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.
— Michal Křen, Olga Richterová
Related links
SYN • SYN version 4 • SYN2000 • SYN2005 • SYN2006PUB • SYN2009PUB • SYN2010 • SYN2013PUB • SYN2015