


Corpus SYN version 3

Name: SYN version 3
Positions
    Number of tokens: 2 685 127 310
    Number of tokens without punctuation: 2 231 541 041
    Number of word forms: 7 604 328
    Number of lemmas: 5 170 696
Structures
    Number of opuses: 49 882
    Number of documents: 9 163 021
    Number of sentences: 178 499 972
Other information
    Referential: YES
    Representative: NO (predominantly journalism)
    Publication year: 2014
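
A few derived figures follow directly from the counts above. The sketch below is only an illustration: the variable names are ours, not part of any corpus tool, and the averages are plain ratios of the published totals, not values reported by the corpus authors.

```python
# Counts copied from the table above.
TOKENS = 2_685_127_310   # all positions, punctuation included
WORDS = 2_231_541_041    # tokens without punctuation
SENTENCES = 178_499_972
DOCUMENTS = 9_163_021

# Punctuation is whatever remains after subtracting the word count.
punctuation_tokens = TOKENS - WORDS
punctuation_share = punctuation_tokens / TOKENS

# Simple averages over the whole corpus (illustrative ratios only).
tokens_per_sentence = TOKENS / SENTENCES
sentences_per_document = SENTENCES / DOCUMENTS

print(f"punctuation tokens:     {punctuation_tokens:,}")
print(f"punctuation share:      {punctuation_share:.1%}")
print(f"tokens per sentence:    {tokens_per_sentence:.1f}")
print(f"sentences per document: {sentences_per_document:.1f}")
```

Roughly one position in six is punctuation, which is why the page quotes corpus sizes both with and without it.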

Every SYN corpus contains all the synchronic written corpora of the SYN series published up until the time of the given version's publication. The corpus SYN version 3 therefore contains the corpora SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010 and SYN2013PUB.

Because all of these corpora are disjoint (i.e. no text appears in more than one of them), the total size of SYN version 3 is the sum of their sizes, which comes to 2.232 billion words (tokens without punctuation). The SYN corpus is not representative: journalism is its dominant component, owing to the size of the journalistic corpora SYN2006PUB, SYN2009PUB and SYN2013PUB.

The SYN version 3 corpus is referential, and will remain accessible to users even after newer versions have been published. It is, however, necessary to keep in mind that its linguistic information will become outdated, a natural consequence of the corpus being referential, i.e. fixed after publication.

The composition of the SYN version 3 corpus

Referential written language corpora (synchronic and general) ordered by date of creation

corpus      size (words)  lemmatization  morphological tags  publication year  corpus description
SYN2013PUB  935 mil.      YES            YES                 2013              corpus of journalistic texts from the years 2005–2009
SYN2010     100 mil.      YES            YES                 2010              representative corpus, mainly texts from the years 2005–2009
SYN2009PUB  700 mil.      YES            YES                 2010              corpus of journalistic texts from the years 1995–2007
SYN2006PUB  300 mil.      YES            YES                 2006              corpus of journalistic texts from the years 1989–2004
SYN2005     100 mil.      YES            YES                 2005              representative corpus, mainly texts from the years 2000–2004
SYN2000     100 mil.      YES            YES                 2000              representative corpus, mainly texts from the years 1990–1999
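
Since the component corpora do not overlap, the overall size can be cross-checked by adding up the sizes in the table. The following is a minimal sketch with two assumptions: the table sizes are rounded nominal values, and only the three PUB corpora are counted as journalism. The representative corpora also contain some journalistic texts, so the share computed here is a rough lower bound. The names used are illustrative.

```python
# Nominal sizes in millions of words, as listed in the table above (rounded).
NOMINAL_SIZES_MIL = {
    "SYN2013PUB": 935,
    "SYN2010": 100,
    "SYN2009PUB": 700,
    "SYN2006PUB": 300,
    "SYN2005": 100,
    "SYN2000": 100,
}

# Exact number of tokens without punctuation reported for SYN version 3.
EXACT_WORDS = 2_231_541_041

nominal_total = sum(NOMINAL_SIZES_MIL.values()) * 1_000_000
difference = nominal_total - EXACT_WORDS

# Assumption: a corpus is journalistic if and only if its name ends in "PUB".
journalistic_mil = sum(
    size for name, size in NOMINAL_SIZES_MIL.items() if name.endswith("PUB")
)
journalistic_share = journalistic_mil / sum(NOMINAL_SIZES_MIL.values())

print(f"nominal total:      {nominal_total:,}")
print(f"exact total:        {EXACT_WORDS:,}")
print(f"rounding gap:       {difference:,} ({difference / EXACT_WORDS:.2%})")
print(f"journalistic share: {journalistic_share:.1%}")
```

The nominal sizes add up to slightly more than the exact count, which is consistent with the table values being rounded, and the PUB corpora alone account for well over four fifths of the total.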

The journalistic part of the corpus SYN version 3 covers the output of most national daily newspapers (Mladá fronta DNES, Lidové noviny, Právo, Hospodářské noviny, Blesk) and non-specialized magazines (Reflex, Respekt, Týden) from the years 1998–2009. A table of the 15 titles most represented in the journalistic part of the corpus, broken down by year, can be downloaded below; the numbers are in millions of words, i.e. positions not counting punctuation. A preview of the composition of the journalistic part is shown in the following graph.

Composition of the journalism part of SYN version 3

Preview of the composition of the journalism part of SYN version 3

How to cite SYN version 3

Křen, M. – Čermák, F. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Novotná, R. – Petkevič, V. – Procházka, P. – Schmiedtová, V. – Skoumalová, H. – Šulc, M.: Corpus SYN, version 3 from 27. 1. 2014. Ústav Českého národního korpusu FF UK, Praha 2014. Available online: http://www.korpus.cz

Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): The SYN-series corpora of written Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.

Michal Křen, Olga Richterová
