Corpus SYN2006PUB

SYN2006PUB is a synchronic corpus of written journalism containing 300 million words (tokens). It consists exclusively of journalistic texts from November 1989 to the end of 2004, the same period covered by the SYN2000 and SYN2005 corpora. The three corpora are disjoint with respect to the texts used: no text that is part of one corpus is included in either of the other two. Together, SYN2000, SYN2005 and SYN2006PUB thus contain a total of 500 million text words (tokens).

Name: SYN2006PUB
Positions
  Number of positions (tokens): 361 224 456
  Number of positions (tokens) without punctuation: 305 785 705
  Number of word forms (words): 2 554 069
  Number of lemmas: 1 381 900
Structural attributes
  Number of opera: 8 922
  Number of documents: 1 218 300
  Number of sentences: 22 339 344
Further information
  Reference: YES
  Representative: NO (newspapers and magazines)
  Publication date: 2006
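
The figures above can be combined into a few rough derived statistics, such as the share of punctuation or the average sentence length. The following is a minimal sketch in Python; the variable names are illustrative, and the ratios are simple arithmetic on the published counts, not official figures of the corpus.

```python
# Counts copied from the table above.
tokens = 361_224_456           # positions (tokens), including punctuation
tokens_no_punct = 305_785_705  # positions (tokens) without punctuation
word_forms = 2_554_069         # distinct word forms
lemmas = 1_381_900             # distinct lemmas
opera = 8_922
documents = 1_218_300
sentences = 22_339_344

# Punctuation positions are the difference between the two token counts.
punctuation = tokens - tokens_no_punct
punctuation_share = punctuation / tokens

# Average sizes of the structural units (derived, approximate).
tokens_per_sentence = tokens / sentences
sentences_per_document = sentences / documents
documents_per_opus = documents / opera

# Average number of distinct word forms per lemma.
forms_per_lemma = word_forms / lemmas

print(f"punctuation positions:  {punctuation:,}")
print(f"punctuation share:      {punctuation_share:.1%}")
print(f"tokens per sentence:    {tokens_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_document:.2f}")
print(f"documents per opus:     {documents_per_opus:.1f}")
print(f"word forms per lemma:   {forms_per_lemma:.2f}")
```

Averages of this kind only describe the corpus as a whole; they say nothing about individual titles or years, across which the corpus is not balanced.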

Changes compared to the SYN2005 corpus

The lemmatization and morphological tagging of the SYN2006PUB corpus have been improved in comparison with the SYN2005 corpus, although the difference is not as striking as that between the SYN2000 and SYN2005 corpora. The system of morphological tags, the tokenization (division of the corpus into words) and the segmentation (division into sentences) remain the same as in the SYN2005 corpus.

Composition of the SYN2006PUB corpus

It should be stressed that the SYN2006PUB corpus does not claim to be representative in any way. It is clear from the charts below that the corpus composition is balanced neither by year of publication nor by title. The SYN2006PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data.

Corpus structure according to years (no. of words in mil.)
Corpus structure according to titles (no. of words in mil.)

Structure of the SYN2006PUB corpus

The structural units used in this corpus are <opus>, <doc> and <s>, which correspond to the text (opus), the document and the sentence respectively; below the sentence level come the individual positions (tokens). They can be displayed using the menu item View options.

Structural units of the SYN2006PUB corpus.
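
The nesting of these units can be illustrated with a short Python sketch. It assumes a "vertical" text layout in which every structural tag and every position sits on its own line; this layout, the sample data and the function name count_units are assumptions made for illustration and are not taken from the corpus or its documentation.

```python
from collections import Counter

# Hypothetical sample, NOT actual SYN2006PUB data: one tag or one
# position (token) per line, with <s> nested in <doc> nested in <opus>.
SAMPLE = """\
<opus>
<doc>
<s>
Praha
je
město
.
</s>
<s>
Ano
.
</s>
</doc>
<doc>
<s>
Konec
.
</s>
</doc>
</opus>
"""

STRUCTURES = ("opus", "doc", "s")


def count_units(vertical_text):
    """Count opening structural tags and positions in the assumed layout."""
    counts = Counter()
    for raw_line in vertical_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("</"):
            continue  # skip blank lines and closing tags
        if line.startswith("<") and line.endswith(">"):
            parts = line[1:-1].split()
            if parts and parts[0] in STRUCTURES:
                counts[parts[0]] += 1  # opening tag of a known structure
                continue
        counts["position"] += 1  # anything else is treated as a position
    return counts


counts = count_units(SAMPLE)
for unit in (*STRUCTURES, "position"):
    print(f"{unit}: {counts[unit]}")
```

Applied to the whole corpus under the same assumption, such a count would be expected to match the numbers of opera, documents, sentences and positions listed in the table of corpus properties.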

How to cite SYN2006PUB

Čermák, F. – Doležalová-Spoustová, D. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Křen, M. – Novotná, R. – Petkevič, V. – Schmiedtová, V. – Skoumalová, H. – Šulc, M. – Velíšek, Z.: SYN2006PUB: korpus psané publicistiky. Ústav Českého národního korpusu FF UK, Praha 2006. Available on-line: http://www.korpus.cz

Related links