Corpus SYN2013PUB

SYN2013PUB is a synchronic corpus of written journalism and a sequel to SYN2006PUB and SYN2009PUB. It contains exclusively journalistic texts from 2005 to 2009, drawn from 44 different titles; its total size is 935 million words (tokens excluding punctuation). The SYN-series corpora are disjoint with respect to the texts used: no text that is part of one corpus is included in another. The overall size of all the SYN-series corpora thus exceeds 2 200 million words (tokens).

Name: SYN2013PUB

Positions
  • Number of positions (tokens): 1 120 014 835
  • Number of positions (tokens) without punctuation: 934 781 949
  • Number of word forms (words): 4 200 464
  • Number of lemmas: 2 549 185

Structural attributes
  • Number of opera: 21 469
  • Number of documents: 4 172 882
  • Number of sentences: 76 681 361

Further information
  • Reference: YES
  • Representative: NO (newspapers and magazines)
  • Publication date: 2013
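
The figures above can be combined into a few derived ratios, such as the share of punctuation among all positions or the average sentence length. The following is a minimal sketch in Python; the constants are copied from the table, while the function name and the dictionary keys are chosen here purely for illustration.

```python
# Figures copied from the corpus table above.
TOKENS = 1_120_014_835           # all positions (tokens), including punctuation
TOKENS_NO_PUNCT = 934_781_949    # positions without punctuation
OPERA = 21_469
DOCUMENTS = 4_172_882
SENTENCES = 76_681_361


def derived_stats() -> dict[str, float]:
    """Return simple ratios derived from the published counts.

    The key names are illustrative and not part of any corpus tool.
    """
    punctuation = TOKENS - TOKENS_NO_PUNCT
    return {
        "punctuation_tokens": punctuation,
        "punctuation_share": punctuation / TOKENS,
        "tokens_per_sentence": TOKENS / SENTENCES,
        "words_per_sentence": TOKENS_NO_PUNCT / SENTENCES,
        "sentences_per_document": SENTENCES / DOCUMENTS,
        "documents_per_opus": DOCUMENTS / OPERA,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,.3f}")
```

By this arithmetic, roughly 16.5 % of all positions are punctuation, and an average sentence has about 14.6 positions.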

Changes compared to previous journalistic corpora

Lemmatization and morphological tagging in SYN2013PUB were improved in comparison with the previous corpora. In addition, the tagset itself was further simplified in a few cases where the information provided in the tags would be superfluous or hard to determine from context, and thus unreliable. The changes were the following:

  • removing number for reflexive pronouns
  • removing possessor's gender for pronouns jeho, jejich
  • removing person and number for the word form by
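
The simplifications listed above can be pictured as blanking individual positions of a positional tag. The sketch below rests on several assumptions that this page does not state: that tags follow the Prague-style positional layout (position 4 = number, position 6 = possessor's gender, position 8 = person), that a removed value is written as "-", and that the sample tags resemble those actually used. The function `mask_positions` and the sample tags are hypothetical and serve only as illustration.

```python
from typing import Iterable

# Assumed 1-based positions in a Prague-style positional tag.
NUMBER = 4
POSSESSOR_GENDER = 6
PERSON = 8


def mask_positions(tag: str, positions: Iterable[int], blank: str = "-") -> str:
    """Replace the given 1-based positions of a positional tag with `blank`."""
    chars = list(tag)
    for pos in positions:
        if not 1 <= pos <= len(chars):
            raise ValueError(f"position {pos} is outside tag {tag!r}")
        chars[pos - 1] = blank
    return "".join(chars)


if __name__ == "__main__":
    # Illustrative tags only; real corpus tags may differ.
    examples = [
        ("se", "P7-X4----------", [NUMBER]),             # reflexive pronoun
        ("jeho", "PSXXXZS3-------", [POSSESSOR_GENDER]),  # possessive pronoun
        ("by", "Vc-X---3-------", [PERSON, NUMBER]),      # conditional form
    ]
    for word, tag, positions in examples:
        print(word, tag, "->", mask_positions(tag, positions))
```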

Composition of SYN2013PUB

It should be stressed that the SYN2013PUB corpus does not claim to be representative in any way. The main reason for its compilation was the need for more data available in comparable proportions. With the inclusion of SYN2013PUB, version 3 of the SYN corpus contains complete volumes of major Czech newspapers from the period 1998–2009. Work on supplementing the synchronic written corpora with newer data is underway.

Corpus structure according to years
Corpus structure according to titles

Structure of SYN2013PUB

The structural units used in this corpus include <opus>, <doc> and <s>, which correspond to the text, the document and the sentence, respectively; below the sentence come the individual positions. You can display them using the menu item View options.

Structure units of SYN2013PUB.
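
To make the nesting of <opus>, <doc>, <s> and positions concrete, here is a minimal Python sketch that counts each unit in a small sample. It assumes a vertical-style text layout (one position per line, structure tags on separate lines), which this page does not describe; the sample data, the `id` attributes and the function `count_units` are invented for illustration.

```python
import re
from collections import Counter

# Invented sample, not actual corpus data. Each position line holds a word
# form and a lemma separated by a tab; structure tags stand on their own lines.
SAMPLE = """\
<opus id="title_a">
<doc id="doc_1">
<s>
Vláda\tvláda
jedná\tjednat
.\t.
</s>
<s>
Prší\tpršet
.\t.
</s>
</doc>
<doc id="doc_2">
<s>
Dobrý\tdobrý
den\tden
!\t!
</s>
</doc>
</opus>
"""

# Matches an opening structure tag such as <s> or <doc id="...">.
OPEN_TAG = re.compile(r"^<(\w+)[^>]*>$")


def count_units(vertical: str) -> Counter:
    """Count opening structure tags and positions in vertical-style text."""
    counts: Counter = Counter()
    for raw in vertical.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = OPEN_TAG.match(line)
        if match:
            counts[match.group(1)] += 1
        elif not line.startswith("</"):
            # Anything that is not a structure tag is one position (token).
            counts["position"] += 1
    return counts


if __name__ == "__main__":
    for unit, n in count_units(SAMPLE).items():
        print(unit, n)
```

Applied to the full corpus, counts of this kind would correspond to the numbers of opera, documents, sentences and positions given in the table near the top of this page.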

How to cite SYN2013PUB

Křen, M. – Hnátková, M. – Jelínek, T. – Petkevič, V. – Procházka, P. – Skoumalová, H.: SYN2013PUB: korpus psané publicistiky. Ústav Českého národního korpusu FF UK, Praha 2013. Available on-line: http://www.korpus.cz

Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): The SYN-series corpora of written Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.
