~~NOTOC~~
====== Corpus SYN2013PUB ======
SYN2013PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]] and [[en:cnk:SYN2009PUB]]. It contains exclusively journalistic texts from 2005 to 2009 drawn from 44 different titles; the total size of the corpus is 935 million words (tokens, not counting punctuation). All the [[en:cnk:SYN|SYN-series]] corpora are **disjunctive** as to the texts used, that is, no text that is part of one corpus is included in another. The overall size of all the SYN-series corpora thus exceeds 2 200 million text words (tokens).
^ Name ^^ SYN2013PUB ^
^ Positions ^ Number of positions (tokens) | 1 120 014 835 |
^ ::: ^ Number of positions (tokens) without punctuation | 934 781 949 |
^ ::: ^ Number of word forms (words) | 4 200 464 |
^ ::: ^ Number of lemmas | 2 549 185 |
^ Structural attributes ^ Number of opera | 21 469 |
^ ::: ^ Number of documents | 4 172 882 |
^ ::: ^ Number of sentences | 76 681 361 |
^ Further information ^ Reference | YES |
^ ::: ^ Representative | NO (newspapers and magazines) |
^ ::: ^ Publication date | 2013 |
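The figures in the table can be combined into a few derived values, such as the number of punctuation positions or the average sentence length. The following is only a minimal sketch: the constants are copied from the table above, while the function name ''derived_figures'' and the choice of ratios are illustrative assumptions, not part of any official corpus tooling.

```python
# Figures copied from the SYN2013PUB table above.
TOKENS = 1_120_014_835           # positions (tokens), punctuation included
TOKENS_NO_PUNCT = 934_781_949    # positions (tokens) without punctuation
OPERA = 21_469                   # number of opera
DOCUMENTS = 4_172_882            # number of documents
SENTENCES = 76_681_361           # number of sentences


def derived_figures():
    """Return a few simple values derived from the table (hypothetical helper)."""
    punctuation = TOKENS - TOKENS_NO_PUNCT
    return {
        # positions that are punctuation marks
        "punctuation_tokens": punctuation,
        # share of punctuation among all positions
        "punctuation_share": punctuation / TOKENS,
        # average sentence length in positions, punctuation included
        "tokens_per_sentence": TOKENS / SENTENCES,
        # average number of documents per opus
        "documents_per_opus": DOCUMENTS / OPERA,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.3f}")
```

These are plain averages over the whole corpus; they say nothing about how the values vary between individual titles or years.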
===== Changes compared to previous journalistic corpora =====
The [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:tag|morphological tagging]] of SYN2013PUB were improved in comparison with the previous corpora. In addition, the tagset itself was further simplified in a few cases where the information provided in the tags would be superfluous or hard to recognize from the context, and thus unreliable. The changes were the following:
* removing number for reflexive pronouns
* removing possessor's gender for pronouns //jeho//, //jejich//
* removing person and number for the word form //by//
===== Composition of SYN2013PUB =====
It should be stressed that the SYN2013PUB corpus does not claim to be representative in any way. The main reason for its compilation was the need for more data available in comparable proportions. With the inclusion of SYN2013PUB, version 3 of the [[en:cnk:SYN]] corpus contains complete volumes of major Czech newspapers from the period 1998–2009. Work on supplementing the synchronic written corpora with newer data is underway.
[{{:en:cnk:syn2013pub-roky-en.png?direct&325|Corpus structure according to years}}]
[{{:en:cnk:syn2013pub-tituly-en.png?direct&500|Corpus structure according to titles}}]
===== Structure of SYN2013PUB =====
The [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are the text (opus), the document and the sentence, followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]].
You can display them using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
[{{:en:cnk:struktur_znacky_13pub.png?direct&400| Structure units of SYN2013PUB.}}]
====== How to cite SYN2013PUB ======
Křen, M. – Hnátková, M. – Jelínek, T. – Petkevič, V. – Procházka, P. – Skoumalová, H.: //SYN2013PUB: korpus psané publicistiky//. Ústav Českého národního korpusu FF UK, Praha 2013. Available on-line: http://www.korpus.cz
Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): [[http://www.lrec-conf.org/proceedings/lrec2014/pdf/294_Paper.pdf|The SYN-series corpora of written Czech]]. In //Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)//, 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.
====== Related links ======
[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2015|SYN2015]]