~~NOTOC~~

====== Corpus SYN2013PUB ======

SYN2013PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]] and [[en:cnk:SYN2009PUB]]. It contains exclusively journalistic texts from 2005 to 2009, drawn from 44 different titles; the total size of the corpus is 935 million words (tokens without punctuation).

All the [[en:cnk:SYN|SYN-series]] corpora are **disjoint** with respect to the texts used, that is, no text that is part of one corpus is included in another. The overall size of all the SYN-series corpora thus exceeds 2 200 million text words (tokens).

^ Name ^^ SYN2013PUB ^
^ Positions ^ Number of positions (tokens) | 1 120 014 835 |
^ ::: ^ Number of positions (tokens) without punctuation | 934 781 949 |
^ ::: ^ Number of word forms (words) | 4 200 464 |
^ ::: ^ Number of lemmas | 2 549 185 |
^ Structural attributes ^ Number of opera | 21 469 |
^ ::: ^ Number of documents | 4 172 882 |
^ ::: ^ Number of sentences | 76 681 361 |
^ Further information ^ Reference | YES |
^ ::: ^ Representative | NO (newspapers and magazines) |
^ ::: ^ Publication date | 2013 |

===== Changes compared to previous journalistic corpora =====

The [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:tag|morphological tagging]] of SYN2013PUB were improved in comparison with the previous corpora. In addition, the tagset itself was further simplified in a few cases where the information provided in the tags would be superfluous or hard to determine from the context, and thus unreliable. The changes concerned the following:

  * removing number for reflexive pronouns
  * removing possessor's gender for the pronouns //jeho// and //jejich//
  * removing person and number for the word form //by//

===== Composition of SYN2013PUB =====

It should be stressed that the SYN2013PUB corpus does not claim to be representative in any way. The main reason for its compilation was the need for more data available in comparable proportions.
After the inclusion of SYN2013PUB, version 3 of the [[en:cnk:SYN]] corpus contains complete volumes of major Czech newspapers from the period 1998–2009. Work on supplementing the synchronic written corpora with newer data is underway.

[{{:en:cnk:syn2013pub-roky-en.png?direct&325|Corpus structure according to years}}]
[{{:en:cnk:syn2013pub-tituly-en.png?direct&500|Corpus structure according to titles}}]

===== Structure of SYN2013PUB =====

Among the [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are the text, the document and the sentence, followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]]. You can have them displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].

[{{:en:cnk:struktur_znacky_13pub.png?direct&400| Structure units of SYN2013PUB.}}]

====== How to cite SYN2013PUB ======

Křen, M. – Hnátková, M. – Jelínek, T. – Petkevič, V. – Procházka, P. – Skoumalová, H.: //SYN2013PUB: korpus psané publicistiky//. Ústav Českého národního korpusu FF UK, Praha 2013. Available on-line: http://www.korpus.cz

Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): [[http://www.lrec-conf.org/proceedings/lrec2014/pdf/294_Paper.pdf|The SYN-series corpora of written Czech]]. In //Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)//, 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.

====== Related links ======

[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2015|SYN2015]]