Differences

This shows you the differences between two versions of the page.

en:cnk:syn2015 [2016/06/30 13:58] Václav Cvrček
en:cnk:syn2015 [2020/09/01 17:46] (current) – [Positional annotation and tagging] Michal Křen
Line 1: Line 1:
 ~~NOTOC~~ ~~NOTOC~~
 ====== Corpus SYN2015 ====== ====== Corpus SYN2015 ======
 +
 +SYN2015 is a representative corpus of contemporary written Czech published in December 2015. SYN2015 is a sequel to the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]]), but at the same time, it reflects necessary methodological and technological enhancements outlined below.
 +
  
 <WRAP right 35%> <WRAP right 35%>
Line 16: Line 19:
 ^ ::: ^ Publication year |  2015 | ^ ::: ^ Publication year |  2015 |
 </WRAP> </WRAP>
 +===== Changes compared to other SYN series corpora =====
  
-SYN2015 is a representative corpus of contemporary written Czech published in December 2015. SYN2015 is a sequel to the representative corpora of the SYN series ([[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2010|SYN2010]]), but at the same time, it reflects necessary methodological and technological enhancements outlined below.  +==== The concept of written language in SYN2015 ====
- +
-Approach adopted to **representativeness** differs from previous corpora of the SYN-series. SYN2015 is designed to contain a large number of texts in order to cover vast majority of varieties the corpus aims to represent. This corresponds to Biber's notion of representativeness in terms of //texts as products//. Unlike the previous corpora in this series, SYN2015 is designed as representative, but not claimed to be balanced.+
  
 SYN2015 is designed as a representation of **contemporary printed language** of the last five-year period, i.e. 2010–2014. As the borders of synchronicity vary across the registers, the following criteria for inclusion of the individual texts into SYN2015 have been adopted (based on the three top-level categories, cf. below): SYN2015 is designed as a representation of **contemporary printed language** of the last five-year period, i.e. 2010–2014. As the borders of synchronicity vary across the registers, the following criteria for inclusion of the individual texts into SYN2015 have been adopted (based on the three top-level categories, cf. below):
Line 25: Line 27:
   * non-fiction: first publication date within the last 25 years;   * non-fiction: first publication date within the last 25 years;
   * newspapers and magazines: publication date within the given five-year period.   * newspapers and magazines: publication date within the given five-year period.
 +==== Representativeness in SYN2015 ====
  
-The original **text classification** scheme of the SYN series has been updated and revised; both original and revised classifications are based on text-external criteria that reflect predominant function of a text. The revision has been made with respect to comparability with the original scheme, with the most significant change made to the sub-classification of non-fiction adopted from the [[http://www.en.nkp.cz|Czech National Library]] and more detailed classification of newspaper texts.+The approach adopted to **representativeness** differs from that of the previous corpora of the SYN series. SYN2015 contains a broad spectrum of text types in order to cover the vast majority of the varieties the corpus aims to represent. This corresponds to Biber's notion of representativeness in terms of //texts as products//. Unlike the previous corpora in this series, SYN2015 is designed as representative, but not claimed to be balanced. 
 + 
 +==== Text classification ==== 
 + 
 +The original **text classification** scheme of the SYN series has been [[https://wiki.korpus.cz/doku.php/en:cnk:klasifikace_textu_syn2015|updated and revised]]; both original and revised classifications are based on text-external criteria that reflect predominant function of a text. The revision has been made with respect to comparability with the original scheme, with the most significant change made to the sub-classification of non-fiction adopted from the [[http://www.en.nkp.cz|Czech National Library]] and more detailed classification of newspaper texts. 
 + 
 +^ Txtype_group ^ Portion ^ 
 +| FIC: fiction |  33,33 % | 
 +| NFC: non-fiction |  33,33 % | 
 +| NMG: newspapers and magazines |  33,33 % | 
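
The portions in the table above are written with a decimal comma. The following is a minimal Python sketch of how such figures could be read and summed; the helper name `parse_portion` and the dictionary `PORTIONS` are illustrative assumptions, not part of any CNC tooling.

```python
# Portions copied verbatim from the table above (decimal comma, percent sign).
# PORTIONS and parse_portion are hypothetical names used only for illustration.
PORTIONS = {
    "FIC": "33,33 %",  # fiction
    "NFC": "33,33 %",  # non-fiction
    "NMG": "33,33 %",  # newspapers and magazines
}


def parse_portion(text: str) -> float:
    """Convert a string such as '33,33 %' into the float 33.33."""
    return float(text.replace("%", "").strip().replace(",", "."))


# Sum of the three macro groups; the rounded figures add up to slightly under 100.
total = sum(parse_portion(p) for p in PORTIONS.values())
```

Because each portion is rounded to two decimal places, the total falls marginally short of 100 %, which is expected for three equal thirds.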
 + 
 +[{{:en:cnk:nfc-en.png?direct&400|Composition of non-fiction (NFC) part of the SYN2015}}] 
 +[{{:en:cnk:roky-nmg-en.png?direct&400|Proportion of traditional and leisure journalism within the newspapers and magazines in each year}}] 
 + 
 + 
 +<WRAP clear></WRAP> 
 + 
 +In line with its predecessors, SYN2015 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2015 are set arbitrarily, yet close to the original figures.  
 + 
 +In addition to text type and genre, the classification metadata available for every document include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/youth). The standard division of newspapers into individual articles is supplemented by a separate classification of the articles into 13 sections (politics, economics, sports, culture, leisure, commentaries etc.) and by information about the author, which is available for all prominent newspaper titles. 
 + 
 +A more detailed description of the text types contained within the macro groups:
  
 ^  txtype  ^  genre / genre_group  ^  category  ^  proportion  ^ ^  txtype  ^  genre / genre_group  ^  category  ^  proportion  ^
Line 36: Line 60:
 | X | | other |  0,33 % | | X | | other |  0,33 % |
 | **Non-fiction** (NFC) |||  33,33 % | | **Non-fiction** (NFC) |||  33,33 % |
-| SCI/PRO/POP | HUM | humanities |  7 % |+| SCI (scientific)\\ \\ PRO (professional)\\ \\ POP (popular) | HUM | humanities |  7 % |
 | ::: | SSC | social sciences |  7 % | | ::: | SSC | social sciences |  7 % |
 | ::: | NAT | natural sciences |  7 % | | ::: | NAT | natural sciences |  7 % |
Line 49: Line 73:
 | LEI | | leisure magazines |  13,33 % | | LEI | | leisure magazines |  13,33 % |
  
-In line with its predecessors, SYN2015 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2015 are set arbitrarily, yet close to the original figures. +Detailed information about the text classification scheme is available [[https://wiki.korpus.cz/doku.php/en:cnk:klasifikace_textu_syn2015|here]].
  
-Next to the text type and genre, metadata related to the text classification and available for every document also include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/youth). Standard division of the newspapers into the individual articles is also supplemented by their separate classification into 13 sections (politics, economics, sports, culture, leisure, commentaries etc.) and information about the author that is available for all prominent newspaper titles.+==== Concept of synchronicity ==== 
 + 
 +We work under the assumption that a [[en:pojmy:synchronni|synchronic]] text is one that is still being read (or published), which is indicated by the year of publication. The boundaries of synchrony differ for each of the three macro groups: 
 + 
 +  * for fiction it is 25 + 75, i.e. the time elapsed since the first publication is less than 75 years (approximately three living generations) and the given issue of the text being added to the corpus is no older than 25 years (ensuring reception in the present), 
 +  * for non-fiction texts the first issue must be no older than 25 years, 
 +  * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period which is being mapped by the corpus (in the case of SYN2015 it is the period between 2010 and 2014). 
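
The three synchrony rules listed above can be sketched as a small Python check. This is only an illustration under stated assumptions: the function name `is_synchronic`, its parameters, and the choice to measure a text's age against the last year of the mapped period (2014) are hypothetical; the page does not say which reference year is used for the 25- and 75-year limits.

```python
def is_synchronic(group, first_published, this_issue, period=(2010, 2014)):
    """Return True if a text meets the synchrony rule of its macro group.

    group           -- "FIC", "NFC" or "NMG" (macro groups from the table above)
    first_published -- year of the first publication of the text
    this_issue      -- year of the issue actually added to the corpus
    period          -- years mapped by the corpus (2010-2014 for SYN2015)

    Assumption: ages are counted from the last year of the period.
    """
    start, end = period
    age_of_first = end - first_published
    age_of_issue = end - this_issue

    if group == "FIC":
        # "25 + 75": first publication less than 75 years ago,
        # and the given issue no older than 25 years.
        return age_of_first < 75 and age_of_issue <= 25
    if group == "NFC":
        # First issue no older than 25 years.
        return age_of_first <= 25
    if group == "NMG":
        # Published within the period mapped by the corpus.
        return start <= this_issue <= end
    raise ValueError(f"unknown macro group: {group!r}")
```

For example, under these assumptions a novel first published in 1950 and reissued in 2005 would pass the fiction rule, while the same novel in a 1980 edition would not.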
 + 
 +The resulting makeup of the corpus in number of words over the years is summarized by the following graph. 
 + 
 + [{{:en:cnk:roky-en.png?direct&600|Proportion of fiction, non-fiction, newspapers and magazines in each year}}] 
 + 
 +==== Positional annotation and tagging ==== 
 + 
 +Compared to previous versions, there have been improvements in [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:morfologicka_analyza|morphological tagging]]. Both are almost identical to the processes used for the corpus [[en:cnk:syn2013pub|SYN2013PUB]]; however, SYN2015 was processed using the newest versions of all the tools (the improvements relate both to the morphological dictionary and to the rule-based [[en:pojmy:desambiguace|disambiguation]]). Furthermore, the lemmatization of punctuation marks has changed, preserving the form of the characters as much as possible.
  
-[{{:en:cnk:roky-en.png?nolink&400|Proportion of fiction, non-fiction, newspapers and magazines in each year}}] +Last but not least, SYN2015 is the first CNC corpus featuring **[[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_analyza|syntactic annotation]]**.
-[{{:en:cnk:roky-nmg-en.png?nolink&400|Proportion of traditional and leisure journalism within the newspapers and magazines in each year}}] +
-[{{:en:cnk:nfc-en.png?nolink&400|Composition of non-fiction (NFC) part of the SYN2015}}]+
  
 ====== How to cite SYN2015 ====== ====== How to cite SYN2015 ======
Line 65: Line 101:
  
 Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A. (2016): [[http://www.lrec-conf.org/proceedings/lrec2016/pdf/186_Paper.pdf|SYN2015: Representative Corpus of Contemporary Written Czech]]. In: //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)//, 2522–2528. Portorož: ELRA. ISBN 978-2-9517408-9-1. Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A. (2016): [[http://www.lrec-conf.org/proceedings/lrec2016/pdf/186_Paper.pdf|SYN2015: Representative Corpus of Contemporary Written Czech]]. In: //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)//, 2522–2528. Portorož: ELRA. ISBN 978-2-9517408-9-1.
 +</WRAP>
 +
 +====== Related links ======
 +
 +<WRAP round box 49%>
 +[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2013PUB|SYN2013PUB]]
 </WRAP> </WRAP>