
Differences

This shows you the differences between two versions of the page.

en:cnk:syn2010 [2015/10/22 21:01] – spelling -iz- Václav Horký
en:cnk:syn2010 [2016/12/11 16:27] (current) Veronika Pojarová
Line 1: Line 1:
 ~~NOTOC~~
 ====== Corpus SYN2010 ======
 +
 +SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It is a sequel to the corpora [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]] and together with them forms a series of synchronic representative corpora that cover three successive periods. 
 +**All corpora contain different texts and are therefore disjunctive**. The basic characteristic features of the SYN2010 are identical to those of the corpus [[en:cnk:SYN2005|SYN2005]], which is predominantly related to the same conception of [[en:pojmy:reprezentativnost|representativeness]] based on the reception of written language and the resulting composition of the corpus. The SYN2010 corpus is [[en:pojmy:lemma|lemmatized]] and [[en:pojmy:tag|morphologically tagged]].
 +
  
 <WRAP right 35%>
Line 16: Line 20:
 </WRAP>
  
-[{{:en:cnk:syn2010-slozeni.gif|Structure of corpus SYN2010: <fc #ffcc00>40 % fiction</fc>, <fc #cc0000>27 % technical literature</fc>, <fc #3333ff>33 % journalism</fc>}}]
+====== Changes compared to the SYN2005 corpus ======
  
-SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It is a sequel to the corpora [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]] and together with them forms a series of synchronic representative corpora that cover three successive periods
+Compared to the corpus [[en:cnk:SYN2005|SYN2005]], the SYN2010 corpus saw **significant improvements in lemmatization** and **[[en:pojmy:tag|morphological tagging]]**; both basically identical to the processing of the [[en:cnk:SYN2009PUB|SYN2009PUB]] corpus. Therefore, although [[en:cnk:SYN2005|SYN2005]] and SYN2010 do not differ in their understanding of [[en:pojmy:reprezentativnost|representativeness]], **these differences should be taken into account** when comparing their lexical frequencies
  
-The basic characteristic features of the SYN2010 corpus are identical to those of SYN2005, especially the concept of representativeness based on the reception of written language, and the resulting composition of the corpus. All newspaper and magazine texts included into SYN2010 were published in 2005--2009, each year being equally represented -- just as in SYN2005. Naturally, the proportion of particular newspaper and magazine titles has changed. However, the criteria that define a synchronic text in both fiction and professional literature remained unchanged; the SYN2010 corpus thus includes solely professional texts published after 1989.
+====== Composition of SYN2010 ======
  
 Some of the fiction texts may have been published earlier, but there is a general rule that the corpus consists mainly of newer texts, whereas the proportion of older texts is decreasing. Compared to the SYN2005 corpus, the lemmatization and morphological tagging of the SYN2010 corpus have been significantly improved; both of them correspond with the processing of the [[en:cnk:SYN2009PUB]].
  
 +===== The general composition of SYN2010 =====
  
 +[{{:en:cnk:syn2010-slozeni.gif|Structure of corpus SYN2010: <fc #ffcc00>40 % fiction</fc>, <fc #cc0000>27 % technical literature</fc>, <fc #3333ff>33 % journalism</fc>}}]
 <WRAP clear></WRAP>
  
-[{{:en:cnk:syn2010-slozeni-odborna-en.gif|Structure of technical and other specialized literature according to thematic orientation (noof words  in mil.)}}]
+More detailed information about the genre composition of the SYN2010 corpus is shown by the CNC’s [[https://ucnk.ff.cuni.cz/slozeni2010.php|interactive graph]].
  
-[{{:en:cnk:syn2010-slozeni-publicistika-roky-en.gif|Structure of journalism according to the year of issue (no. of words  in mil.)}}]
+==== Composition of the journalistic texts ====
  
 +The basic characteristic features of the SYN2010 corpus are identical to those of SYN2005, especially the concept of representativeness based on the reception of written language, and the resulting composition of the corpus. All newspaper and magazine texts included into SYN2010 were published in 2005--2009, each year being equally represented -- just as in SYN2005. Naturally, the proportion of particular newspaper and magazine titles has changed. However, the criteria that define a synchronic text in both fiction and professional literature remained unchanged; the SYN2010 corpus thus includes solely professional texts published after 1989.
 +
 +<WRAP clear></WRAP>
 +[{{:en:cnk:syn2010-slozeni-odborna-en.gif|Structure of technical and other specialized literature according to thematic orientation (no. of words  in mil.)}}]
 +[{{:en:cnk:syn2010-slozeni-publicistika-roky-en.gif|Structure of journalism according to the year of issue (no. of words  in mil.)}}]
 [{{:en:cnk:syn2010-slozeni-publicistika-tituly-en.gif|Structure of journalism according to the newspaper title (no. of words  in mil.)}}]
  
 +===== Structure of the SYN 2010 corpus =====
  
-====== Citing SYN2010 ======
+Among the [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are ''<opus>'', ''<doc>'' and ''<s>''; the text, document and sentence – followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]]. They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
 + 
 +{{:cnk:strukturni_znacky.png?direct&300|Structural units and their attributes in the corpus manager}} 
 + 
 + 
 +--- //Michal Křen, Olga Richterová// 
 + 
 + 
 +====== How to cite SYN2010 ======
  
 <WRAP round tip 70%>
Line 42: Line 61:
  
  
-====== See also ======
+====== Related links ======
  
 <WRAP round box 48%>
 [[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]]
 </WRAP>
 +