Differences

This shows you the differences between two versions of the page.

en:cnk:syn2013pub [2015/10/21 18:38]
Václav Horký Reference: YES
en:cnk:syn2013pub [2016/12/11 10:51] (current)
Veronika Pojarová
Line 1: Line 1:
 ~~NOTOC~~
-====== Corpus SYN2013PUB =======
 +====== Corpus SYN2013PUB ======
 + 
 +SYN2013PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]] and [[en:cnk:SYN2009PUB]]. It contains exclusively journalistic texts from 2005 to 2009, drawn from 44 different titles; the total size of the corpus is 935 million words (tokens). All the [[en:cnk:SYN|SYN-series]] corpora are **disjoint** with respect to the texts used, that is, no text that is part of one corpus is included in another. The overall size of all the SYN-series corpora thus exceeds 2 200 million text words (tokens).
  
 <WRAP right 35 %>
Line 7: Line 9:
 ^ ::: ^ Number of positions (tokens) without punctuation | 934 781 949 |
 ^ ::: ^ Number of word forms (words) | 4 200 464 |
-^ ::: ^ Number of lemmata | 2 549 185 |
 +^ ::: ^ Number of lemmas | 2 549 185 |
 ^ Structural attributes ^ Number of opera | 21 469 |
 ^ ::: ^ Number of documents | 4 172 882 |
Line 15: Line 17:
 ^ ::: ^ Publication date | 2013  |
 </WRAP>
 +===== Changes compared to previous journalistic corpora =====
 +The [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:tag|morphological tagging]] of SYN2013PUB were improved in comparison with the previous corpora. In addition, the tagset itself was further simplified in a few cases where the information provided in the tags was superfluous or hard to determine from the context, and thus unreliable. The changes were the following:
 +  * removing number for reflexive pronouns
 +  * removing possessor's gender for the pronouns //jeho// and //jejich//
 +  * removing person and number for the word form //by//
  
-Corpus SYN2013PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]] and [[en:cnk:SYN2009PUB]]. It contains exclusively journalistic texts from 2005 to 2009, drawn from 44 different titles; the total size of the corpus is 935 million words (tokens). All the [[en:cnk:SYN|SYN-series]] corpora are **disjoint** with respect to the texts used, that is, no text that is part of one corpus is included in another. The overall size of all the SYN-series corpora thus exceeds 2 200 million text words (tokens).
-
-The lemmatization and morphological tagging of SYN2013PUB were improved in comparison with the previous corpora. In addition, the tagset itself was further simplified in a few cases where the information provided in the tags was superfluous or hard to determine from the context, and thus unreliable. The changes concerned removing number for reflexive pronouns, removing possessor's gender for the pronouns //jeho// and //jejich//, and removing person and number for the word form //by//.
 +===== Composition of SYN2013PUB =====
  
 It should be stressed that the SYN2013PUB corpus does not claim to be representative in any way. The main reason for its compilation was the need for more data available in comparable proportions. With the inclusion of SYN2013PUB, version 3 of the [[en:cnk:SYN]] corpus contains complete volumes of major Czech newspapers from the 1998–2009 period. Work on supplementing the synchronic written corpora with newer data is underway.
- 
  
 [{{:en:cnk:syn2013pub-roky-en.png?direct&325|Corpus structure according to years}}]
-
 [{{:en:cnk:syn2013pub-tituly-en.png?direct&500|Corpus structure according to titles}}]
  
 +===== Structure of SYN2013PUB =====
  
 +Among the [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are ''<opus>'', ''<doc>'' and ''<s>'', corresponding to the text, the document and the sentence respectively, followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]].
 +They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
  
 +[{{:​cnk:​struktur_znacky.jpg?​300|FIXME Structure units of SYN2013PUB.}}]
  
-====== Citing SYN2013PUB ======
 +====== How to cite SYN2013PUB ======

 <WRAP round tip 70%>
Line 38: Line 45:
 </WRAP>
  
-====== See also ======
 +====== Related links ======

 <WRAP round box 49%>
-[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]]
 +[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2015|SYN2015]]
 </WRAP>
 +