Differences

This shows you the differences between two versions of the page.

en:cnk:syn2009pub [2015/10/21 18:51] – Graph of titles in English Václav Horký
en:cnk:syn2009pub [2021/03/16 11:26] (current) – [Structure of the SYN2009PUB corpus] Jan Kocek
Line 1: Line 1:
 ~~NOTOC~~
-====== Corpus SYN2009PUB =======
+====== Corpus SYN2009PUB ======
 + 
 +SYN2009PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]]. It contains exclusively journalistic texts from 1995 to 2007; the total size of the corpus is 700 million words (tokens). All the SYN-series corpora are disjoint with respect to the texts used, that is, no text that is part of one corpus is included in any of the others. The corpora [[en:cnk:SYN2000]], [[en:cnk:SYN2005]], [[en:cnk:SYN2006PUB]] and SYN2009PUB thus contain a total of 1,200 million text words (tokens).
  
 <WRAP right 35 %>
Line 16: Line 18:
 </WRAP>
  
-SYN2009PUB is a synchronic corpus of written journalism, a sequel to [[en:cnk:SYN2006PUB]]. It contains exclusively journalistic texts from 1995 to 2007; the total size of the corpus is 700 million words (tokens). All the SYN-series corpora are disjoint with respect to the texts used, that is, no text that is part of one corpus is included in any of the others. The corpora [[en:cnk:SYN2000]], [[en:cnk:SYN2005]], [[en:cnk:SYN2006PUB]] and SYN2009PUB thus contain a total of 1,200 million text words (tokens).
+===== Changes compared to the SYN2006PUB corpus =====
 +The [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:tag|morphological tagging]] of SYN2009PUB were improved in comparison with the older corpora. This mainly concerns the following:
 +  * lemmatization of personal and possessive pronouns
 +  * grammatical categories are no longer determined for abbreviations and foreign words
 +  * tokenization (detection of word form boundaries) -- mainly in the case of abbreviations and hyphenated word forms
 +  * the [[en:pojmy:tag|tagset]] itself was slightly simplified; the differences lie in the elimination of values that grouped together several categories
  
-The lemmatization and morphological tagging of SYN2009PUB were improved in comparison with the older corpora. This mainly concerns the lemmatization of personal and possessive pronouns, the fact that grammatical categories are no longer determined for abbreviations and foreign words, and also the tokenization (detection of word form boundaries) -- mainly in the case of abbreviations and hyphenated word forms. The tagset itself was slightly simplified; the differences lie in the elimination of values that grouped together several categories.
+===== Composition of the SYN2009PUB corpus =====
-
+
-It should be stressed that the SYN2009PUB corpus does not claim to be representative in any way. Although dozens of independent regional newspapers and other titles have been included (in addition to the rather uniform Deníky Bohemia and Deníky Moravia), their overall share is very low. It is clear from the charts below that the corpus composition is balanced neither by year of publication nor by title. The SYN2009PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data.
+
  
 +It should be stressed that the SYN2009PUB corpus does not claim to be representative in any way. Although dozens of independent regional newspapers and other titles have been included (in addition to the rather uniform Deníky Bohemia and Deníky Moravia), their overall share is very low. It is clear from the charts below that the corpus composition is balanced neither by year of publication nor by title. The SYN2009PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data.
 [{{:cnk:syn2009pub-roky.gif?direct&320|Corpus structure according to years (no. of words in mil.)}}]
-[{{:en:cnk:syn2009pub-slozeni-tituly.gif?direct&400|Corpus structure according to titles (no. of words in mil.)}}]
+[{{:en:cnk:syn2009pub-slozeni-tituly-en.gif?direct&400|Corpus structure according to titles (no. of words in mil.)}}]
  
 +===== Structure of the SYN2009PUB corpus =====
  
 +The [[en:pojmy:atributy_strukturni|structural units]] used in this corpus include ''<opus>'' (text), ''<doc>'' (document) and ''<s>'' (sentence), followed by the individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|positions]].
 +They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
 +[{{:en:cnk:struktur_znacky_09pub.png?direct&400| Structural units of the SYN2009PUB corpus.}}]
  
-====== Citing SYN2009PUB ======
+====== How to cite SYN2009PUB ======
  
 <WRAP round tip 75%>
Line 35: Line 45:
 </WRAP>
  
-
+ --- //Michal Křen, Olga Richterová//
-====== See also ======
+====== Related links ======
 <WRAP round box 49%>
 [[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] •  [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]]
 </WRAP>
 +