
Differences

This shows you the differences between two versions of the page.

en:cnk:syn2006pub [2015/10/21 18:57] – created vaclavhorky
en:cnk:syn2006pub [2021/03/16 11:25] (current) – [Structure of the SYN2006PUB corpus] jankocek
Line 1: Line 1:
 ~~NOTOC~~ ~~NOTOC~~
-====== Corpus SYN2006PUB =======+====== Corpus SYN2006PUB ====== 
 + 
 +The SYN2006PUB is a synchronic corpus of written journalism of 300 million of words (tokens). It contains exclusively journalistic texts from November 1989 to the end of 2004, that is the time period covered by corpora [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]]. All three corpora are disjunctive as to the texts used, that is no text, which is part of one corpus, is included in the other two. Corpora [[en:cnk:SYN2000]], [[en:cnk:SYN2005]] and SYN2006PUB thus contain a total of 500 million text words (tokens).
  
 <WRAP right 35%> <WRAP right 35%>
Line 7: Line 9:
 ^ ::: ^ Number of positions (tokens) without punctuation | 305 785 705 |   ^ ::: ^ Number of positions (tokens) without punctuation | 305 785 705 |  
 ^ ::: ^ Number of word forms (words) | 2 554 069 |    ^ ::: ^ Number of word forms (words) | 2 554 069 |   
-^ ::: ^ Number of lemmata | 1 381 900 |+^ ::: ^ Number of lemmas | 1 381 900 |
 ^ Structural attributes ^ Number of opera | 8 922 | ^ Structural attributes ^ Number of opera | 8 922 |
 ^ ::: ^ Number of documents | 1 218 300 | ^ ::: ^ Number of documents | 1 218 300 |
Line 15: Line 17:
 ^ ::: ^ Publication date | 2006 | ^ ::: ^ Publication date | 2006 |
 </WRAP> </WRAP>
- +===== Changes compared to the SYN2005 corpus =====
-The SYN2006PUB is a synchronic corpus of written journalism of 300 million of words (tokens). It contains exclusively journalistic texts from November 1989 to the end of 2004, that is the time period covered by corpora [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]]. All three corpora are disjunctive as to the texts used, that is no text, which is part of one corpus, is included in the other two. Corpora [[en:cnk:SYN2000]], [[en:cnk:SYN2005]] and SYN2006PUB thus contain a total of 500 million text words (tokens).+
  
 The lemmatization and morphological tagging of the SYN2006PUB corpus have been improved in comparison with the [[en:cnk:SYN2005]] corpus, although the difference is not as striking as in the case of [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]] corpora. The system of morphological tags, the tokenization (division of the corpus into words) and segmentation (division into sentences) remains the same as in the [[en:cnk:SYN2005]] corpus. The lemmatization and morphological tagging of the SYN2006PUB corpus have been improved in comparison with the [[en:cnk:SYN2005]] corpus, although the difference is not as striking as in the case of [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]] corpora. The system of morphological tags, the tokenization (division of the corpus into words) and segmentation (division into sentences) remains the same as in the [[en:cnk:SYN2005]] corpus.
  
-It should be stressed that the SYN2006PUB corpus does not claim to be representative in any way. It is clear from the charts below that the corpus composition is balanced neither according to the year of issue, nor according to the titles. The SYN2006PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data. +===== Composition of the SYN2006PUB corpus =====
  
 +It should be stressed that the SYN2006PUB corpus does not claim to be representative in any way. It is clear from the charts below that the corpus composition is balanced neither according to the year of issue, nor according to the titles. The SYN2006PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data.
 <WRAP clear></WRAP> <WRAP clear></WRAP>
- 
  
 [{{:cnk:syn2006pub-roky.gif?direct&370|Corpus structure according to years (no. of words in mil.)}}] [{{:cnk:syn2006pub-roky.gif?direct&370|Corpus structure according to years (no. of words in mil.)}}]
-[{{:cnk:syn2006pub-slozeni-tituly-en.gif?direct&378|Corpus strucutre according to titles (no. of words in mil.)}}]+[{{:cnk:syn2006pub-slozeni-tituly-en.gif?direct&378|Corpus structure according to titles (no. of words in mil.)}}]
  
 +====== Structure of the SYN2006PUB corpus ======
  
 +Among the [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are ''<opus>'', ''<doc>'' and ''<s>''; the text, document and sentence – followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]].
 +They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
  
-====== Citing SYN2006PUB ======+[{{:en:cnk:struktur_znacky_06pub.png?direct&400| Structural units of the SYN2006PUB corpus.}}] 
 + 
 +====== How to cite SYN2006PUB ======
  
 <WRAP round tip 70%> <WRAP round tip 70%>
Line 36: Line 42:
 </WRAP> </WRAP>
  
- +====== Related links ======
-====== See also ======+
 <WRAP round box 49%> <WRAP round box 49%>
 [[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]] [[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]]
 </WRAP> </WRAP>
 +