
Differences

This shows you the differences between two versions of the page.

Previous revision: en:cnk:syn2005 [2015/01/22 11:34] – created by vaclavcvrcek
Current revision: en:cnk:syn2005 [2016/12/11 12:11] (current) veronikapojarova
Line 1: Line 1:
-====== SYN2005 ======
 +~~NOTOC~~
 +====== Corpus SYN2005 ======
  
-The SYN2005 corpus is a synchronic representative corpus of contemporary written Czech, containing 100 million words (tokens). This basic characteristic is identical with its predecessor, the  [[SYN2000]] corpus. There are, however, also many differences between these two corpora, which must be taken into consideration when comparing any data in the two corpora (see below), because the mere mechanical comparison of frequencies can lead to misleading conclusions when these circumstances are not known. We also consider it important to emphasise that none of the corpus SYN2005 texts were previously used in the SYN2000 corpus.
 +The SYN2005 corpus is a synchronic representative corpus of contemporary written Czech containing 100 million words (tokens). It shares this basic characteristic with its predecessor, the [[SYN2000]] corpus. There are, however, many differences between the two corpora that must be taken into account when comparing their data (see below), because a merely mechanical comparison of frequencies can lead to misleading conclusions. It is also important to emphasize that none of the SYN2005 texts were previously used in the SYN2000 corpus; the two corpora are therefore disjoint in their texts and together contain 200 million words (tokens).
  
-The representative character of the SYN2005 corpus is based on a new research of written language reception, therefore, its structure is different in some respects from the SYN2000 corpus. The comparison of both corpora according to main fields are presented in the following chart:
 +<WRAP right 35%>
 +^ <fs medium>Name</fs> ^^ <fs medium>SYN2005</fs> ^ 
 +^ Positions ^ Number of positions (tokens) | 122 419 382 |   
 +^ ::: ^ Number of positions (tokens) without punctuation | 101 355 116 |   
 +^ ::: ^ Number of word forms (words) | 1 778 142 |   
 +^ ::: ^ Number of lemmata | 825 142 | 
 +^ Structural attributes ^ Number of opera | 2 382 | 
 +^ ::: ^ Number of documents | 132 353 | 
 +^ ::: ^ Number of sentences | 7 945 998 | 
 +^ Further information ^ Reference | YES  |   
 +^ ::: ^ Representative | YES |    
 +^ ::: ^ Publication date | 2005  |
 +</WRAP>
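
The figures in the table can be related to one another with simple arithmetic. The following is a minimal sketch that only reuses the counts listed above; it assumes that the difference between the two position counts consists entirely of punctuation tokens, as the row labels suggest. The variable names are illustrative.

```python
# Counts copied from the table above.
TOKENS_ALL = 122_419_382        # positions (tokens) including punctuation
TOKENS_NO_PUNCT = 101_355_116   # positions (tokens) without punctuation
SENTENCES = 7_945_998

# Assumption: every position missing from the second count is punctuation.
punctuation_tokens = TOKENS_ALL - TOKENS_NO_PUNCT
punctuation_share = punctuation_tokens / TOKENS_ALL

# Average sentence length in positions, punctuation included.
tokens_per_sentence = TOKENS_ALL / SENTENCES

print(f"punctuation tokens: {punctuation_tokens}")
print(f"punctuation share:  {punctuation_share:.3f}")
print(f"tokens per sentence: {tokens_per_sentence:.1f}")
```

This also shows why the corpus is described as containing 100 million words although it has over 122 million positions: the round figure refers to tokens without punctuation.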
  
 +====== Changes compared to the SYN2000 corpus ======
 +
 +==== A new approach to representativeness – the composition of the corpus ====
 +
 +The representative character of the SYN2005 corpus is based on new research into the reception of written language; its structure therefore differs in some respects from that of the SYN2000 corpus. A comparison of the two corpora by main field is presented in the following table:
  
 ^ ^ SYN2005 ^ SYN2000 ^
Line 11: Line 29:
 | journalism |  33 % |  60 % |
  
-More differences are visible also within the main fields; while the structure of technical literature as to its thematic orientation has only slightly changed, the structure of journalism, on the other hand, has changed considerably. Not only are all the journalistic texts from 2000–2004, and each year has the same representation, but in comparison with the SYN2000 corpus, also the representation of the individual newspapers and magazines has changed - more new titles were added and among them particularly the share of //Blesk// is important. What has not changed is the synchrony in the other two main fields; in the SYN2005 corpus, we can find technical literature from 1990–2004, the fiction may be even older. In both cases, however, special attention was paid to making sure that there were as few older texts as possible.
 +Further differences are visible within the main fields: while the thematic structure of technical literature has changed only slightly, the structure of journalism has changed considerably. All journalistic texts date from 2000–2004, with each year equally represented, and the representation of individual newspapers and magazines has also changed compared with the SYN2000 corpus -- new titles were added, among which the share of //Blesk// is particularly notable. What has not changed is the synchrony in the other two main fields: the SYN2005 corpus contains technical literature from 1990–2004, and the fiction may be even older. In both cases, however, special attention was paid to keeping the number of older texts as low as possible.
 +<WRAP clear></WRAP> 
 +[{{:en:cnk:syn2010-slozeni.gif|Structure of corpus SYN2005: <fc #ffcc00>40 % fiction</fc>, <fc #cc0000>27 % technical literature</fc>, <fc #3333ff>33 % journalism</fc>}}] 
 +<WRAP clear></WRAP> 
 + 
 +[{{:en:cnk:syn2005-slozeni-odborna-en.gif|Structure of technical and other specialized literature according to thematic orientation (no. of words  in mil.)}}] 
 +[{{:en:cnk:syn2005-slozeni-publicistika-roky.gif|Structure of journalism according to the year of issue (no. of words  in mil.)}}] 
 +[{{:en:cnk:syn2005-slozeni-publicistika-tituly-en.gif|Structure of journalism according to the newspaper title (no. of words  in mil.)}}] 
 +<WRAP clear></WRAP> 
 + 
 +==== What the differences between the corpora can cause ====
 + 
 +Differences between the corpora may cause, for instance, a marked **rise in the frequency of a certain word**. Such a rise need not reflect any change in the language captured by the newer corpus; it may result solely from **the higher proportion of fiction** it contains, while the distribution of the word in written language may not have changed at all.
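
The effect can be sketched numerically. In the following minimal example, the per-genre frequencies of the word are invented and deliberately identical for both corpora; the SYN2005 genre shares and the SYN2000 journalism share are taken from this page, while the split of the remaining 40 % of SYN2000 between fiction and technical literature is an assumption made only for illustration.

```python
# Hypothetical relative frequencies (instances per million tokens) of a word
# that is typical of fiction. The same values are used for both corpora,
# i.e. the word's behaviour within each genre has NOT changed.
IPM_BY_GENRE = {"fiction": 120.0, "technical": 10.0, "journalism": 30.0}

# Genre shares of each corpus. SYN2005 shares and the SYN2000 journalism
# share come from this page; the SYN2000 fiction/technical split is assumed.
SYN2005 = {"fiction": 0.40, "technical": 0.27, "journalism": 0.33}
SYN2000 = {"fiction": 0.15, "technical": 0.25, "journalism": 0.60}


def overall_ipm(ipm_by_genre, shares):
    """Overall relative frequency as the share-weighted mean of genre rates."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("genre shares must sum to 1")
    return sum(ipm_by_genre[genre] * share for genre, share in shares.items())


ipm_2000 = overall_ipm(IPM_BY_GENRE, SYN2000)
ipm_2005 = overall_ipm(IPM_BY_GENRE, SYN2005)

print(f"SYN2000: {ipm_2000:.1f} ipm, SYN2005: {ipm_2005:.1f} ipm")
```

Under these assumptions the overall frequency rises noticeably in SYN2005 even though nothing changed inside any genre, which is why frequencies are better compared within the same main field or after accounting for composition.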
 + 
 +==== New lemmatization and morphological annotation ==== 
 +Since June 2006, the SYN2005 corpus has been lemmatized and morphologically tagged. A more advanced version of lemmatization and morphological tagging was used for this corpus. The system of morphological tags remains the same, however; the only novelty is position no. 16, which expresses verbal aspect.
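
Because the tags are positional, the aspect value can be read directly from the sixteenth character of a tag. The sketch below assumes tags are plain strings with one character per position; the example tag is illustrative rather than taken from the corpus, and the inventory of aspect values is not given on this page, so the function only returns the raw character.

```python
ASPECT_POSITION = 16  # 1-based position number, as stated in the text


def aspect_from_tag(tag):
    """Return the character at position 16 of a positional tag, or None
    if the tag is too short to contain that position."""
    if len(tag) < ASPECT_POSITION:
        return None
    return tag[ASPECT_POSITION - 1]  # convert to a 0-based index


# Illustrative 16-character tag (hypothetical, not quoted from SYN2005).
example_tag = "VB-S---3P-AA---I"
print(aspect_from_tag(example_tag))
```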
 + 
 +The new lemmatization and morphological tagging of the SYN2005 corpus also come with new and improved tokenization (division of the corpus into words) and segmentation (division into sentences). For instance, the word //česko-polský// was divided into three tokens (//česko - polský//) in SYN2000, while in SYN2005 it is a single token (//česko-polský//).
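
The difference between the two treatments of hyphenated words can be illustrated with two toy tokenizers. This is only a sketch based on regular expressions; it is not the tokenizer actually used for either corpus, and the function names are hypothetical.

```python
import re
import unicodedata


def tokenize_split_hyphens(text):
    """Treat every hyphen as a separate token (the SYN2000-like result
    for this example)."""
    text = unicodedata.normalize("NFC", text)  # keep accented letters whole
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_keep_hyphens(text):
    """Keep a hyphen that directly joins two word parts inside one token
    (the SYN2005-like result for this example)."""
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


word = "česko-polský"
print(tokenize_split_hyphens(word))  # three tokens
print(tokenize_keep_hyphens(word))   # one token
```

Token counts and frequencies of hyphenated expressions are therefore not directly comparable between the two corpora.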
 +==== Clear bibliographical information ==== 
 +The last major change concerns the identification of sources: many users of the SYN2000 corpus justifiably criticized the fact that they had to look up bibliographic information on our web site under a code. In the SYN2005 corpus, all relevant information about a text (author, title, publisher, year of publication, etc.) is available through the corpus manager.
 + 
 +===== Structure of the SYN2005 corpus ===== 
 +The [[en:pojmy:atributy_strukturni|structural units]] used in this corpus are ''<opus>'', ''<doc>'' and ''<s>'' (the text, the document and the sentence), followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]].
 +They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]]. 
 + 
 +[{{ :cnk:struktur_znacky.jpg?300 |Structural units in the SYN2005 corpus.}}] 
 + 
 + 
 +===== How to cite SYN2005 ===== 
 + 
 +<WRAP round tip 75%> 
 +Čermák, F. – Doležalová-Spoustová, D. – Hlaváčová, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kopřivová, M. – Křen, M. – Novotná, R. – Petkevič, V. – Schmiedtová, V. – Skoumalová, H. – Šulc, M. – Velíšek, Z.: //SYN2005: žánrově vyvážený korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2005. Available on-line: http://www.korpus.cz 
 +</WRAP> 
 + 
 + 
 +===== Related links ===== 
 + 
 +<WRAP round box 52%> 
 +[[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]] 
 +</WRAP> 
 + 
  
-Since June 2006, the SYN2005 corpus has been lemmatised and morphologically tagged. A more advanced version of lemmatisation and morphological tagging was used for this corpus. The system of morphological tags remains the same, however, and the only novelty is position no. 16 which expresses the verbal aspect.
  
-Also new and improved tokenisation (division of the corpus into words) and segmentation (division into sentences) is connected with the new lemmatisation and morphological tagging of the SYN2005 corpus. For instance, the word //česko-polský// was divided into three tokens (//česko - polský//) in SYN2000, while in SYN2005, it is only one token (//česko-polský//). 
  
-The last major change concerns source determining; many users of the SYN2000 corpus justifiably criticised the fact that they had to look for the bibliographic information on our web site under a code. In the SYN2005 corpus, all relevant information about the text (author, name, publisher, year of publishing, etc.) is available through the corpus manager.