
Differences

This shows you the differences between two versions of the page.

en:cnk:syn2000 [2015/10/22 20:38] – + graphs, statistics table, citing, see also + mention of FSC2000 vaclavhorky
en:cnk:syn2000 [2016/12/11 12:38] (current) – [Changes in the SYN series corpora] veronikapojarova
Line 7: Line 7:
 ^ ::: ^ Number of positions (tokens) without punctuation | 100 061 381 |
 ^ ::: ^ Number of word forms (words) | 1 763 813 |
-^ ::: ^ Number of lemmata | 891 713 |
+^ ::: ^ Number of lemmas | 891 713 |
 ^ Structural attributes ^ Number of documents (not opera) | 233 797 |
 ^ ::: ^ Number of sentences | 7 639 321 |
Line 15: Line 15:
 </WRAP>
 
-[{{:en:cnk:syn2000-slozeni.gif|Structure of corpus SYN2000: <fc #3333ff>60 % journalism</fc>, <fc #cc0000>25 % technical literature</fc>, <fc #ffcc00>15 % fiction</fc>}}]
+The corpus SYN2000, published in October 2000, contains 100 million words and is composed of complete texts only. The criteria for selecting texts were based on research on written language: they were to cover the widest possible genre stratification of the Czech language. SYN2000 is a synchronic corpus, which means that it covers contemporary Czech. It therefore contains primarily texts created in 1990–1999. However, important works of Czech literature were also included in the corpus (e.g. Karel Čapek's //Krakatit// or Josef Škvorecký's //Zbabělci// (The Cowards)). For older texts, the rule was that the author had to be born after 1880 for the text to be included in this corpus.
 
-The corpus SYN2000 contains 100 million words and is composed of complete texts only. The criteria for selecting texts were based on research on written language: they were to cover the widest possible genre stratification of the Czech language. SYN2000 is a synchronic corpus, which means that it covers contemporary Czech. It therefore contains primarily texts created in 1990–1999. However, important works of Czech literature were also included in the corpus (e.g. Karel Čapek's //Krakatit// or Josef Škvorecký's //Zbabělci// (The Cowards)). For older texts, the rule was that the author had to be born after 1880 for the text to be included in this corpus.
+An inspiration for the SYN2000 corpus was the [[wp>British_National_Corpus|British National Corpus]]; however, work on the BNC ceased in 1994.
 
-The SYN2000 corpus is lemmatized and morphologically tagged. This means that for each word (that is, each occurrence of a word in the text) you can view its morphological tag, which shows its grammatical categories (part of speech, number, case, etc.), and its lemma, which is the basic form of the word (for instance, the nominative singular for nouns and the infinitive for verbs). In addition, you can view the code identifying the text in which the searched word occurred.
+The SYN2000 corpus is [[en:pojmy:lemma|lemmatized]] and [[en:pojmy:tag|morphologically tagged]]. This means that for each word (that is, each occurrence of a word in the text) you can view its morphological tag, which shows its grammatical categories (part of speech, number, case, etc.), and its lemma, which is the basic form of the word (for instance, the nominative singular for nouns and the infinitive for verbs). In addition, you can view the code identifying the text in which the searched word occurred.
 
 The [[en:cnk:FSC2000]] corpus is a modified version of the SYN2000 with enhanced lemmatization, which was used as a source for the Frequency Dictionary of Czech.
+===== Changes in the SYN series corpora =====
+Please note the substantial changes in composition and processing between the SYN2000 and [[en:cnk:SYN2005|SYN2005]] corpora (and also between SYN2000 and [[en:cnk:SYN2010|SYN2010]]), which are summarized on the page for the [[en:cnk:syn2005#zmeny_oproti_korpusu_syn2000|SYN2005]] corpus. As a consequence of these changes, frequency data differ between the corpora.
+Changes in the concept of the corpus’s [[en:pojmy:reprezentativnost|representativeness]], which resulted in [[en:cnk:syn2005#novy_pristup_k_reprezentativnosti_slozeni_korpusu|substantial differences]] in composition compared to other corpora of the SYN series, can be seen in the following table, which compares the text-type composition of the SYN2000 and SYN2005 corpora.
 
-<WRAP clear></WRAP>
+| ^ SYN2005 ^ SYN2000 ^
+^ fiction | 40 % | 15 % |
+^ non-fiction | 27 % | 25 % |
+^ journalism | 33 % | 60 % |
  
-[{{:en:cnk:syn2000-slozeni-odborna-en.gif|Structure of technical and other specialized literature according to thematic orientation (no. of words in mil.)}}]
+==== Composition of the SYN2000 corpus ====
 
-[{{:en:cnk:syn2000-slozeni-publicistika-roky.gif|Structure of journalism according to the year of issue (no. of words in mil.)}}]
+[{{:en:cnk:syn2000-slozeni.gif|Structure of corpus SYN2000: <fc #3333ff>60 % journalism</fc>, <fc #cc0000>25 % technical literature</fc>, <fc #ffcc00>15 % fiction</fc>}}]
 
+<WRAP clear></WRAP>
+[{{:en:cnk:syn2000-slozeni-odborna-en.gif|Structure of technical and other specialized literature according to thematic orientation (no. of words in mil.)}}]
+[{{:en:cnk:syn2000-slozeni-publicistika-roky.gif|Structure of journalism according to the year of issue (no. of words in mil.)}}]
 [{{:en:cnk:syn2000-slozeni-publicistika-tituly.gif|Structure of journalism according to the newspaper title (no. of words in mil.)}}]
+<WRAP clear></WRAP>
  
  
+===== Structure of the SYN2000 corpus =====
 
-===== Citing SYN2000 =====
+The [[en:pojmy:atributy_strukturni|structural units]] used in this corpus include ''<doc>'' and ''<s>'' (document and sentence), followed by each individual [[en:pojmy:atributy_strukturni#pozice_jako_strukturni_jednotka|position]].
+They can be displayed using the menu item [[en:manualy:kontext:moznosti_zobrazeni|View options]].
+The newer corpora of the SYN series have an additional, higher structure labelled ''<opus>'' (this difference matters, for example, when searching with the ''within'' condition).
+
+[{{ :cnk:strukturni_znacky_syn2000.jpg?300 |Structural units in the SYN2000 corpus.}}]
+
+===== How to cite SYN2000 =====
  
 <WRAP round tip 70%>
Line 39: Line 56:
 </WRAP>
 
-===== See also =====
+===== Related links =====
 
 <WRAP round box 49%>
-[[en:cnk:syn|SYN]] • [[en:cnk:FSC2000]] • [[en:cnk:SYN2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:SYN2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]]
+[[en:cnk:syn|SYN]] • [[en:cnk:FSC2000]] • [[en:cnk:SYN2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:SYN2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]] • [[en:cnk:SYN2015|SYN2015]]
 </WRAP>
 +
 +