


Corpus SYN2000

Name: SYN2000

Positions
  Number of positions (tokens): 120 908 724
  Number of positions (tokens) without punctuation: 100 061 381
  Number of word forms (words): 1 763 813
  Number of lemmata: 891 713

Structural attributes
  Number of documents (not opera): 233 797
  Number of sentences: 7 639 321

Further information
  Reference: YES
  Representative: YES
  Publication date: 2000
  Structure of corpus SYN2000: 60 % journalism, 25 % technical literature, 15 % fiction
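
The relationship between the counts listed above can be checked with a little arithmetic. The following is only a sketch: it assumes that the difference between the two position counts consists entirely of punctuation tokens, and the variable names are illustrative, not part of any corpus tool.

```python
# Figures copied from the corpus overview above.
TOKENS_TOTAL = 120_908_724      # all positions (tokens)
TOKENS_NO_PUNCT = 100_061_381   # positions without punctuation
SENTENCES = 7_639_321

# Assumption: every position excluded from the second count is punctuation.
punctuation_tokens = TOKENS_TOTAL - TOKENS_NO_PUNCT
punctuation_share = punctuation_tokens / TOKENS_TOTAL
tokens_per_sentence = TOKENS_TOTAL / SENTENCES

print(f"punctuation tokens: {punctuation_tokens:,}")
print(f"share of punctuation: {punctuation_share:.1%}")
print(f"average tokens per sentence: {tokens_per_sentence:.1f}")
```

Under that assumption, roughly one position in six is a punctuation mark, which is why the corpus is described as containing about 100 million words although it has almost 121 million positions.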

The SYN2000 corpus contains 100 million words and is composed of complete texts only. The criteria for selecting texts were based on research into written language: the texts were to cover the widest possible range of genres of the Czech language. SYN2000 is a synchronic corpus, which means that it covers contemporary Czech. It therefore contains primarily texts created in 1990–1999. However, important works of Czech literature were also included in the corpus (e.g. Karel Čapek's Krakatit or Josef Škvorecký's Zbabělci (The Cowards)). For older texts, the rule was that the author had to have been born after 1880 for the text to be included in this corpus.

The SYN2000 corpus is lemmatized and morphologically tagged. This means that for each word (that is, each occurrence of a word in a text) you can view its morphological tag, which shows its grammatical categories (part of speech, number, case, etc.), and its lemma, which is the basic form of the word (for instance, the nominative singular for nouns and the infinitive for verbs). You can also view the code that identifies the text in which the searched word occurred.
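
To make the annotation described above more concrete, here is a minimal sketch of how a single corpus position could be modelled. Everything in it is an assumption made for illustration: the `Token` class, the field names, the text codes and in particular the tag strings are hypothetical and do not reproduce the actual SYN2000 tagset or data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str    # word form as it occurs in the text
    lemma: str   # basic form (nominative singular, infinitive, ...)
    tag: str     # morphological tag (hypothetical notation, not the real tagset)
    doc_id: str  # code identifying the source text (hypothetical values)


tokens = [
    Token("hradu", "hrad", "noun|singular|genitive", "doc_A"),
    Token("hrady", "hrad", "noun|plural|nominative", "doc_A"),
    Token("viděl", "vidět", "verb|singular|past", "doc_B"),
]


def forms_by_lemma(tokens):
    """Group the distinct word forms found in the data under their lemma."""
    grouped = defaultdict(set)
    for token in tokens:
        grouped[token.lemma].add(token.word)
    return dict(grouped)


print(forms_by_lemma(tokens))
```

Grouping by lemma in this way illustrates why the corpus has far fewer lemmata than word forms: several inflected forms share one basic form.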

The FSC2000 corpus, a modified version of SYN2000 with enhanced lemmatization, was used as a source for the Frequency Dictionary of Czech.

Structure of technical and other specialized literature according to thematic orientation (no. of words in mil.)
Structure of journalism according to the year of issue (no. of words in mil.)
Structure of journalism according to the newspaper title (no. of words in mil.)

Citing SYN2000

Čermák, F. – Blatná, R. – Hlaváčová, J. – Klímová, J. – Kocek, J. – Kopřivová, M. – Křen, M. – Petkevič, V. – Schmiedtová, V. – Šulc, M.: SYN2000: žánrově vyvážený korpus psané češtiny. Ústav Českého národního korpusu FF UK, Praha 2000. Available on-line: http://www.korpus.cz
