The Etalon corpus is a synchronic morphologically annotated corpus of written Czech. The morphological tagging was performed manually, following the same principles as in SYN2020.
| Name | Etalon | | |
|---|---|---|---|
| Positions | Number of tokens | 2 265 762 | |
| | Number of positions without punctuation | 1 885 621 | |
| Structures | Number of documents <doc> | 94 | |
| | Number of sentences <s> | 153 774 | |
| Corpus composition | Fiction | tokens | 538 219 |
| | | words | 436 548 |
| | | sentences | 38 919 |
| | Professional literature | tokens | 912 194 |
| | | words | 758 227 |
| | | sentences | 60 098 |
| | Journalism | tokens | 815 349 |
| | | words | 690 846 |
| | | sentences | 54 757 |
| Year of publication | 2021 | | |
The corpus, which contains 2 265 762 tokens (that is, positions including punctuation), is intended to serve two main purposes:
The Etalon corpus is composed of journalistic, professional and fiction texts, most of which come from the SYN2010 corpus. Journalistic (36%) and professional (40%) texts predominate, but fiction (24%) is also significantly represented (see the table). Some texts are included only in part, because including them in full would have unbalanced the styles. Other texts were edited, because the main goal of our work was to obtain standard texts: we corrected obvious typos, as well as sentences in which the text had been broken and rearranged during computer processing.
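The percentages quoted above can be reproduced from the token counts in the table. The following is a minimal sketch of that arithmetic; the variable names are illustrative and are not part of any corpus tooling.

```python
# Token counts per text type, copied from the table above.
tokens = {
    "Fiction": 538_219,
    "Professional literature": 912_194,
    "Journalism": 815_349,
}

# The three parts add up to the corpus size in tokens.
total = sum(tokens.values())

# Share of each text type, rounded to a whole percent.
shares = {name: round(100 * count / total) for name, count in tokens.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The same check works for the word and sentence counts, which likewise sum to the totals given in the table (1 885 621 and 153 774).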
The Etalon corpus is segmented, lemmatized, and morphologically annotated in the same way as SYN2020: the corpus contains attributes word, synword, lemma, sublemma, tag and verbtag.
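As an illustration of the six attributes, the sketch below models one corpus position as a record. It assumes a tab-separated, one-token-per-line layout with the attributes in the order listed above. That layout, the `Token` and `parse_line` names, and the sample line are assumptions made for this example, not a description of how Etalon is actually distributed.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with the six attributes named in the text."""
    word: str
    synword: str
    lemma: str
    sublemma: str
    tag: str
    verbtag: str


def parse_line(line: str) -> Token:
    """Split one tab-separated line (assumed layout) into a Token."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Token._fields):
        raise ValueError(
            f"expected {len(Token._fields)} fields, got {len(fields)}"
        )
    return Token(*fields)


# Made-up placeholder line: TAG and VERBTAG stand in for real tag strings.
sample = "texty\ttexty\ttext\ttext\tTAG\tVERBTAG\n"
token = parse_line(sample)
print(token.word, token.lemma)
```

A named tuple keeps the attribute order explicit, so a line with a missing or extra field is rejected rather than silently misaligned.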
The Etalon corpus is accessible in two ways:
I would like to thank all the annotators who contributed to the creation of Etalon. There were many of them, and I cannot list them all here. I would also like to thank my colleagues from ÚTKL, especially Milena Hnátková, Vladimír Petkevič and Tomáš Jelínek, for their help in testing and detecting errors.
Skoumalová, H.: Etalon: manually annotated synchronic corpus of Czech texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2021. Available from WWW: http://www.korpus.cz