Corpus Etalon: manually annotated corpus of Czech texts

The Etalon corpus is a synchronic morphologically annotated corpus of written Czech. The morphological tagging was performed manually, following the same principles as in SYN2020.

Name: Etalon
Positions
  Number of tokens: 2 265 762
  Number of positions without punctuation: 1 885 621
Structures
  Number of documents <doc>: 94
  Number of sentences <s>: 153 774
Corpus composition
  Fiction: 538 219 tokens, 436 548 words, 38 919 sentences
  Professional literature: 912 194 tokens, 758 227 words, 60 098 sentences
  Journalism: 815 349 tokens, 690 846 words, 54 757 sentences
Year of publication: 2021

The corpus, which contains 2 265 762 tokens (including punctuation), is intended to serve two main purposes:

  1. As a standard for the SYN2020 corpus, i.e. in case of doubt about the correctness of sentence segmentation, tokenization or morphological tagging in SYN2020 or other corpora of the SYN series, it should show how the corpus ought to be annotated.
  2. As a set of training and testing data for automatic tagging methods, whether they are based on linguistic rules, stochastic models, neural networks, etc.

Etalon corpus composition

The Etalon corpus is composed of journalistic, professional and fiction texts. Most of the texts come from the SYN2010 corpus. Journalistic (36%) and professional (40%) texts predominate, but fiction (24%) is also significantly represented (see the table). Some texts are not included in full, because they would otherwise cause an imbalance of styles. Other texts were edited, because the main goal of our work was to obtain standard texts: we corrected obvious typos, as well as sentences in which the text had been broken and rearranged during computer processing.
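The percentages quoted above can be checked against the token counts in the table. The following is a minimal sketch of that arithmetic; the variable names are illustrative only and are not part of the corpus documentation.

```python
# Token counts per text type, copied from the table above.
tokens = {
    "fiction": 538_219,
    "professional": 912_194,
    "journalism": 815_349,
}

# Sum of the three parts; equals the corpus size of 2 265 762 tokens.
total = sum(tokens.values())

# Share of each text type, rounded to a whole percent.
shares = {name: round(100 * count / total) for name, count in tokens.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

The shares are computed from tokens (positions including punctuation), not from words.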

Morphological annotation

The Etalon corpus is segmented, lemmatized, and morphologically annotated in the same way as SYN2020: the corpus contains attributes word, synword, lemma, sublemma, tag and verbtag.

Accessing the corpus

The Etalon corpus is accessible in two ways:

  1. As a corpus of the Czech National Corpus (ČNK), through the KonText interface.
  2. Data in vertical form: this data can be downloaded from the LINDAT/CLARIN repository (for non-commercial use). This data is divided into segments of a maximum of 100 words (without punctuation) and the segments are shuffled.
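For readers who want to use the downloaded data as training or testing material, the sketch below shows one way to read a vertical file sentence by sentence. It assumes the common vertical layout (one token per line, attributes separated by tabs, structures such as <doc> and <s> on lines of their own) and it assumes that the columns follow the attribute order listed above; both assumptions should be checked against the documentation shipped with the data. The function name read_sentences and the sample lines, including the placeholder tag values, are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# ASSUMPTION: column order follows the attribute list in the text above.
ATTRIBUTES = ("word", "synword", "lemma", "sublemma", "tag", "verbtag")


def is_structure(line: str) -> bool:
    """A structural line looks like an XML tag and carries no tab-separated columns."""
    return line.startswith("<") and line.endswith(">") and "\t" not in line


def read_sentences(lines: Iterable[str]) -> Iterator[list[dict[str, str]]]:
    """Yield every <s>...</s> block as a list of token dictionaries."""
    sentence: list[dict[str, str]] | None = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if is_structure(line):
            if line == "<s>" or line.startswith("<s "):
                sentence = []  # a new sentence starts
            elif line == "</s>" and sentence is not None:
                yield sentence
                sentence = None
            continue  # other structures (e.g. <doc>) are ignored here
        if sentence is not None:
            # zip() stops at the shorter sequence, so extra columns are dropped.
            sentence.append(dict(zip(ATTRIBUTES, line.split("\t"))))


# Hypothetical sample; TAG and VERBTAG are placeholders, not real annotation values.
sample = [
    "<doc>\n",
    "<s>\n",
    "Psi\tPsi\tpes\tpes\tTAG\tVERBTAG\n",
    "štěkají\tštěkají\tštěkat\tštěkat\tTAG\tVERBTAG\n",
    ".\t.\t.\t.\tTAG\tVERBTAG\n",
    "</s>\n",
    "</doc>\n",
]

sentences = list(read_sentences(sample))
print(len(sentences), [token["lemma"] for token in sentences[0]])
```

With a real file, the same function could be called on an open file handle, for example `with open(path, encoding="utf-8") as f: ...`, where `path` points to the downloaded vertical file. Because the published segments are shuffled, sentences read this way do not preserve the order of the original documents.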

Acknowledgments

I would like to thank all the annotators who contributed to the creation of Etalon. There were too many of them to list here. I would also like to thank my colleagues from ÚTKL, especially Milena Hnátková, Vladimír Petkevič and Tomáš Jelínek, for their help in testing and detecting errors.

How to cite Etalon corpus

Skoumalová, H.: Etalon: manually annotated synchronic corpus of Czech texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2021. Available from WWW: http://www.korpus.cz