Corpus Etalon: manually annotated corpus of Czech texts
The Etalon corpus is a synchronic morphologically annotated corpus of written Czech. The morphological tagging was performed manually, following the same principles as in SYN2020.
| Name | | | Etalon |
|---|---|---|---|
| Positions | Number of tokens | | 2 265 762 |
| | Number of positions without punctuation | | 1 885 621 |
| Structures | Number of documents <doc> | | 94 |
| | Number of sentences <s> | | 153 774 |
| Corpus composition | Fiction | tokens | 538 219 |
| | | words | 436 548 |
| | | sentences | 38 919 |
| | Professional literature | tokens | 912 194 |
| | | words | 758 227 |
| | | sentences | 60 098 |
| | Journalism | tokens | 815 349 |
| | | words | 690 846 |
| | | sentences | 54 757 |
| Year of publication | | | 2021 |
The corpus, which contains 2 265 762 tokens (including punctuation), is intended to serve two main purposes:
- As a set of training and testing data for automatic tagging methods, whether they are rule-based, stochastic, based on neural networks, or of some other kind.
Etalon corpus composition
The Etalon corpus is composed of journalistic, professional and fiction texts, most of which come from the SYN2010 corpus. Journalistic (36%) and professional (40%) texts predominate, but fiction (24%) is also well represented (see the table). Some texts are not included in full, because the complete texts would have unbalanced the styles. Other texts were edited, because the main goal of our work was to obtain standard texts: we corrected obvious typos, as well as sentences whose text had been broken up and rearranged during computer processing.
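The percentages quoted above can be checked against the token counts in the table. The following is a minimal sketch of that arithmetic; the names `TOKENS` and `shares` are illustrative only and are not part of any corpus tooling.

```python
# Token counts per text type, copied from the table of corpus statistics.
TOKENS = {
    "Fiction": 538_219,
    "Professional literature": 912_194,
    "Journalism": 815_349,
}


def shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_tokens = sum(TOKENS.values())   # should equal the corpus size in tokens
percentages = shares(TOKENS)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The three token counts add up to the total of 2 265 762 tokens, and the rounded shares match the 24%, 40% and 36% given in the text.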
Morphological annotation
The Etalon corpus is segmented, lemmatized, and morphologically annotated in the same way as SYN2020: each token carries the attributes word, synword, lemma, sublemma, tag and verbtag.
Accessing the corpus
The Etalon corpus is accessible in two ways:
- As a corpus of the Czech National Corpus (CNK), via the KonText interface.
- As data in vertical format, which can be downloaded from the LINDAT/CLARIN repository (for non-commercial use). The data is divided into segments of at most 100 words (not counting punctuation), and the segments are shuffled.
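For readers working with the downloaded data, the following is a minimal sketch of reading a vertical file sentence by sentence. It rests on assumptions that should be checked against the documentation shipped with the data: one token per line, tab-separated columns in the order word, synword, lemma, sublemma, tag, verbtag, and structures such as <doc> and <s> on lines of their own. The `Token` type, the `read_sentences` function and the sample text (with placeholder values `TAG` and `VTAG`) are hypothetical and exist only for illustration.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    # ASSUMPTION: column order follows the attribute list given for the corpus.
    word: str
    synword: str
    lemma: str
    sublemma: str
    tag: str
    verbtag: str


def read_sentences(lines) -> Iterator[List[Token]]:
    """Yield one list of Token objects per <s>...</s> element."""
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" in line:
            # Token line: keep it only when we are inside a sentence.
            # Assumes at least six tab-separated columns are present.
            if sentence is not None:
                sentence.append(Token(*line.split("\t")[:6]))
        elif line == "<s>" or line.startswith("<s "):
            sentence = []
        elif line == "</s>" and sentence is not None:
            yield sentence
            sentence = None
        # Any other structural line (e.g. <doc ...>, </doc>) is skipped.


# Made-up sample, not taken from the corpus; TAG and VTAG are placeholders.
SAMPLE = """<doc id="placeholder">
<s>
Ahoj\tahoj\tahoj\tahoj\tTAG\tVTAG
.\t.\t.\t.\tTAG\tVTAG
</s>
</doc>
"""

sentences = list(read_sentences(SAMPLE.splitlines()))
print(len(sentences), [t.word for t in sentences[0]])
```

To read a real file, pass an open file object instead of `SAMPLE.splitlines()`. Because the segments are shuffled, the sentences in one file should not be assumed to form continuous text.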
Acknowledgments
I would like to thank all the annotators who contributed to the creation of Etalon. There were many of them and I can't list all of them here. But I would also like to thank my colleagues from ÚTKL, especially Milena Hnátková, Vladimír Petkevič and Tomáš Jelínek for their help in testing and detecting errors.
How to cite Etalon corpus
Skoumalová, H.: Etalon: manually annotated synchronic corpus of Czech texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2021. Available from WWW: http://www.korpus.cz