Corpus Etalon: manually annotated corpus of Czech texts

The Etalon corpus is a synchronic morphologically annotated corpus of written Czech. The morphological tagging was performed manually, following the same principles as in SYN2020.

Name                                                           Etalon
Positions            Number of tokens                          2 265 762
                     Number of positions without punctuation   1 885 621
Structures           Number of documents <doc>                 94
                     Number of sentences <s>                   153 774
Corpus composition   Fiction                  tokens           538 219
                                              words            436 548
                                              sentences        38 919
                     Professional literature  tokens           912 194
                                              words            758 227
                                              sentences        60 098
                     Journalism               tokens           815 349
                                              words            690 846
                                              sentences        54 757
Year of publication                                            2021

The corpus, which contains 2 265 762 tokens (including punctuation), should serve two main purposes:

As a standard for the SYN2020 corpus, i.e. whenever there is doubt about the correctness of sentence segmentation, tokenization or morphological tagging in SYN2020 or in other corpora of the SYN series, it should answer the question of how the text should be annotated.
As a set of training and testing data for automatic tagging methods, whether they are based on linguistic rules, stochastic models, neural networks, etc.
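
To illustrate the second purpose, the following minimal sketch shows how a tagger's output might be scored against manually assigned tags. The function name `tagging_accuracy` and the one-letter tags are hypothetical placeholders chosen for this example; they are not real SYN2020 tags and not part of any tool distributed with the corpus.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Placeholder tags for illustration only (assumption: not real SYN2020 tags).
gold_tags = ["A", "B", "C", "B"]
predicted_tags = ["A", "B", "B", "B"]

accuracy = tagging_accuracy(gold_tags, predicted_tags)
print(f"{accuracy:.2f}")  # prints 0.75
```

Token-level accuracy is only one possible measure; it presupposes that the tagger's tokenization matches that of the manually annotated data.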

Etalon corpus composition

The Etalon corpus is composed of journalistic, professional and fiction texts. Most of the texts come from the SYN2010 corpus. Journalistic (36%) and professional (40%) texts predominate, but fiction (24%) is also significantly represented (see the table above). Some texts are not included in full, because their complete versions would have upset the balance of styles. Other texts were edited, because the main goal of our work was to obtain standard texts: we corrected obvious typos, as well as sentences in which the text had been broken and rearranged during computer processing.
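
The percentages quoted above can be rechecked from the token counts in the table. The short sketch below does this arithmetic; the variable names are illustrative only.

```python
# Token counts per text type, copied from the table.
TOKENS = {
    "Fiction": 538_219,
    "Professional literature": 912_194,
    "Journalism": 815_349,
}

total = sum(TOKENS.values())

# Share of each text type, rounded to whole percent.
shares = {genre: round(100 * count / total) for genre, count in TOKENS.items()}

print(total)   # prints 2265762
print(shares)  # prints {'Fiction': 24, 'Professional literature': 40, 'Journalism': 36}
```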

Morphological annotation

The Etalon corpus is segmented, lemmatized, and morphologically annotated in the same way as SYN2020: the corpus contains attributes word, synword, lemma, sublemma, tag and verbtag.
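
As a rough illustration, one annotated position could be represented as a record with the six attributes listed above. The sketch below assumes, without confirmation from the corpus documentation, that a position is stored as one tab-separated line with the attributes in exactly this order; the `Token` type, the `parse_token_line` helper and the sample values are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with the six attributes named in the text."""
    word: str
    synword: str
    lemma: str
    sublemma: str
    tag: str
    verbtag: str


def parse_token_line(line: str) -> Token:
    """Split a tab-separated line into a Token (column order is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Token._fields):
        raise ValueError(f"expected {len(Token._fields)} columns, got {len(fields)}")
    return Token(*fields)


# Placeholder values, not real corpus data.
sample = "WORD\tSYNWORD\tLEMMA\tSUBLEMMA\tTAG\tVERBTAG"
token = parse_token_line(sample)
print(token.lemma)  # prints LEMMA
```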

Accessing the corpus

The Etalon corpus is accessible in two ways:

In the Czech National Corpus (CNK), via the KonText interface.
As data in vertical format, which can be downloaded from the LINDAT/CLARIN repository (for non-commercial use). The data are divided into segments of at most 100 words (not counting punctuation), and the segments are shuffled.
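
For readers working with the downloaded data, the sketch below shows one way a vertical file might be read sentence by sentence. It assumes the usual vertical layout (one token per line with tab-separated attributes, structural tags such as <doc> and <s> on lines of their own); the exact layout of the distributed files should be checked against the data itself. The helper names are hypothetical, and the sample is made up and shortened to two columns for brevity.

```python
import re
import unicodedata
from typing import Iterable, Iterator, List

# A structural line looks like <s>, </s>, <doc ...> (assumed layout).
STRUCT_RE = re.compile(r"^<(/?)([A-Za-z][\w-]*)[^<>]*>$")


def iter_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each <s>...</s> block as a list of tokens (lists of attributes)."""
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        match = STRUCT_RE.match(line)
        if match:
            closing, name = match.group(1), match.group(2)
            if name == "s" and not closing:
                sentence = []
            elif name == "s" and sentence is not None:
                yield sentence
                sentence = None
        elif sentence is not None:
            sentence.append(line.split("\t"))


def is_punctuation(word: str) -> bool:
    """True if every character belongs to a Unicode punctuation category."""
    return bool(word) and all(unicodedata.category(ch).startswith("P") for ch in word)


# Made-up sample with two columns (word, lemma); not real corpus data.
SAMPLE = "<doc>\n<s>\nAhoj\tahoj\n,\t,\nsvěte\tsvět\n!\t!\n</s>\n<s>\nDobře\tdobře\n.\t.\n</s>\n</doc>\n"

sentences = list(iter_sentences(SAMPLE.splitlines()))
n_tokens = sum(len(s) for s in sentences)
n_words = sum(1 for s in sentences for tok in s if not is_punctuation(tok[0]))
print(len(sentences), n_tokens, n_words)  # prints 2 6 3
```

Because the segments are shuffled, any split into training and testing data is probably best made on whole segments or sentences rather than on individual lines.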

Acknowledgments

I would like to thank all the annotators who contributed to the creation of Etalon. There were many of them and I can't list all of them here. But I would also like to thank my colleagues from ÚTKL, especially Milena Hnátková, Vladimír Petkevič and Tomáš Jelínek for their help in testing and detecting errors.

How to cite Etalon corpus

Skoumalová, H.: Etalon: manually annotated synchronic corpus of Czech texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2021. Available from WWW: http://www.korpus.cz
