~~NOTOC~~
====== Corpus Etalon: manually annotated corpus of Czech texts ======
The Etalon corpus is a synchronic morphologically annotated corpus of written Czech. The morphological tagging was performed manually, following the same principles as in [[en:cnk:syn2020|SYN2020]].
^ Name ^^^ Etalon ^
^ Positions ^ Number of tokens ^| 2 265 762 |
^ ::: ^ Number of positions without punctuation ^| 1 885 621 |
^ Structures ^ Number of documents ^| 94 |
^ ::: ^ Number of sentences ^| 153 774 |
^ Corpus composition ^ Fiction ^ tokens | 538 219 |
^ ::: ^ ::: ^ words | 436 548 |
^ ::: ^ ::: ^ sentences | 38 919 |
^ ::: ^ Professional literature ^ tokens | 912 194 |
^ ::: ^ ::: ^ words | 758 227 |
^ ::: ^ ::: ^ sentences | 60 098 |
^ ::: ^ Journalism ^ tokens | 815 349 |
^ ::: ^ ::: ^ words | 690 846 |
^ ::: ^ ::: ^ sentences | 54 757 |
^ Year of publication ^^ | 2021 |
The corpus, which contains 2 265 762 tokens (including punctuation), is intended to serve two main purposes:
- As a standard for the [[en:cnk:syn2020|SYN2020]] corpus, i.e. when there is doubt about the correctness of sentence segmentation, tokenization or morphological tagging in SYN2020 or in other corpora of the [[en:cnk:syn|SYN]] series, Etalon should show how the text ought to be annotated.
- As a set of training and testing data for automatic tagging methods, whether rule-based, stochastic, neural, or other.
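For the second purpose, a tagger's output is typically compared token by token with the manual annotation. The following is only a minimal sketch, assuming the gold and predicted tags are already aligned lists of strings. The function name ''tag_accuracy'' and the sample tags are illustrative placeholders, not part of any CNK tooling or of the real tagset.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)


# Hypothetical tags for a four-token sentence (placeholders, not real tagset values).
gold = ["TAG_A", "TAG_B", "TAG_C", "TAG_D"]
predicted = ["TAG_A", "TAG_B", "TAG_X", "TAG_D"]

accuracy = tag_accuracy(gold, predicted)
print(f"{accuracy:.2%}")
```

The same comparison could be run separately on lemmas or on individual tag positions if a finer error analysis is needed.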
===== Etalon corpus composition =====
The Etalon corpus is composed of journalistic, professional and fiction texts. Most of the texts come from the [[en:cnk:syn2010|SYN2010]] corpus. Journalistic (36%) and professional (40%) texts predominate, but fiction (24%) is also substantially represented (see the table). Some texts are not included in full, because the complete texts would have caused an imbalance of styles. Other texts were edited, because the main goal of our work was to obtain standard texts: we corrected obvious typos, as well as sentences in which the text had been broken up and rearranged during computer processing.
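The percentages quoted above can be checked against the token counts in the table. A short sketch of that arithmetic follows; the variable names are illustrative only.

```python
# Token counts per text type, copied from the table above.
TOKENS = {
    "Fiction": 538_219,
    "Professional literature": 912_194,
    "Journalism": 815_349,
}

total = sum(TOKENS.values())  # corpus size in tokens

# Percentage share of each text type, rounded to a whole percent.
shares = {name: round(100 * count / total) for name, count in TOKENS.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```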
===== Morphological annotation =====
The Etalon corpus is segmented, lemmatized, and morphologically annotated in the same way as [[en:cnk:syn2020#annotation_of_syn2020changes_compared_to_other_corpora_of_the_syn_series|SYN2020]]. It contains the attributes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|word, synword]], [[en:cnk:syn2020#lemmatization|lemma, sublemma]], [[en:cnk:syn2020#morphological_tagging|tag]] and [[en:cnk:syn2020#verb_tagging_verbtag|verbtag]].
===== Accessing the corpus =====
The Etalon corpus is accessible in two ways:
- As a CNK corpus through the [[en:manualy:kontext:index|KonText]] interface.
- As data in vertical format, which can be downloaded from the [[http://hdl.handle.net/11234/1-3698|LINDAT/CLARIN]] repository (for non-commercial use). The data are divided into segments of at most 100 words (not counting punctuation), and the segments are shuffled.
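Vertical data of this kind can be read with a few lines of code. The sketch below assumes the common vertical layout: one token per line with tab-separated attributes, and structural markup such as ''<s>'' on separate lines. The column order in ''COLUMNS'', the function name ''read_vertical'' and the sample tags are assumptions made for illustration; the actual layout should be checked against the documentation distributed with the data.

```python
import io

# ASSUMPTION: order of the positional attributes in each token line.
# Verify against the documentation shipped with the downloaded data.
COLUMNS = ("word", "lemma", "tag")


def read_vertical(lines, columns=COLUMNS):
    """Yield sentences as lists of attribute dicts from vertical-format lines.

    A line without tabs that starts with "<" and ends with ">" is treated
    as structural markup; a closing </s> ends the current sentence.
    Every other non-empty line is a token with tab-separated attributes.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        is_markup = line.startswith("<") and line.endswith(">") and "\t" not in line
        if is_markup:
            if line.startswith("</s>") and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:  # tolerate a missing final </s>
        yield sentence


# Invented sample with placeholder tags (not real tagset values).
SAMPLE = (
    "<s>\n"
    "Pes\tpes\tTAG_1\n"
    "štěká\tštěkat\tTAG_2\n"
    ".\t.\tTAG_3\n"
    "</s>\n"
)

sentences = list(read_vertical(io.StringIO(SAMPLE)))
print([token["lemma"] for token in sentences[0]])
```

Because the segments are shuffled, sentence-level processing like this is possible, but the original documents cannot be reconstructed from the download.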
===== Acknowledgments =====
I would like to thank all the annotators who contributed to the creation of Etalon. There were many of them, and I cannot list them all here. I would also like to thank my colleagues from [[http://utkl.ff.cuni.cz|ÚTKL]], especially Milena Hnátková, Vladimír Petkevič and Tomáš Jelínek, for their help in testing and detecting errors.
===== How to cite Etalon corpus =====
Skoumalová, H.: // Etalon: manually annotated synchronic corpus of Czech texts //. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2021. Available from WWW: http://www.korpus.cz