Manually annotated treebank of Czech fiction
Name | FicTree |
---|---|
Number of tokens | 166 432 |
Number of tokens (excl. punctuation) | 134 637 |
Number of word forms | 29 914 |
Number of lemmas | 13 668 |
Number of sentences | 12 760 |
Publication date | 2017 |
FicTree is a syntactically annotated corpus of Czech fiction. It consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The data was manually annotated according to the Prague Dependency Treebank guidelines (annotation on the analytical layer). The texts are split into chunks of at most 100 words (respecting sentence boundaries), and the chunks are presented in random order.
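The chunking described above can be illustrated with a minimal sketch. This is not the tooling used to build FicTree; the function name `chunk_sentences`, the `seed` parameter, and the choice to count every token as a word are assumptions made for illustration only.

```python
import random


def chunk_sentences(sentences, max_words=100, seed=0):
    """Group consecutive sentences into chunks of at most max_words words,
    then shuffle the chunks.

    `sentences` is a list of sentences, each a list of word strings.
    Assumption: a single sentence longer than max_words is kept whole
    as its own chunk, since sentence boundaries are never broken.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence)
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(current)
    # Shuffle with a fixed seed so the result is reproducible.
    random.Random(seed).shuffle(chunks)
    return chunks
```

For example, five sentences of 40 words each would end up in three chunks (two of 80 words and one of 40), returned in shuffled order.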
Annotation procedure
The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually corrected and the two versions merged. Any differences were resolved manually.
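As a rough illustration of the merging step, the sketch below locates the tokens on which two versions of a parsed sentence disagree, so that only those tokens need manual resolution. It is a hypothetical helper, not the project's actual tooling. It assumes each version is a list of `(head, afun)` pairs, one per token, where `head` is the 1-based index of the governing token (0 for the technical root) and `afun` is the analytical function label.

```python
def parse_disagreements(parse_a, parse_b):
    """Return 1-based positions of tokens whose head or label differs.

    Both arguments are lists of (head, afun) pairs for the same sentence.
    """
    if len(parse_a) != len(parse_b):
        raise ValueError("both versions must cover the same tokens")
    return [
        position
        for position, (a, b) in enumerate(zip(parse_a, parse_b), start=1)
        if a != b
    ]


# Illustrative three-token sentence "Pes štěká ." (made-up example data).
version_a = [(2, "Sb"), (0, "Pred"), (0, "AuxK")]
version_b = [(2, "Sb"), (0, "Pred"), (2, "AuxK")]
print(parse_disagreements(version_a, version_b))  # [3]
```

Here the two versions agree on the subject and the predicate and differ only in the attachment of the final punctuation, so only token 3 would be flagged for manual resolution.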
Text details
The eight texts in the treebank comprise six fiction titles, a children’s fiction book, and a book of memoirs. Most of the texts were first published between 1991 and 2007; one was first published in 1969. 80% of the texts are original Czech texts and 20% are translations (from German and Slovak).
References
Tomáš Jelínek, 2017. FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf
Acknowledgments
We wish to thank the participants in the annotation effort, including Milena Hnátková, Ivana Klímová, Alena Kropíková, Hana Skoumalová and Olga Zitová.