FicTree - a manually annotated treebank of Czech fiction

Name: FicTree
Number of tokens: 166,432
Number of tokens (excl. punctuation): 134,637
Number of word forms: 29,914
Number of lemmas: 13,668
Number of sentences: 12,760
Publication date: 2017

FicTree is a syntactically annotated corpus of Czech fiction. It consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The text data was manually annotated on the analytical layer, following the Prague Dependency Treebank guidelines. The texts are split into chunks of at most 100 words (respecting sentence boundaries), and the chunks are shuffled into random order.
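
The chunking and shuffling described above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used to build FicTree: the function names, the greedy grouping strategy, the fixed seed, and counting every list item as a "word" are all hypothetical choices.

```python
import random


def chunk_sentences(sentences, max_words=100):
    """Greedily group consecutive sentences into chunks of at most max_words.

    Each sentence is a list of words and is never split. A single sentence
    longer than max_words ends up alone in its own chunk (an assumption;
    the source does not say how such cases were handled).
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and count + len(sentence) > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += len(sentence)
    if current:
        chunks.append(current)
    return chunks


def shuffle_chunks(sentences, max_words=100, seed=0):
    """Chunk the sentences, then return the chunks in random order."""
    chunks = chunk_sentences(sentences, max_words)
    random.Random(seed).shuffle(chunks)  # seeded for reproducibility
    return chunks
```

Sentence order is preserved inside each chunk; only the order of the chunks is randomized, so the original text cannot be read continuously.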

Annotation procedure

The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually corrected and the two versions merged. Any differences were resolved manually.
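
As a rough illustration of how the two corrected versions of a sentence might be compared before manual resolution, the sketch below lists the tokens on which two dependency analyses disagree. The data layout (one `(head, label)` pair per token), the function name, and the example sentence are assumptions made for this sketch; the source does not describe the tooling actually used.

```python
def parse_disagreements(parse_a, parse_b):
    """Return 1-based positions of tokens whose head or label differs.

    Each parse is a list of (head, label) pairs, one per token, where
    head is the 1-based index of the governing token (0 for the root).
    """
    if len(parse_a) != len(parse_b):
        raise ValueError("both parses must cover the same tokens")
    return [
        position
        for position, (a, b) in enumerate(zip(parse_a, parse_b), start=1)
        if a != b
    ]


# Made-up example: "Petr čte knihu ." in two versions that differ
# only in the label assigned to the third token.
version_a = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (0, "AuxK")]
version_b = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (0, "AuxK")]
differences = parse_disagreements(version_a, version_b)
```

Tokens on which both versions agree can be accepted as they are; only the flagged positions need a human decision.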

Text details

The eight texts in the treebank comprise six fiction titles, a children’s fiction book, and a book of memoirs. All but one of the texts were first published between 1991 and 2007; the remaining text was first published in 1969. 80% of the texts are original Czech texts, and 20% are translations (from German and Slovak).

References

Tomáš Jelínek, 2017. FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf

Acknowledgments

We wish to thank the participants in the annotation effort, including Milena Hnátková, Ivana Klímová, Alena Kropíková, Hana Skoumalová and Olga Zitová.