en:cnk:fictree - Příručka ČNK

Manually annotated treebank of Czech fiction

Name	FicTree
Number of tokens	166 432
Number of tokens (excl. punctuation)	134 637
Number of word forms	29 914
Number of lemmas	13 668
Number of sentences	12 760
Publication date	2017
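
A few derived figures follow directly from the counts above. The short sketch below computes them; it assumes that every token missing from the "excl. punctuation" count is a punctuation mark, and the variable names are chosen here for illustration only.

```python
# Counts copied from the table above.
tokens = 166_432
tokens_excl_punct = 134_637
sentences = 12_760

# Assumption: the difference between the two token counts is punctuation.
punctuation_tokens = tokens - tokens_excl_punct
punctuation_share = punctuation_tokens / tokens

# Average sentence length, with and without punctuation.
avg_tokens_per_sentence = tokens / sentences
avg_words_per_sentence = tokens_excl_punct / sentences

print(f"punctuation tokens: {punctuation_tokens}")
print(f"punctuation share: {punctuation_share:.1%}")
print(f"tokens per sentence: {avg_tokens_per_sentence:.2f}")
print(f"words per sentence: {avg_words_per_sentence:.2f}")
```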

FicTree is a syntactically annotated corpus of Czech fiction. It consists of 12,760 sentences (166,432 tokens). The texts come from eight literary works published in the Czech Republic between 1991 and 2007. The data was manually annotated according to the Prague Dependency Treebank guidelines for the analytical layer. The texts were split into chunks of at most 100 words, respecting sentence boundaries, and the chunks were shuffled into random order.
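
The chunking described above can be illustrated with a minimal sketch. This is not the procedure actually used to build FicTree, which the page does not specify in detail. It assumes greedy packing of consecutive sentences, lets a sentence longer than the limit form a chunk of its own, and counts every token as a word; the function names `chunk_sentences` and `shuffle_chunks` are hypothetical.

```python
import random


def chunk_sentences(sentences, max_words=100):
    """Greedily group consecutive sentences into chunks of at most max_words.

    Each sentence is a list of words. Sentences are never split, so a single
    sentence longer than max_words ends up alone in an oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence)
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(current)
    return chunks


def shuffle_chunks(chunks, seed=0):
    """Return the chunks in random order (seeded, so the result is repeatable)."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy input: four "sentences" of 40, 50, 30 and 120 words.
toy = [["a"] * 40, ["b"] * 50, ["c"] * 30, ["d"] * 120]
toy_chunks = chunk_sentences(toy)
toy_shuffled = shuffle_chunks(toy_chunks)
```

Shuffling whole chunks rather than individual sentences keeps a little local context for each sentence while making it impractical to reconstruct the original works.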

Annotation procedure

The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually corrected and the two versions merged. Any differences were resolved manually.
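
To make the merging step concrete, the sketch below shows one way the points of disagreement between two dependency analyses of the same sentence could be located. It is an illustration only, not the tooling used for FicTree: the representation of a parse as a list of (head, label) pairs and the function name `find_disagreements` are assumptions, and the toy sentence is invented.

```python
def find_disagreements(parse_a, parse_b):
    """Return 1-based token positions where two parses of a sentence differ.

    Each parse is a list with one (head_index, label) pair per token,
    where head_index 0 stands for the technical root of the tree.
    """
    if len(parse_a) != len(parse_b):
        raise ValueError("parses must cover the same number of tokens")
    return [
        position
        for position, (a, b) in enumerate(zip(parse_a, parse_b), start=1)
        if a != b
    ]


# Toy sentence "Petr čte knihu ." with analytical-layer style labels.
version_a = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (0, "AuxK")]
version_b = [(2, "Sb"), (0, "Pred"), (2, "Sb"), (0, "AuxK")]

conflicts = find_disagreements(version_a, version_b)
```

Only the positions returned here would need a human decision; tokens on which both versions agree can be carried over to the merged tree unchanged.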

Text details

The eight texts in the treebank comprise six fiction titles, a children’s fiction book, and a book of memoirs. All but one of the texts were first published between 1991 and 2007; the remaining text was first published in 1969. 80% of the texts are original Czech texts and 20% are translations (from German and Slovak).

References

Tomáš Jelínek, 2017. FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf

Acknowledgments

We wish to thank the participants in the annotation effort, including Milena Hnátková, Ivana Klímová, Alena Kropíková, Hana Skoumalová and Olga Zitová.
