The FicTree treebank is a syntactically annotated corpus of Czech fiction. It contains approximately 135,000 words (166,000 tokens including punctuation). The lemmatization and the morphological and syntactic annotation were performed manually.
Name | FicTree
---|---
Number of tokens | 166 432
Number of tokens (excl. punctuation) | 134 637
Number of word forms | 29 914
Number of lemmas | 13 668
Number of sentences | 12 760
Publication date | 2017
The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007: six fiction titles, a children’s fiction book, and a book of memoirs. All but one of the texts were also first published in that period; the exception was first published in 1969. Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak).
The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer of the Prague Dependency Treebank (PDT 2.0 with revisions 2.5 and 3.0). The corpus was parsed using two parsers, MST Parser and MaltParser, both trained on the PDT a-layer training data. The output of each parser was manually corrected by annotators, and the two corrected versions were then merged. Any differences between the two versions were resolved manually by another annotator.
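The merging step can be pictured as a token-by-token comparison of the two corrected versions, where every mismatch is handed to the adjudicating annotator. The following is only a minimal sketch of that idea, not the tooling actually used for FicTree; the `Arc` type, the `disagreements` function and the three-token sentence are hypothetical, and a head value of 0 is assumed to mark attachment to the technical root.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    head: int  # ID of the governing token (assumption: 0 = technical root)
    afun: str  # analytical function label


def disagreements(version_a: list[Arc], version_b: list[Arc]) -> list[int]:
    """Return the 1-based IDs of tokens whose head or afun differ."""
    if len(version_a) != len(version_b):
        raise ValueError("both versions must annotate the same tokens")
    return [
        token_id
        for token_id, (a, b) in enumerate(zip(version_a, version_b), start=1)
        if a != b
    ]


# Hypothetical sentence of three tokens, corrected by two annotators.
a = [Arc(2, "Sb"), Arc(0, "Pred"), Arc(2, "Obj")]
b = [Arc(2, "Sb"), Arc(0, "Pred"), Arc(2, "Adv")]
print(disagreements(a, b))  # [3]
```

Only the third token would need a decision here; tokens on which both versions agree can be accepted without further review.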
The FicTree treebank can be accessed in several ways:
The FicTree corpus is available in the same way as other CNC corpora through the KonText interface.
The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes tag and lemma; additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes pos and case.
The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus):
The data of the FicTree treebank, annotated according to the PDT a-layer guidelines, are available through the LINDAT/CLARIN repository in a vertical format (tab-separated values); sentence boundaries are marked with empty lines. Each word is written on a single line, followed by five tab-separated attributes: lemma, tag, ID (a number indicating the position of the token in the sentence), head (the ID of the governing token) and afun (the syntactic function according to PDT). The texts are divided into segments of at most 100 tokens (respecting sentence boundaries), and the segments are randomly shuffled. Every such segment constitutes a single file; the name of the file refers to the literary work, and its prefix indicates the intended partition of the data into train, dev and test data (80% - 10% - 10%).
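A file in this vertical format can be read with a few lines of Python. The sketch below assumes exactly six tab-separated columns per token line (word form, lemma, tag, ID, head, afun) and blank lines between sentences, as described above; the `Token` type, the `read_vertical` function and the sample data are illustrative only, with `TAG` standing in for a real morphological tag.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str
    id: int    # position of the token in the sentence
    head: int  # ID of the governing token
    afun: str  # syntactic function according to PDT


def read_vertical(text: str) -> list[list[Token]]:
    """Split vertical-format text into sentences of Token records."""
    sentences: list[list[Token]] = []
    current: list[Token] = []
    for line in text.splitlines():
        if not line.strip():  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, tag, tok_id, head, afun = line.split("\t")
        current.append(Token(form, lemma, tag, int(tok_id), int(head), afun))
    if current:  # the last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences


# Made-up sample; "TAG" is a placeholder, not a real tag.
SAMPLE = (
    "Pes\tpes\tTAG\t1\t2\tSb\n"
    "spí\tspát\tTAG\t2\t0\tPred\n"
    ".\t.\tTAG\t3\t0\tAuxK\n"
    "\n"
    "Ano\tano\tTAG\t1\t0\tExD\n"
    ".\t.\tTAG\t2\t0\tAuxK\n"
)
sentences = read_vertical(SAMPLE)
print(len(sentences))  # 2
```

To read a real segment file, pass its contents to `read_vertical`, for example via `pathlib.Path(name).read_text(encoding="utf-8")`; the UTF-8 encoding is an assumption here.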
The morphological and syntactic annotation according to the UD (Universal Dependencies) guidelines was produced by converting the original PDT annotation. The conversion procedure was designed by Dan Zeman and implemented in Treex. The data are available on the Universal Dependencies webpage in the CoNLL-U format. The original texts are divided into segments of at most 100 tokens; the segments are shuffled and divided into train, val and test data sets. The FicTree treebank in the UD standard is also accessible using the query tool PML-TQ.
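For the UD release, the general CoNLL-U layout applies: each token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines start with `#`, and a blank line ends a sentence. The following is a minimal sketch of a reader, not a full implementation of the format; `read_conllu` and the sample sentence are hypothetical, and the code assumes that every regular token has a numeric HEAD.

```python
def read_conllu(text: str) -> list[list[dict]]:
    """Read CoNLL-U text into sentences of simple token dictionaries."""
    sentences: list[list[dict]] = []
    current: list[dict] = []
    for line in text.splitlines():
        if line.startswith("#"):  # sentence-level comment
            continue
        if not line.strip():  # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Made-up sentence; morphological columns are left unspecified ("_").
ROWS = [
    ["1", "Pes", "pes", "NOUN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "spí", "spát", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    [".", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
ROWS[2][0] = "3"  # ID of the final punctuation token
SAMPLE_UD = "# text = Pes spí.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"
ud_sentences = read_conllu(SAMPLE_UD)
print([t["deprel"] for t in ud_sentences[0]])  # ['nsubj', 'root', 'punct']
```

For anything beyond quick inspection, a dedicated CoNLL-U library is likely the safer choice, since it also handles features, enhanced dependencies and multiword tokens.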
We wish to thank the human annotators: Ivana Klímová, Alena Kropíková and Olga Zitová; as well as Dan Zeman for the data conversion.
Jelínek, T. – Hnátková, M. – Skoumalová, H.: FicTree: manuálně syntakticky anotovaný korpus české beletrie [FicTree: a manually syntactically annotated corpus of Czech fiction]. Ústav Českého národního korpusu FF UK, Praha 2017. Available at: http://www.korpus.cz
Jelínek, T.: FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf