FicTree: a manually annotated treebank of Czech fiction

The FicTree treebank is a syntactically annotated corpus of Czech fiction. It consists of about 135,000 words (about 166,000 tokens including punctuation). Lemmatization, morphological annotation and syntactic annotation were all performed manually.

Name: FicTree
Number of tokens: 166 432
Number of tokens (excl. punctuation): 134 637
Number of word forms: 29 914
Number of lemmas: 13 668
Number of sentences: 12 760
Publication date: 2017

The composition of the FicTree treebank

The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007: six fiction titles, a children’s fiction book, and a book of memoirs. All but one of the texts were first published in that period; the exception was first published in 1969. Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak).

The syntactic annotation of the treebank

The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer (a-layer) of the Prague Dependency Treebank (PDT; version 2.0 with revisions 2.5 and 3.0). The corpus was parsed using two parsers, MST Parser and MaltParser, both trained on the PDT a-layer training data. The parser outputs were manually corrected by annotators and then merged. Any differences between the two versions were resolved manually by another annotator.

Access to the treebank

The FicTree treebank can be accessed in several ways:

  1. A CNC corpus in the KonText interface: FicTree can be queried online like any other CNC corpus.
  2. Data annotated according to PDT a-layer: the FicTree data annotated according to the PDT a-layer guidelines are available for download from the LINDAT/CLARIN repository for non-commercial use.
  3. Data annotated in the Universal Dependencies standard: the FicTree data, automatically converted to the Universal Dependencies (UD) standard, are available through the UD web page (for non-commercial use only).

1. A CNC corpus in the KonText interface

The FicTree corpus is available in the same way as other CNC corpora through the KonText interface.

The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes tag and lemma; additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes pos and case.

The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus):

  • afun – syntactic function according to the a-layer PDT annotation
  • parent – relative position of the governing token
  • eparent – relative position of the nearest governing content word
  • prep – lemma of a preposition governing the token (if any)
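  • p_lemma, p_tag, ep_lemma, ep_tag – lemma and tag of the governing token (p_) or of the nearest governing content word (ep_)
  • p_pos, p_case, ep_pos, ep_case – POS and case of the governing token (p_) or of the nearest governing content word (ep_)
  • p_afun, ep_afun – syntactic function of the governing token (p_) or of the nearest governing content word (ep_)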
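
To illustrate how the parent and p_lemma attributes relate to a dependency tree, the following minimal Python sketch derives them from absolute head IDs. It rests on several assumptions that this page does not confirm: the relative position is taken to be a signed offset (governor position minus token position), the root is given offset 0 and an empty p_lemma, and head ID 0 marks the root. The function name add_parent_attributes and the sample sentence are hypothetical.

```python
def add_parent_attributes(sentence):
    """Add 'parent' and 'p_lemma' to tokens given as dicts.

    Each input token has 'id' (1-based position), 'lemma' and 'head'
    (absolute ID of the governing token; 0 is ASSUMED to mark the root).
    """
    by_id = {tok["id"]: tok for tok in sentence}
    enriched = []
    for tok in sentence:
        governor = by_id.get(tok["head"])  # None when the token has no governor
        enriched.append({
            **tok,
            # ASSUMPTION: signed offset, positive when the governor follows.
            "parent": 0 if governor is None else governor["id"] - tok["id"],
            # ASSUMPTION: empty string when there is no governing token.
            "p_lemma": "" if governor is None else governor["lemma"],
        })
    return enriched


# Hypothetical three-token sentence ("Pes spí.").
sentence = [
    {"id": 1, "lemma": "pes", "head": 2},
    {"id": 2, "lemma": "spát", "head": 0},
    {"id": 3, "lemma": ".", "head": 0},
]
for tok in add_parent_attributes(sentence):
    print(tok["lemma"], tok["parent"], tok["p_lemma"])
```

The eparent and ep_ attributes would follow the same pattern, except that the governor is the nearest governing content word rather than the immediate governing token.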

2. Data annotated according to PDT a-layer

The data of the FicTree treebank, annotated according to the PDT a-layer guidelines, are available through the LINDAT/CLARIN repository in a vertical format (tab-separated values) in which sentence boundaries are marked with empty lines. Each word form is written on a separate line, followed by five tab-separated attributes: lemma, tag, ID (the position of the token in the sentence), head (the ID of the governing token) and afun (the syntactic function according to PDT). The texts are divided into segments of at most 100 tokens (respecting sentence boundaries), and the segments are randomly shuffled. Each segment constitutes a single file. The file name refers to the literary work, and its prefix indicates the intended partition of the data into train, dev and test sets (80%, 10% and 10%).
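
The vertical format described above can be read with a few lines of Python. This is a minimal sketch, not an official reader: the names Token and read_vertical are hypothetical, the sample tokens are invented for illustration, "TAG" is a placeholder rather than a real morphological tag, and the use of head 0 for tokens without a governor is an assumption.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str
    id: int    # position of the token in the sentence
    head: int  # ID of the governing token
    afun: str  # syntactic function according to PDT


def read_vertical(text):
    """Split vertical-format text into sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, tag, tok_id, head, afun = line.split("\t")
        current.append(Token(form, lemma, tag, int(tok_id), int(head), afun))
    if current:  # the last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences


# Invented sample: two short sentences; "TAG" is a placeholder tag.
SAMPLE = "\n".join([
    "\t".join(["Pes", "pes", "TAG", "1", "2", "Sb"]),
    "\t".join(["spí", "spát", "TAG", "2", "0", "Pred"]),
    "\t".join([".", ".", "TAG", "3", "0", "AuxK"]),
    "",
    "\t".join(["Prší", "pršet", "TAG", "1", "0", "Pred"]),
    "\t".join([".", ".", "TAG", "2", "0", "AuxK"]),
])

sentences = read_vertical(SAMPLE)
print(len(sentences), [tok.form for tok in sentences[0]])
```

Reading a whole partition would then amount to applying read_vertical to the contents of each segment file whose name carries the corresponding prefix.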

3. Data annotated in the Universal Dependencies standard

The morphological and syntactic annotation according to the UD guidelines was produced by automatically converting the original PDT annotation. The conversion procedure was designed by Dan Zeman and implemented in Treex. The data are available on the Universal Dependencies web page in the CoNLL-U format. The original texts are divided into segments of at most 100 tokens; the segments are shuffled and divided into training, validation and test sets. The FicTree treebank in the UD standard is also accessible using the query tool PML-TQ.
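
As an illustration of the CoNLL-U format, the following Python sketch reads sentences into lists of dictionaries keyed by the ten standard CoNLL-U columns. It is a simplified reader under stated assumptions: comment lines are dropped, multiword-token ranges and empty nodes are skipped, and the name read_conllu and the sample sentence are hypothetical, not taken from the FicTree data.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Return sentences as lists of {field: value} dicts for basic tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as sent_id or text
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            if cols[0].isdigit():
                current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Invented sample sentence; "_" marks unspecified values.
SAMPLE_CONLLU = "\n".join([
    "# text = Pes spí.",
    "\t".join(["1", "Pes", "pes", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "spí", "spát", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

ud_sentences = read_conllu(SAMPLE_CONLLU)
print([(tok["form"], tok["deprel"]) for tok in ud_sentences[0]])
```

For anything beyond quick inspection, an established CoNLL-U library is likely a better choice than a hand-written reader.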

Acknowledgments

We wish to thank the human annotators: Ivana Klímová, Alena Kropíková and Olga Zitová; as well as Dan Zeman for the data conversion.

How to cite FicTree

Jelínek, T. – Hnátková, M. – Skoumalová, H.: FicTree: manuálně syntakticky anotovaný korpus české beletrie. Ústav Českého národního korpusu FF UK, Prague 2017. Available at: http://www.korpus.cz

Jelínek, T.: FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf