Differences
This shows you the differences between two versions of the page.
| en:cnk:fictree [2017/11/14 09:22] – tomasjelinek | en:cnk:fictree [2017/12/18 19:24] (current) – [How to cite FicTree] michalkren | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ===== FicTree | + | ===== FicTree: a manually annotated |
| + | The FicTree treebank is a syntactically annotated corpus of Czech fiction. It consists of 135,000 words (166,000 tokens). | ||
| <WRAP right 35%> | <WRAP right 35%> | ||
| ^ <fs medium> | ^ <fs medium> | ||
| Line 11: | Line 11: | ||
| ^ ::: ^ Publication date | 2017 | | ^ ::: ^ Publication date | 2017 | | ||
| </ | </ | ||
| - | FicTree is a syntactically annotated corpus | + | ===== The composition |
| - | The texts come from eight literary works published in the Czech Republic between 1991 and 2007. | + | |
| - | The text data was manually annotated according to the [[https:// | + | |
| - | guidelines]] (annotation on the analytical layer). | + | |
| - | To comply with agreements concluded with the copyright holders, the texts are shuffled into random chunks of at most 100 words (respecting sentence boundaries). | + | |
| - | === Annotation procedure === | + | The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007. The texts in the treebank include six fiction titles, a children’s fiction |
| - | The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually | + | All but one of the texts were first published between 1991 and 2007; the exception was first published in 1969. |
| - | corrected and the two versions merged. Any differences | + | Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak). |
| - | === Text details | + | ===== The syntactic annotation of the treebank ===== |
| - | The eight texts in the treebank | + | The FicTree |
| - | Most of the texts were first published between 1991 and 2007 except one text, published in 1969. | + | |
| - | 80% of the texts are original Czech texts, 20% are translations (from German and Slovak). | + | |
| - | === References | + | ===== Access to the treebank ===== |
| - | Tomáš Jelínek, 2017. //FicTree: a Manually Annotated Treebank of Czech Fiction// | + | |
| - | In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, | + | |
| - | http:// | + | |
| - | === Acknowledgments === | + | The FicTree treebank can be accessed in several ways: |
| + | - [[en: | ||
| + | - [[en: | ||
| + | - [[en: | ||
| - | We wish to thank the participants | + | ===== 1. A CNC corpus |
| - | Olga Zitová. | + | |
| + | The FicTree corpus is available in the same way as other CNC corpora through the [[en: | ||
| + | The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes [[seznamy: | ||
| + | |||
| + | The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus): | ||
| + | * afun – syntactic function according to the a-layer PDT annotation | ||
| + | * parent – relative position of the governing token | ||
| + | * eparent – relative position of the nearest governing content word | ||
| + | * prep – lemma of a preposition governing the token (if any) | ||
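| + | * p_lemma, p_tag, ep_lemma, ep_tag – lemma and tag of the governing token (p_) or of the nearest governing content word (ep_) | ||
| + | * p_pos, p_case, ep_pos, ep_case – POS and case of the governing token (p_) or of the nearest governing content word (ep_) | ||
| + | * p_afun, ep_afun – syntactic function of the governing token (p_) or of the nearest governing content word (ep_) | ||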
| + | |||
| + | ===== 2. Data annotated according to PDT a-layer ===== | ||
| + | |||
| + | The data of the FicTree treebank, annotated according to the PDT a-layer guidelines, are available through the | ||
| + | [[https:// | ||
| + | |||
| + | ===== 3. Data annotated in the Universal Dependencies standard ===== | ||
| + | |||
| + | The morphological and syntactic annotation according to the UD guidelines was performed by converting the original PDT annotation. The conversion procedure was designed by Dan Zeman and implemented in [[https:// | ||
| + | The data are available on the [[http:// | ||
| + | |||
| + | ===== Acknowledgments ===== | ||
| + | We wish to thank the human annotators Ivana Klímová, Alena Kropíková and Olga Zitová, as well as Dan Zeman for the data conversion. | ||
| + | |||
| + | ===== How to cite FicTree ===== | ||
| + | <WRAP round tip 70%> | ||
| + | Jelínek, T. – Hnátková, M. – Skoumalová, | ||
| + | |||
| + | Jelínek, T.: FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): //ITAT 2017 Proceedings//, | ||
| + | </ | ||