en:cnk:fictree [2017/12/12 10:44] – michalkren | en:cnk:fictree [2017/12/18 19:24] (current) – [How to cite FicTree] michalkren |
---|
===== FicTree - a manually annotated treebank of Czech fiction ===== | ===== FicTree: a manually annotated treebank of Czech fiction ===== |
| |
| The FicTree treebank is a syntactically annotated corpus of Czech fiction. It consists of 135,000 words (166,000 tokens). The lemmatization, the morphological and syntactic annotation were performed manually. |
<WRAP right 35%> | <WRAP right 35%> |
^ <fs medium>Name</fs> ^^ <fs medium>FicTree</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>FicTree</fs> ^ |
^ ::: ^ Publication date | 2017 | | ^ ::: ^ Publication date | 2017 | |
</WRAP> | </WRAP> |
FicTree is a syntactically annotated corpus of Czech fiction. It consists of 12,760 sentences (166,432 tokens). | ===== The composition of the FicTree treebank ===== |
The texts come from eight literary works published in the Czech Republic between 1991 and 2007. | |
The text data was manually annotated according to the [[https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank guidelines]] (annotation on the analytical layer). The texts are shuffled into random chunks of maximum 100 words (respecting sentence boundaries). | |
| |
=== Annotation procedure === | The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007. The texts in the treebank include six fiction titles, a children’s fiction book, and a book of memoirs. |
The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually | Most of the texts were first published between 1991 and 2007 except for one text, published in 1969. |
corrected and the two versions merged. Any differences were resolved manually. | Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak). |
| |
=== Text details === | ===== The syntactic annotation of the treebank ===== |
| |
The eight texts in the treebank include six fiction titles, a children’s fiction book, and a book of memoirs. | The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer (a-layer) of the Prague Dependency Treebank, PDT ([[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT 2.0]] with revisions [[http://ufal.mff.cuni.cz/pdt2.5/cs/documentation.html|2.5]] and [[http://ufal.mff.cuni.cz/pdt3.0|3.0]]). The corpus was parsed using two parsers, [[https://sourceforge.net/projects/mstparser/|MST Parser]] and [[http://www.maltparser.org/|MaltParser]], both trained on the PDT a-layer training data. The output of each parser was manually corrected by annotators, and the two corrected versions were then merged. Any differences between the two versions were resolved manually by another annotator. |
Most of the texts were first published between 1991 and 2007 except one text, published in 1969. | |
80% of the texts are original Czech texts, 20% are translations (from German and Slovak). | |
| |
=== References === | ===== Access to the treebank ===== |
Tomáš Jelínek, 2017. //FicTree: a Manually Annotated Treebank of Czech Fiction//. | |
In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf | |
| |
=== Acknowledgments === | The FicTree treebank can be accessed in several ways: |
| - [[en:cnk:fictree#a_cnc_corpus_in_the_kontext_interface|A CNC corpus in the KonText interface]]: FicTree is available as a [[en:cnk:uvod|CNC corpus]] in the [[en:manualy:kontext:index|KonText]] interface. |
| - [[en:cnk:fictree#data_annotated_according_to_pdt_a-layer|Data annotated according to PDT a-layer]]: the data of the FicTree treebank annotated according to the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT a-layer guidelines]] are available for download from the [[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository for non-commercial use. |
| - [[en:cnk:fictree#data_annotated_in_the_Universal_Dependencies_standard|Data annotated in the Universal Dependencies standard]]: the data of the FicTree treebank, automatically converted to the [[http://universaldependencies.org/|Universal Dependencies]] standard, are available through the [[http://universaldependencies.org/treebanks/cs_fictree/index.html|UD web page]] (for non-commercial use only). |
| |
We wish to thank the participants in the annotation effort, including Milena Hnátková, Ivana Klímová, Alena Kropíková, Hana Skoumalová and Olga Zitová. | ===== 1. A CNC corpus in the KonText interface ===== |
| |
| The FicTree corpus is available in the same way as other CNC corpora through the [[en:manualy:kontext:index|KonText]] interface. |
| |
| The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes [[seznamy:tagy|tag]] and [[en:pojmy:lemma|lemma]]; additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes **pos** and **case**. |
| |
| The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus): |
| * afun – syntactic function according to the a-layer PDT annotation |
| * parent – relative position of the governing token |
| * eparent – relative position of the nearest governing content word |
| * prep – lemma of a preposition governing the token (if any) |
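| * p_lemma, p_tag, ep_lemma, ep_tag – lemma and tag of the governing token (p_) or of the nearest governing content word (ep_) |
| * p_pos, p_case, ep_pos, ep_case – POS and case of the governing token (p_) or of the nearest governing content word (ep_) |
| * p_afun, ep_afun – syntactic function of the governing token (p_) or of the nearest governing content word (ep_) |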
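The relative-position attributes **parent** and **eparent** can be read as offsets from the current token. The following is a minimal Python sketch under assumptions that are not stated above: values are taken to be signed integers such as ''+1'' or ''-2'', positive values are taken to point to the right, and ''0'' is taken to mark a token without a governor. The helper ''resolve_parent'' and the sample values are hypothetical and are not part of KonText.

```python
from typing import List, Optional


def resolve_parent(index: int, parent: str) -> Optional[int]:
    """Turn a relative position into an absolute token index.

    Assumption: `parent` is a signed integer string ("+1", "-2"),
    and "0" means the token has no governing token in the sentence.
    """
    offset = int(parent)  # int() accepts an explicit "+" or "-" sign
    if offset == 0:
        return None
    return index + offset


# Made-up three-token sentence with made-up attribute values.
tokens: List[str] = ["Pes", "spí", "."]
parents: List[str] = ["+1", "0", "-1"]

governors = [resolve_parent(i, p) for i, p in enumerate(parents)]

for token, governor in zip(tokens, governors):
    head = tokens[governor] if governor is not None else "(none)"
    print(f"{token} <- {head}")
```

The actual value format should be checked against the corpus before relying on this mapping.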
| |
| ===== 2. Data annotated according to PDT a-layer ===== |
| |
| The data of the FicTree treebank, annotated according to the PDT a-layer guidelines, are available through the |
| [[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository in a ''vertical'' format (tab-separated values), with sentence boundaries marked by empty lines. Each word is written on a single line, followed by five tab-separated attributes: **lemma**, **tag**, **ID** (a number indicating the position of the token in the sentence), **head** (ID of the governing token) and **afun** (syntactic function according to [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT]]). The texts are divided into segments of at most 100 tokens (respecting sentence boundaries), and the segments are randomly shuffled. Every such segment constitutes a single file; the name of the file refers to the literary work, and its prefix indicates the intended partition of the data into train, dev and test data (80% - 10% - 10%). |
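A file in the vertical format described above can be read with a few lines of Python. This is only a sketch: the ''Token'' type, the function ''read_vertical'' and the sample sentences are hypothetical, the tags are placeholders and not real tags, and treating head ''0'' as the root is an assumption.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str    # the word itself
    lemma: str
    tag: str
    tok_id: int  # position of the token in the sentence
    head: int    # ID of the governing token
    afun: str    # syntactic function


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; empty lines end a sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        form, lemma, tag, tok_id, head, afun = line.split("\t")
        sentence.append(Token(form, lemma, tag, int(tok_id), int(head), afun))
    if sentence:  # last sentence when the file has no trailing empty line
        yield sentence


# Made-up sample data; "TAG" stands in for a real morphological tag.
SAMPLE = "\n".join([
    "Pes\tpes\tTAG\t1\t2\tSb",
    "spí\tspát\tTAG\t2\t0\tPred",
    ".\t.\tTAG\t3\t0\tAuxK",
    "",
    "Prší\tpršet\tTAG\t1\t0\tPred",
])

sentences = list(read_vertical(SAMPLE.splitlines()))
for sent in sentences:
    print([(t.form, t.head, t.afun) for t in sent])
```

For a downloaded segment file, the same function can be applied to an open file object, for example ''read_vertical(open(path, encoding="utf-8"))''. The encoding is also an assumption.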
| |
| ===== 3. Data annotated in the Universal Dependencies standard ===== |
| |
| The morphological and syntactic annotation according to the UD guidelines was performed by converting the original PDT annotation. The conversion procedure was designed by Dan Zeman and implemented in [[https://github.com/ufal/treex|Treex]]. |
| The data are available on the [[http://universaldependencies.org/treebanks/cs_fictree/index.html|Universal Dependencies]] webpage. They are in the [[http://universaldependencies.org/format.html|CoNLL-U format]]. The original texts are divided into segments of at most 100 tokens; the segments are shuffled and divided into train, dev and test data sets. The FicTree treebank in the UD standard is also accessible using the query tool [[https://lindat.mff.cuni.cz/services/pmltq/|PML-TQ]]. |
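The CoNLL-U files can be processed without any dedicated library. The sketch below assumes the standard ten tab-separated CoNLL-U columns, comment lines starting with ''#'', and empty lines between sentences; ''read_conllu'' is a hypothetical helper, and the sample sentence is made up and not taken from FicTree.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of dicts keyed by CoNLL-U column name."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence


# Made-up sample sentence with underscores for unspecified fields.
SAMPLE = "\n".join([
    "# text = Pes spí.",
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tspí\tspát\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
deprel_counts = Counter(tok["deprel"] for sent in sentences for tok in sent)
print(deprel_counts)
```

Keeping all columns as strings avoids guessing at value types; ''head'' can be converted with ''int()'' once multiword tokens and empty nodes have been filtered out, as they are here.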
| |
| ===== Acknowledgments ===== |
| We wish to thank the human annotators: Ivana Klímová, Alena Kropíková and Olga Zitová; as well as Dan Zeman for the data conversion. |
| |
| ===== How to cite FicTree ===== |
| <WRAP round tip 70%> |
| Jelínek, T. – Hnátková, M. – Skoumalová, H.: //FicTree: manuálně syntakticky anotovaný korpus české beletrie//. Ústav Českého národního korpusu FF UK, Prague 2017. Available at: http://www.korpus.cz |
| |
| Jelínek, T.: FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): //ITAT 2017 Proceedings//, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf |
| </WRAP> |
| |