
Differences

This shows you the differences between two versions of the page.

en:cnk:fictree [2017/11/14 11:30] michalkren en:cnk:fictree [2017/12/18 19:24] – [How to cite FicTree] michalkren
Line 1: Line 1:
-===== FicTree: a treebank of Czech fiction ===== +===== FicTree: manually annotated treebank of Czech fiction =====
  
 +The FicTree treebank is a syntactically annotated corpus of Czech fiction. It consists of 135,000 words (166,000 tokens). Lemmatization as well as morphological and syntactic annotation were performed manually.
 <WRAP right 35%> <WRAP right 35%>
 ^ <fs medium>Name</fs> ^^ <fs medium>FicTree</fs> ^ ^ <fs medium>Name</fs> ^^ <fs medium>FicTree</fs> ^
Line 11: Line 11:
 ^ ::: ^ Publication date | 2017  | ^ ::: ^ Publication date | 2017  |
 </WRAP> </WRAP>
-FicTree is a syntactically annotated corpus of Czech fiction. It consists of 12,760 sentences (166,432 tokens).  +===== The composition of the FicTree treebank =====
-The texts come from eight literary works published in the Czech Republic between 1991 and 2007. +
-The text data was manually annotated according to the [[https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank guidelines]] (annotation on the analytical layer). The texts are shuffled into random chunks of maximum 100 words (respecting sentence boundaries).+
  
-=== Annotation procedure === +The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007. The texts in the treebank include six fiction titles, a children’s fiction book, and a book of memoirs.
-The texts were parsed independently by two parsers trained on the Prague Dependency Treebank data (analytical layer). The parsing results were manually +Most of the texts were first published between 1991 and 2007, except for one text published in 1969.
-corrected and the two versions merged. Any differences were resolved manually. +Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak).
  
-=== Text details ===+===== The syntactic annotation of the treebank =====
  
-The eight texts in the treebank include six fiction titles, children’s fiction  book, and a book of memoirs +The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer of the Prague Dependency Treebank – PDT ([[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT 2.0]] with revisions [[http://ufal.mff.cuni.cz/pdt2.5/cs/documentation.html|2.5]] and [[http://ufal.mff.cuni.cz/pdt3.0|3.0]]). The corpus was parsed using two parsers ([[https://sourceforge.net/projects/mstparser/|MST Parser]] and [[http://www.maltparser.org/|MaltParser]]) trained on the PDT a-layer training data. The results were manually corrected by annotators, then merged. Any differences between the two versions were resolved manually by another annotator.
-Most of the texts were first published between 1991 and 2007 except one text, published in 1969. +
-80% of the texts are original Czech texts, 20% are translations (from German and Slovak).+
  
-=== References === +===== Access to the treebank =====
-Tomáš Jelínek, 2017. //FicTree: a Manually Annotated Treebank of Czech Fiction//. +
-In: J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf+
  
-=== Acknowledgments ===+The FicTree treebank can be accessed in several ways: 
 +  - [[en:cnk:fictree#a_cnc_corpus_in_the_kontext_interface|A CNC corpus in the KonText interface]]: FicTree is available as a [[en:cnk:uvod|CNC corpus]] in the [[en:manualy:kontext:index|KonText]] interface. 
 +  - [[en:cnk:fictree#data_annotated_according_to_pdt_a-layer|Data annotated according to PDT a-layer]]: the data of the FicTree treebank annotated according to the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT a-layer guidelines]] are available for download from the [[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository for non-commercial use. 
 +  - [[en:cnk:fictree#data_annotated_in_the_Universal_Dependencies_standard|Data annotated in the Universal Dependencies standard]]: the data of the FicTree treebank, automatically converted to the [[http://universaldependencies.org/|Universal Dependencies]] standard, are available through the [[http://universaldependencies.org/treebanks/cs_fictree/index.html|UD web page]] (for non-commercial use only).
  
-We wish to thank the participants in the annotation effort, including Milena Hnátková, Ivana Klímová, Alena Kropíková, Hana Skoumalová and Olga Zitová. +===== 1. A CNC corpus in the KonText interface ===== 
 + 
 +The FicTree corpus is available in the same way as other CNC corpora through the [[en:manualy:kontext:index|KonText]] interface. 
 + 
 +The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes [[seznamy:tagy|tag]] and [[en:pojmy:lemma|lemma]]; additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes **pos** and **case**. 
 + 
 +The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus): 
 +  * afun – syntactic function according to the a-layer PDT annotation 
 +  * parent – relative position of the governing token 
 +  * eparent – relative position of the nearest governing content word 
 +  * prep – lemma of a preposition governing the token (if any) 
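 +  * p_lemma, p_tag, ep_lemma, ep_tag – lemma and tag of the governing token (p_) or of the nearest governing content word (ep_) 
 +  * p_pos, p_case, ep_pos, ep_case – POS and case of the governing token (p_) or of the nearest governing content word (ep_) 
 +  * p_afun, ep_afun – syntactic function of the governing token (p_) or of the nearest governing content word (ep_) 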
 + 
 +===== 2. Data annotated according to PDT a-layer ===== 
 + 
 +The data of the FicTree treebank, annotated according to the PDT a-layer guidelines, are available through the  
 +[[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository in a ''vertical'' format (tab-separated values), with sentence boundaries marked by empty lines. Each word is written on a single line, followed by five tab-separated attributes: **lemma**, **tag**, **ID** (number indicating the position of the token in the sentence), **head** (ID of the governing token) and **afun** (syntactic function according to [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT]]). The texts are divided into segments of at most 100 tokens (respecting sentence boundaries), and the segments are randomly shuffled. Each segment constitutes a single file. The file name refers to the literary work, and its prefix indicates the intended partition of the data into train, dev and test sets (80% - 10% - 10%). 
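
The vertical format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name ''parse_vertical'', the field names and the sample sentences are made up for illustration, the tag column holds a placeholder rather than a real PDT tag, and the use of head 0 for the sentence root is an assumption that should be checked against the actual data.

```python
from typing import Dict, List

# Column order as described in the text: the word form followed by five attributes.
FIELDS = ("word", "lemma", "tag", "id", "head", "afun")


def parse_vertical(text: str) -> List[List[Dict[str, object]]]:
    """Split vertical text into sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, object]]] = []
    current: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}: {line!r}")
        token: Dict[str, object] = dict(zip(FIELDS, columns))
        token["id"] = int(columns[3])
        token["head"] = int(columns[4])
        current.append(token)
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences


# Invented sample data; "TAG" is a placeholder, and head 0 as root is an assumption.
SAMPLE = (
    "Pes\tpes\tTAG\t1\t2\tSb\n"
    "spí\tspát\tTAG\t2\t0\tPred\n"
    "\n"
    "Ano\tano\tTAG\t1\t0\tExD\n"
)

sentences = parse_vertical(SAMPLE)
print(len(sentences), [token["word"] for token in sentences[0]])
```

Keeping **ID** and **head** as integers makes it straightforward to look up the governing token of any word within its sentence.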
 + 
 +===== 3. Data annotated in the Universal Dependencies standard ===== 
 + 
 +The morphological and syntactic annotation according to the UD guidelines was performed by converting the original PDT annotation. The conversion procedure was designed by Dan Zeman and implemented in [[https://github.com/ufal/treex|Treex]]. 
 +The data are available on the [[http://universaldependencies.org/treebanks/cs_fictree/index.html|Universal Dependencies]] web page in the [[http://universaldependencies.org/format.html|CoNLL-U format]]. The original texts are divided into segments of at most 100 tokens; the segments are shuffled and divided into train, val and test data sets. The FicTree treebank in the UD standard is also accessible using the query tool [[https://lindat.mff.cuni.cz/services/pmltq/|PML-TQ]]. 
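
For the UD release, a CoNLL-U file can be read without any external library. The sketch below relies only on the general CoNLL-U conventions (ten tab-separated columns, comment lines starting with ''#'', empty lines between sentences); the function name ''read_conllu'' and the sample sentence are invented for illustration and are not taken from the FicTree data. Multiword token ranges and empty nodes are skipped so that only syntactic words remain.

```python
from typing import Dict, List


def read_conllu(text: str) -> List[List[Dict[str, object]]]:
    """Return sentences as lists of word dictionaries (a subset of the ten columns)."""
    sentences: List[List[Dict[str, object]]] = []
    current: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # multiword token range ("1-2") or empty node ("1.1")
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Invented sample sentence in CoNLL-U layout, for illustration only.
UD_SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspát\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

ud_sentences = read_conllu(UD_SAMPLE)
print([(word["form"], word["deprel"]) for word in ud_sentences[0]])
```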
 + 
 +===== Acknowledgments ===== 
 +We wish to thank the human annotators: Ivana Klímová, Alena Kropíková and Olga Zitová; as well as Dan Zeman for the data conversion. 
 + 
 +===== How to cite FicTree ===== 
 +<WRAP round tip 70%> 
 +Jelínek, T. – Hnátková, M. – Skoumalová, H.: FicTree: manuálně syntakticky anotovaný korpus české beletrie. Ústav Českého národního korpusu FF UK, Praha 2017. Available at: http://www.korpus.cz 
 + 
 +Jelínek, T.: FicTree: a Manually Annotated Treebank of Czech Fiction. In: J. Hlaváčová (Ed.): //ITAT 2017 Proceedings//, pp. 181–185. http://ceur-ws.org/Vol-1885/181.pdf 
 +</WRAP>