en:cnk:fictree [2017/12/15 15:16] – [The composition of the FicTree treebank] tomasjelinek | en:cnk:fictree [2017/12/18 09:40] – [1. A CNC corpus in the KonText interface] luciechlumska |
---|
| |
The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007. The texts in the treebank include six fiction titles, a children’s fiction book, and a book of memoirs. | The FicTree treebank consists of eight literary works published in the Czech Republic between 1991 and 2007. The texts in the treebank include six fiction titles, a children’s fiction book, and a book of memoirs. |
Most of the texts were first published between 1991 and 2007 except one text, published in 1969. | Most of the texts were first published between 1991 and 2007 except for one text, published in 1969. |
Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak). | Five texts (80% of all tokens) are original Czech texts; the other three are translations (from German and Slovak). |
| |
===== The syntactic annotation of the treebank ===== | ===== The syntactic annotation of the treebank ===== |
| |
The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer of the Prague Dependency Treebank - PDT ([[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT 2.0]] with revisions [[http://ufal.mff.cuni.cz/pdt2.5/cs/documentation.html|2.5]] and [[http://ufal.mff.cuni.cz/pdt3.0|3.0]]). The corpus was parsed using two parsers ([[https://sourceforge.net/projects/mstparser/|MST Parser]] and [[http://www.maltparser.org/|MaltParser]]) trained on the PDT a-layer training data. The results were manually corrected by annotators, then merged. Any differences between the two versions were resolved manually by another annotator. | The FicTree treebank was syntactically annotated according to the guidelines for the analytical layer of the Prague Dependency Treebank – PDT ([[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT 2.0]] with revisions [[http://ufal.mff.cuni.cz/pdt2.5/cs/documentation.html|2.5]] and [[http://ufal.mff.cuni.cz/pdt3.0|3.0]]). The corpus was parsed using two parsers ([[https://sourceforge.net/projects/mstparser/|MST Parser]] and [[http://www.maltparser.org/|MaltParser]]) trained on the PDT a-layer training data. The results were manually corrected by annotators, then merged. Any differences between the two versions were resolved manually by another annotator. |
| |
===== Access to the treebank ===== | ===== Access to the treebank ===== |
The FicTree treebank can be accessed in several ways: | The FicTree treebank can be accessed in several ways: |
- [[en:cnk:fictree#a_cnc_corpus_in_the_kontext_interface|A CNC corpus in the KonText interface]]: FicTree is available as a [[en:cnk:uvod|CNC corpus]] in the [[en:manualy:kontext:index|KonText]] interface. | - [[en:cnk:fictree#a_cnc_corpus_in_the_kontext_interface|A CNC corpus in the KonText interface]]: FicTree is available as a [[en:cnk:uvod|CNC corpus]] in the [[en:manualy:kontext:index|KonText]] interface. |
- [[en:cnk:fictree#data_annotated_according_to_pdt_a-layer|Data annotated according to PDT a-layer]]: the data of the FicTree treebank annotated according to the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT a-layer guidelines]] are available to download from the [[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository for non-commercial use. | - [[en:cnk:fictree#data_annotated_according_to_pdt_a-layer|Data annotated according to PDT a-layer]]: the data of the FicTree treebank annotated according to the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/cz/a-layer/html/index.html|PDT a-layer guidelines]] are available for download from the [[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2517|LINDAT/CLARIN]] repository for non-commercial use. |
- [[en:cnk:fictree#data_annotated_in_the_Universal_Dependencies_standard|Data annotated in the Universal Dependencies standard]]: the data of the FicTree treebank annotated according to the [[http://universaldependencies.org/|Universal Dependencies]] standard into which it was automatically converted (for non-commercial use only). | - [[en:cnk:fictree#data_annotated_in_the_Universal_Dependencies_standard|Data annotated in the Universal Dependencies standard]]: the data of the FicTree treebank annotated according to the [[http://universaldependencies.org/|Universal Dependencies]] standard into which it was automatically converted are available through the [[http://universaldependencies.org/treebanks/cs_fictree/index.html|UD web page]] (for non-commercial use only). |
| |
===== 1. A CNC corpus in the KonText interface ===== | ===== 1. A CNC corpus in the KonText interface ===== |
The FicTree corpus is available in the same way as other CNC corpora through the [[en:manualy:kontext:index|KonText]] interface. | The FicTree corpus is available in the same way as other CNC corpora through the [[en:manualy:kontext:index|KonText]] interface. |
| |
The corpus annotation is accessible through a wide range of attributes of each token. The morphologic and annotation and lemmatization is available using the attributes [[seznamy:tagy|tag]] and [[en:pojmy:lemma|lemma]], additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes **pos** and **case**. | The corpus annotation is accessible through a wide range of attributes for each token. The morphological annotation and lemmatization are available using the attributes [[seznamy:tagy|tag]] and [[en:pojmy:lemma|lemma]]; additionally, the information about the POS and nominal case (if applicable) of all tokens is accessible using the attributes **pos** and **case**. |
| |
The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the corpus SYN2015): | The syntactic annotation of FicTree can be accessed using several positional attributes (the same as in the SYN2015 corpus): |
* afun – syntactic function according to the a-layer PDT annotation | * afun – syntactic function according to the a-layer PDT annotation |
* parent – relative position of the governing token | * parent – relative position of the governing token |
* eparent – relative position of the nearest governing content word | * eparent – relative position of the nearest governing content word |
* prep – lemma of a preposition governing the token (if any) | * prep – lemma of a preposition governing the token (if any) |
* p_lemma, p_tag, ep_lemma, ep_tag – tag a lemma of the governing token | * p_lemma, p_tag, ep_lemma, ep_tag – tag and lemma of the governing token |
* p_pos, p_case, ep_pos, ep_case – POS and case of the governing token | * p_pos, p_case, ep_pos, ep_case – POS and case of the governing token |
* p_afun, ep_afun – syntactic function of the governing token | * p_afun, ep_afun – syntactic function of the governing token |
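
The relative-position attributes listed above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that `parent` holds a signed token offset (negative when the governing token is to the left) and that 0 marks a token with no governing token in the sentence; the actual encoding in the corpus may differ. The `Token` class, the helper functions, and the three-word example sentence are hypothetical and are not part of the treebank or of KonText.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    word: str
    lemma: str
    afun: str
    # Assumption: signed offset to the governing token; 0 = no governing token.
    parent: int


def governor_index(tokens: List[Token], i: int) -> Optional[int]:
    """Return the index of the token governing tokens[i], or None if there is none."""
    offset = tokens[i].parent
    if offset == 0:
        return None
    j = i + offset
    if not 0 <= j < len(tokens):
        raise ValueError(f"offset {offset} of token {i} points outside the sentence")
    return j


def with_parent_attributes(tokens: List[Token]) -> List[Dict[str, Optional[str]]]:
    """Derive p_lemma and p_afun for each token from the relative parent position."""
    rows = []
    for i, tok in enumerate(tokens):
        g = governor_index(tokens, i)
        rows.append({
            "word": tok.word,
            "afun": tok.afun,
            "p_lemma": tokens[g].lemma if g is not None else None,
            "p_afun": tokens[g].afun if g is not None else None,
        })
    return rows


# Hypothetical example sentence: "Petr čte knihu" ("Petr reads a book").
SENTENCE = [
    Token("Petr", "Petr", "Sb", +1),     # subject, governed by the next token
    Token("čte", "číst", "Pred", 0),     # predicate, no governing token here
    Token("knihu", "kniha", "Obj", -1),  # object, governed by the previous token
]

if __name__ == "__main__":
    for row in with_parent_attributes(SENTENCE):
        print(row)
```

In this sketch, `p_lemma` and `p_afun` are simply looked up at the position that `parent` points to. The same lookup applied to `eparent` would give the `ep_` attributes, with function words skipped when the nearest governing content word is sought.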