en:cnk:syn2015 [2016/12/11 10:35] – [Poziční anotace a značkování] veronikapojarova | en:cnk:syn2015 [2020/09/01 17:46] (current) – [Positional annotation and tagging] michalkren |
---|
==== Text classification ==== | ==== Text classification ==== |
| |
The original **text classification** scheme of the SYN series has been updated and revised; both the original and the revised classifications are based on text-external criteria that reflect the predominant function of a text. The revision preserves comparability with the original scheme; the most significant changes are the sub-classification of non-fiction, adopted from the [[http://www.en.nkp.cz|Czech National Library]], and a more detailed classification of newspaper texts. | The original **text classification** scheme of the SYN series has been [[https://wiki.korpus.cz/doku.php/en:cnk:klasifikace_textu_syn2015|updated and revised]]; both the original and the revised classifications are based on text-external criteria that reflect the predominant function of a text. The revision preserves comparability with the original scheme; the most significant changes are the sub-classification of non-fiction, adopted from the [[http://www.en.nkp.cz|Czech National Library]], and a more detailed classification of newspaper texts. |
| |
^ Txtype_group ^ Portion ^ | ^ Txtype_group ^ Portion ^ |
| |
| |
---- | <WRAP clear></WRAP> |
| |
In line with its predecessors, SYN2015 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2015 are set arbitrarily, yet close to the original figures. | In line with its predecessors, SYN2015 contains a large variety of texts from various publishers within the given classification category. A category is defined by a combination of two variables: text type and genre. Proportions of the particular categories in SYN2015 are set arbitrarily, yet close to the original figures. |
| ::: | REG | regional newspapers | 5 % | | | ::: | REG | regional newspapers | 5 % | |
| LEI | | leisure magazines | 13.33 % | | | LEI | | leisure magazines | 13.33 % | |
| |
| Detailed information about the text classification scheme is available [[https://wiki.korpus.cz/doku.php/en:cnk:klasifikace_textu_syn2015|here]]. |
| |
==== Concept of synchronicity ==== | ==== Concept of synchronicity ==== |
Compared to previous versions, [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:morfologicka_analyza|morphological tagging]] have been improved. Both are almost identical to the processes used for the corpus [[en:cnk:syn2013pub|SYN2013PUB]]; however, SYN2015 was processed using the newest versions of all the tools (the improvements relate both to the morphological dictionary and to the rule-based [[en:pojmy:desambiguace|disambiguation]]). Furthermore, the lemmatization of punctuation marks has changed, preserving the form of the characters as much as possible. | Compared to previous versions, [[en:pojmy:lemma|lemmatization]] and [[en:pojmy:morfologicka_analyza|morphological tagging]] have been improved. Both are almost identical to the processes used for the corpus [[en:cnk:syn2013pub|SYN2013PUB]]; however, SYN2015 was processed using the newest versions of all the tools (the improvements relate both to the morphological dictionary and to the rule-based [[en:pojmy:desambiguace|disambiguation]]). Furthermore, the lemmatization of punctuation marks has changed, preserving the form of the characters as much as possible. |
| |
==== Corpus structure and structural tags ==== | Last but not least, SYN2015 is the first CNC corpus featuring **[[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_analyza|syntactic annotation]]**. |
| |
The structure of previous corpora in the SYN series mostly followed the hierarchy ''<opus>'' – ''<doc>'' – ''<s>'' (i.e. a complete text or collection of texts – section or chapter – sentence). In SYN2015 this hierarchy has been changed and extended. In line with international convention, the highest [[pojmy:atributy_strukturni|structural unit]] is the document ''<doc>'', which consists of one or more texts ''<text>'' (articles in a periodical, chapters in a book, or other meaningful segments). Texts are further divided into paragraphs ''<p>'' and sentences ''<s>''. Each of these structures is characterized by specific attributes, which are listed in the following table. Besides these hierarchical structures, the corpus also records the structures ''<hi>'' (highlighting and typeface styles) and ''<lb>'' (verse boundary in poetry). | |
| |
^ ''<doc>'' ^ Note ^ ''<text>'' ^ Note ^ ''<p>'' ^ Note ^ ''<s>'' ^ Note ^ | |
| title | title of the document or periodical | [[seznamy:section|section]] | generated section type (for selected periodicals) | type | ordinary paragraph/heading | id | unique identifier | | |
| subtitle | subtitle | [[seznamy:section|section_orig]] | original section name (for selected periodicals) | id | unique identifier | | | | |
| author | author of the document | author | author of the article (for selected periodicals) | | | | | | |
| issue | issue (for periodicals) | id | unique identifier | | | | | | |
| publisher | publisher | | | | | | | | |
| pubplace | place of publishing | | | | | | | | |
| pubyear | year published | | | | | | | | |
| first_published | year of 1st publication | | | | | | | | |
| translator | translator | | | | | | | | |
| [[seznamy:srclang|srclang]] | source language | | | | | | | | |
| [[seznamy:authsex-transsex|authsex]] | sex of the author | | | | | | | | |
| [[seznamy:authsex-transsex|transsex]] | sex of the translator | | | | | | | | |
| [[seznamy:txtype_group|txtype_group]] | text type group | | | | | | | | |
| [[seznamy:txtype|txtype]] | text type | | | | | | | | |
| [[seznamy:genre_group|genre_group]] | genre group | | | | | | | | |
| [[seznamy:genre|genre]] | subject area | | | | | | | | |
| [[seznamy:med|medium]] | medium | | | | | | | | |
| [[seznamy:periodicity|periodicity]] | periodicity | | | | | | | | |
| [[seznamy:audience|audience]] | audience | | | | | | | | |
| isbnissn | ISBN/ISSN | | | | | | | | |
| biblio | generated bibliographic record | | | | | | | | |
| id | unique identifier | | | | | | | | |
| |
====== How to cite SYN2015 ====== | ====== How to cite SYN2015 ====== |
| |
Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A. (2016): [[http://www.lrec-conf.org/proceedings/lrec2016/pdf/186_Paper.pdf|SYN2015: Representative Corpus of Contemporary Written Czech]]. In: //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)//, 2522–2528. Portorož: ELRA. ISBN 978-2-9517408-9-1. | Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, A. – Hnátková, M. – Chlumská, L. – Jelínek, T. – Kováříková, D. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Škrabal, M. – Truneček, P. – Vondřička, P. – Zasina, A. (2016): [[http://www.lrec-conf.org/proceedings/lrec2016/pdf/186_Paper.pdf|SYN2015: Representative Corpus of Contemporary Written Czech]]. In: //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)//, 2522–2528. Portorož: ELRA. ISBN 978-2-9517408-9-1. |
| </WRAP> |
| |
| ====== Related links ====== |
| |
| <WRAP round box 49%> |
| [[en:cnk:syn|SYN]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:syn2013PUB|SYN2013PUB]] |
</WRAP> | </WRAP> |
| |