Differences
This shows you the differences between two versions of the page.
en:cnk:syn2015 [2016/06/30 13:58] – vaclavcvrcek | en:cnk:syn2015 [2020/09/01 17:33] – [Text classification] michalkren | ||
---|---|---|---|
Line 1: | Line 1: | ||
~~NOTOC~~ | ~~NOTOC~~ | ||
====== Corpus SYN2015 ====== | ====== Corpus SYN2015 ====== | ||
+ | |||
+ | SYN2015 is a representative corpus of contemporary written Czech published in December 2015. It is a sequel to the representative corpora of the SYN series ([[en: | ||
+ | |||
<WRAP right 35%> | <WRAP right 35%> | ||
Line 16: | Line 19: | ||
^ ::: ^ Publication year | 2015 | | ^ ::: ^ Publication year | 2015 | | ||
</ | </ | ||
+ | ===== Changes compared to other SYN series corpora ===== | ||
- | SYN2015 is a representative corpus | + | ==== The concept |
- | + | ||
- | Approach adopted to **representativeness** differs from previous corpora of the SYN-series. SYN2015 is designed to contain a large number of texts in order to cover vast majority of varieties the corpus aims to represent. This corresponds to Biber' | + | |
SYN2015 is designed as a representation of **contemporary printed language** of the last five-year period, i.e. 2010–2014. As the boundaries of synchronicity vary across registers, the following criteria for including individual texts in SYN2015 have been adopted (based on the three top-level categories, cf. below): | SYN2015 is designed as a representation of **contemporary printed language** of the last five-year period, i.e. 2010–2014. As the boundaries of synchronicity vary across registers, the following criteria for including individual texts in SYN2015 have been adopted (based on the three top-level categories, cf. below): | ||
Line 25: | Line 27: | ||
* non-fiction: | * non-fiction: | ||
* newspapers and magazines: publication date within the given five-year period. | * newspapers and magazines: publication date within the given five-year period. | ||
+ | ==== Representativeness in SYN2015 ==== | ||
+ | |||
+ | The approach to **representativeness** differs from that of previous corpora of the SYN series. SYN2015 contains a broad spectrum of text types in order to cover the vast majority of the varieties the corpus aims to represent. This corresponds to Biber' | ||
+ | |||
+ | ==== Text classification ==== | ||
The original **text classification** scheme of the SYN series has been updated and revised; both the original and the revised classifications are based on text-external criteria that reflect the predominant function of a text. The revision has been made with a view to comparability with the original scheme, with the most significant change made to the sub-classification of non-fiction adopted from the [[http:// | The original **text classification** scheme of the SYN series has been updated and revised; both the original and the revised classifications are based on text-external criteria that reflect the predominant function of a text. The revision has been made with a view to comparability with the original scheme, with the most significant change made to the sub-classification of non-fiction adopted from the [[http:// | ||
+ | |||
+ | ^ Txtype_group ^ Portion ^ | ||
+ | | FIC: fiction | 33,33 % | | ||
+ | | NFC: non-fiction | 33,33 % | | ||
+ | | NMG: newspapers and magazines | 33,33 % | | ||
+ | |||
+ | [{{: | ||
+ | [{{: | ||
+ | |||
+ | |||
+ | <WRAP clear></ | ||
+ | |||
+ | In line with its predecessors, | ||
+ | |||
+ | In addition to the text type and genre, the metadata related to the text classification and available for every document also include medium (book, journal, textbook etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/ | ||
+ | |||
+ | A more detailed description of the text types contained within the macro groups: | ||
^ txtype | ^ txtype | ||
Line 36: | Line 60: | ||
| X | | other | 0,33 % | | | X | | other | 0,33 % | | ||
| **Non-fiction** (NFC) ||| 33,33 % | | | **Non-fiction** (NFC) ||| 33,33 % | | ||
- | | SCI/PRO/POP | HUM | humanities | 7 % | | + | | SCI (scientific)\\ \\ PRO (professional)\\ \\ POP (popular) |
| ::: | SSC | social sciences | 7 % | | | ::: | SSC | social sciences | 7 % | | ||
| ::: | NAT | natural sciences | 7 % | | | ::: | NAT | natural sciences | 7 % | | ||
Line 49: | Line 73: | ||
| LEI | | leisure magazines | 13,33 % | | | LEI | | leisure magazines | 13,33 % | | ||
- | In line with its predecessors, | + | Detailed information about the text classification |
- | Next to the text type and genre, metadata related to the text classification and available | + | ==== Concept of synchronicity ==== |
+ | |||
+ | We are working under the assumption that a [[en: | ||
+ | |||
+ | * for fiction it is 25 + 75, i.e. the time elapsed since the first publication is less than 75 years (approximately three living generations) and the given edition of the text being added to the corpus is no older than 25 years (ensuring reception in the present), | ||
+ | * for non-fiction texts the first edition must be no older than 25 years, | ||
+ | * the boundaries for the synchrony of newspapers | ||
+ | |||
+ | The resulting makeup | ||
+ | |||
+ | | ||
+ | |||
+ | ==== Positional annotation and tagging ==== | ||
+ | |||
+ | Compared to previous versions there have been improvements in [[en: | ||
- | [{{: | ||
- | [{{: | ||
- | [{{: | ||
====== How to cite SYN2015 ====== | ====== How to cite SYN2015 ====== | ||
Line 65: | Line 100: | ||
Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, | Křen, M. – Cvrček, V. – Čapka, T. – Čermáková, | ||
+ | </ | ||
+ | |||
+ | ====== Related links ====== | ||
+ | |||
+ | <WRAP round box 49%> | ||
+ | [[en: | ||
</ | </ | ||
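The inclusion criteria listed under "Concept of synchronicity" can be read as a simple per-text check. The following is only a minimal sketch of that reading, not the procedure actually used to build the corpus. The `Text` class, the `is_synchronous` function and the reference year 2015 are assumptions made for illustration; the page does not say against which year the 25- and 75-year limits are measured. The newspaper and magazine rule is taken from the five-year period 2010–2014 stated earlier on the page.

```python
from dataclasses import dataclass

# Assumption: ages are measured against the publication year of the corpus.
# The source page does not state the reference year explicitly.
REFERENCE_YEAR = 2015

# Five-year period named in the source for newspapers and magazines.
PERIOD_START, PERIOD_END = 2010, 2014


@dataclass
class Text:
    """Hypothetical record for one candidate text."""
    group: str            # "FIC", "NFC" or "NMG"
    first_published: int  # year of first publication
    edition_year: int     # year of the edition (issue) considered for inclusion


def is_synchronous(text: Text, reference_year: int = REFERENCE_YEAR) -> bool:
    """Return True if the text meets the criterion for its top-level group."""
    if text.group == "FIC":
        # 25 + 75 rule: first publication less than 75 years ago,
        # and the given edition no older than 25 years.
        return (reference_year - text.first_published < 75
                and reference_year - text.edition_year <= 25)
    if text.group == "NFC":
        # First edition no older than 25 years.
        return reference_year - text.first_published <= 25
    if text.group == "NMG":
        # Publication date within the given five-year period.
        return PERIOD_START <= text.edition_year <= PERIOD_END
    raise ValueError(f"unknown text group: {text.group!r}")
```

For example, under these assumptions a novel first published in 1950 and reissued in 2000 would qualify, while a non-fiction title first published in 1985 would not.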