Differences
This shows you the differences between two versions of the page.
en:cnk:syn2020 [2020/12/22 08:54] – [SYN2020 Corpus] michalskrabal | en:cnk:syn2020 [2021/01/21 09:14] – [Multiple lemmatization and tagging (aggregate)] tomasjelinek | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== SYN2020 Corpus ====== | ====== SYN2020 Corpus ====== | ||
- | The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words (about 122 million tokens when punctuation is counted). It is a sequel to the representative corpora of the SYN series ([[en: | + | The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words (about 122 million tokens when punctuation is counted). It is a sequel to the representative corpora of the SYN series ([[en: |
- | <WRAP right 35%> | + | <WRAP round tip 70%> |
+ | The design of SYN2020, its composition, | ||
+ | </ | ||
+ | |||
+ | <WRAP right 45%> | ||
^ <fs medium> | ^ <fs medium> | ||
^ Positions ^ Number of positions (tokens) | 121 826 797 | | ^ Positions ^ Number of positions (tokens) | 121 826 797 | | ||
^ ::: ^ Number of positions (excl. punctuation) | 100 031 037 | | ^ ::: ^ Number of positions (excl. punctuation) | 100 031 037 | | ||
- | ^ ::: ^ Number of word forms | 1 751 599 | | + | ^ ::: ^ Number of word forms | 1 701 465 | |
- | ^ ::: ^ Number of lemmas | | + | ^ ::: ^ Number of lemmas | |
^ Structures ^ Number of documents <doc> | 3 910 | | ^ Structures ^ Number of documents <doc> | 3 910 | | ||
^ ::: ^ Number of texts < | ^ ::: ^ Number of texts < | ||
Line 17: | Line 21: | ||
^ ::: ^ Publication year | 2020 | | ^ ::: ^ Publication year | 2020 | | ||
</ | </ | ||
- | ===== Changes | + | |
+ | ====== Composition of SYN2020 ====== | ||
+ | |||
+ | ==== Representativeness ==== | ||
+ | |||
+ | SYN2020 contains a broad spectrum of text types in order to cover the vast majority of the varieties the corpus aims to represent. This corresponds to Biber’s notion of representativeness in terms of texts as products. The corpus is designed as representative, | ||
+ | |||
+ | ==== Text classification ==== | ||
+ | |||
+ | The classification of texts in SYN2020 is based on external, non-text criteria and is hierarchical. The highest level is determined by the three already mentioned text macrotypes ('' | ||
+ | |||
+ | ^ Txtype_group ^ Portion ^ | ||
+ | | FIC: fiction | 33,33 % | | ||
+ | | NFC: non-fiction | 33,33 % | | ||
+ | | NMG: newspapers and magazines | 33,33 % | | ||
+ | |||
+ | In line with its predecessors, | ||
+ | |||
+ | In addition to text type and genre, the classification metadata available for every document also include the medium (book, journal, textbook, etc.), periodicity (daily, weekly, monthly, less than monthly, non-periodical) and audience (general, children/ | ||
+ | |||
+ | A more detailed description of the text types contained within the macrogroups: | ||
+ | |||
+ | ^ txtype | ||
+ | | **Fiction** (FIC) ||| 33,33 % | | ||
+ | | NOV | | novels | 26 % | | ||
+ | | COL | | short stories | 5 % | | ||
+ | | VER | | poetry | 1 % | | ||
+ | | SCR | | drama, screenplays | 1 % | | ||
+ | | X | | other | 0,33 % | | ||
+ | | **Non-fiction** (NFC) ||| 33,33 % | | ||
+ | | SCI (scientific)\\ \\ PRO (professional)\\ \\ POP (popular) | HUM | humanities | 7 % | | ||
+ | | ::: | SSC | social sciences | 7 % | | ||
+ | | ::: | NAT | natural sciences | 7 % | | ||
+ | | ::: | FTS | technical sciences | 7 % | | ||
+ | | ::: | ITD | interdisciplinary | 1 % | | ||
+ | | MEM | | memoirs, autobiographies | 4 % | | ||
+ | | ADM | | administrative texts | 0,33 % | | ||
+ | | **Newspapers and magazines** (NMG) ||| 33,33 % | | ||
+ | | NEW | NTW | nationwide newspapers – selected titles (MF, LN, HN, Právo) | 10 % | | ||
+ | | ::: | NTW | nationwide newspapers – other | 5 % | | ||
+ | | ::: | REG | regional newspapers | 5 % | | ||
+ | | LEI | | leisure magazines | 13,33 % | | ||
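
The shares in the table can be checked for internal consistency with a short script. This is only an illustrative sketch, not part of the corpus tooling: the dictionary layout and the function name ''macrotype_totals'' are assumptions made for the example, and the decimal commas of the table are written as decimal points.

```python
from decimal import Decimal

# Percent shares copied from the table above.
# The keys "NTW (selected)" and "NTW (other)" are labels invented for this sketch.
SHARES = {
    "FIC": {"NOV": "26", "COL": "5", "VER": "1", "SCR": "1", "X": "0.33"},
    "NFC": {"HUM": "7", "SSC": "7", "NAT": "7", "FTS": "7",
            "ITD": "1", "MEM": "4", "ADM": "0.33"},
    "NMG": {"NTW (selected)": "10", "NTW (other)": "5",
            "REG": "5", "LEI": "13.33"},
}


def macrotype_totals(shares):
    """Sum the subtype shares of each macrotype with exact decimal arithmetic."""
    return {
        group: sum(Decimal(value) for value in parts.values())
        for group, parts in shares.items()
    }


totals = macrotype_totals(SHARES)
for group, total in totals.items():
    print(group, total)
```

Each macrotype adds up to the 33.33 % stated in its header row; the three rounded thirds together give 99.99 %.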
+ | |||
+ | Detailed information about the text classification scheme is available [[https:// | ||
+ | |||
+ | ==== Concept of synchronicity ==== | ||
+ | |||
+ | We are working under the assumption that a [[en: | ||
+ | |||
+ | * for fiction, the rule is 25 + 75: less than 75 years have elapsed since the first publication (approximately three living generations), and the edition being added to the corpus is no older than 25 years (ensuring reception in the present), | ||
+ | * for non-fiction texts, the first edition must be no older than 25 years, | ||
+ | * for newspapers and magazines, the boundaries of synchrony remain unchanged: the text must have been published in the period mapped by the corpus (for SYN2020, between 2015 and 2019). | ||
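
The three criteria above can be expressed as a small predicate. The following is a minimal sketch under stated assumptions, not the procedure actually used to build the corpus: the function name ''is_synchronous'', its parameters, and the choice of 2020 as the year from which ages are counted are all hypothetical.

```python
REFERENCE_YEAR = 2020      # assumption: ages are counted from the corpus publication year
NMG_PERIOD = (2015, 2019)  # period mapped by SYN2020, as stated in the list above


def is_synchronous(txtype_group, edition_year, first_published=None,
                   reference_year=REFERENCE_YEAR):
    """Return True if a text meets the synchronicity criterion of its macrotype.

    edition_year    -- year of the edition (or newspaper issue) added to the corpus
    first_published -- year of first publication; defaults to edition_year
    """
    first = edition_year if first_published is None else first_published
    if txtype_group == "FIC":
        # 25 + 75 rule: first publication less than 75 years ago,
        # and the edition in the corpus no older than 25 years.
        return (reference_year - first < 75
                and reference_year - edition_year <= 25)
    if txtype_group == "NFC":
        # First publication no older than 25 years.
        return reference_year - first <= 25
    if txtype_group == "NMG":
        # Published within the period mapped by the corpus.
        return NMG_PERIOD[0] <= edition_year <= NMG_PERIOD[1]
    raise ValueError(f"unknown txtype_group: {txtype_group!r}")


# A novel first published in 1950 and reissued in 2010 qualifies ...
print(is_synchronous("FIC", 2010, first_published=1950))
# ... while the same novel in a 1990 edition does not.
print(is_synchronous("FIC", 1990, first_published=1950))
```

Whether the 25-year and 75-year limits are inclusive at the boundary is not specified in the text; the sketch reads "less than 75" as strict and "no older than 25" as inclusive.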
+ | |||
+ | ===== Annotation of SYN2020: changes compared | ||
==== Tokenization ==== | ==== Tokenization ==== | ||
Line 23: | Line 80: | ||
In the existing corpora of the SYN series, almost all combinations of alphabetic characters, numeric characters and punctuation marks that were written in the original texts without a space have so far been treated as a single token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, | In the existing corpora of the SYN series, almost all combinations of alphabetic characters, numeric characters and punctuation marks that were written in the original texts without a space have so far been treated as a single token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, | ||
- | In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified | + | In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified |
==== Lemmatization ==== | ==== Lemmatization ==== | ||
Line 51: | Line 108: | ||
==== Multiple lemmatization and tagging (aggregate) ==== | ==== Multiple lemmatization and tagging (aggregate) ==== | ||
- | In the SYN2020 corpus, **multiple lemmas and tags** for a special group of words, so-called **aggregates**, | + | In the SYN2020 corpus, **multiple lemmas and tags** for a special group of words, so-called **aggregates** |
+ | |||
+ | ==== Automatic corpus annotation ==== | ||
+ | For SYN2020, the entire annotation process is automatic. A detailed description, including the annotation accuracy and an extensive bibliography for both the tools and the data, can be found on a [[cnk: | ||
+ | |||
+ | ====== How to cite SYN2020 ====== | ||
+ | <WRAP round tip 70%> | ||
+ | Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, | ||
+ | </ |