Differences
This shows you the differences between two versions of the page.
en:cnk:syn2020 [2020/12/22 09:43] – michalskrabal | en:cnk:syn2020 [2020/12/27 12:15] – [Annotation of SYN2020: changes with respect to other corpora of the SYN series] michalkren | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== SYN2020 Corpus ====== | ====== SYN2020 Corpus ====== | ||
- | The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel of the representative corpora of the SYN series ([[en: | + | The SYN2020 corpus is a synchronous representative and reference corpus of contemporary written Czech, containing 100 million text words, including punctuation (tokens). It is a sequel of the representative corpora of the SYN series ([[en: |
+ | |||
+ | <WRAP round tip 70%> | ||
+ | The design of SYN2020, its composition, | ||
+ | </ | ||
<WRAP right 35%> | <WRAP right 35%> | ||
Line 7: | Line 11: | ||
^ Positions ^ Number of positions (tokens) | 121 826 797 | | ^ Positions ^ Number of positions (tokens) | 121 826 797 | | ||
^ ::: ^ Number of positions (excl. punctuation) | 100 031 037 | | ^ ::: ^ Number of positions (excl. punctuation) | 100 031 037 | | ||
- | ^ ::: ^ Number of word forms | 1 751 599 | | + | ^ ::: ^ Number of word forms | 1 701 465 | |
- | ^ ::: ^ Number of lemmas | | + | ^ ::: ^ Number of lemmas | |
^ Structures ^ Number of documents <doc> | 3 910 | | ^ Structures ^ Number of documents <doc> | 3 910 | | ||
^ ::: ^ Number of texts < | ^ ::: ^ Number of texts < | ||
Line 32: | Line 36: | ||
| NFC: non-fiction | 33.33 % | | | NFC: non-fiction | 33.33 % | | ||
| NMG: newspapers and magazines | 33.33 % | | | NMG: newspapers and magazines | 33.33 % | | ||
- | |||
- | [{{: | ||
- | [{{: | ||
- | |||
In line with its predecessors, | In line with its predecessors, | ||
Line 74: | Line 74: | ||
* the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period mapped by the corpus (for SYN2020, the period between 2015 and 2019). | * the boundaries for the synchrony of newspapers and magazines remain unchanged, i.e. the text must have been published in the period mapped by the corpus (for SYN2020, the period between 2015 and 2019). | ||
- | The resulting makeup of the corpus in no. of words over the years is summarized by the following graph. | + | ===== Annotation of SYN2020: changes compared |
- | + | ||
- | | + | |
- | + | ||
- | ===== Changes with respect | + | |
==== Tokenization ==== | ==== Tokenization ==== | ||
Line 84: | Line 80: | ||
In the existing corpora of the SYN series, almost all combinations of alphabetic characters, numeric characters and punctuation marks that were written in the original texts without a space have so far been considered one token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, | In the existing corpora of the SYN series, almost all combinations of alphabetic characters, numeric characters and punctuation marks that were written in the original texts without a space have so far been considered one token. Only punctuation marks at word boundaries (//řekl , že//) and some other combinations, | ||
- | In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified | + | In SYN2020, the approach is the opposite: numeric characters and punctuation marks are systematically identified |
==== Lemmatization ==== | ==== Lemmatization ==== | ||
Line 115: | Line 111: | ||
====== How to cite SYN2020 ====== | ====== How to cite SYN2020 ====== | ||
+ | <WRAP round tip 70%> | ||
+ | Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, | ||
+ | </ | ||
- | ====== Related links ====== | ||
- | |||
- | <WRAP round box 49%> | ||
- | [[en: | ||
- | </ |