en:cnk:syn [2017/04/25 18:56] – [Reference corpora as subcorpora in SYN] michalkren | en:cnk:syn [2023/12/29 12:21] (current) – michalkren |
---|
~~NOTOC~~ | ~~NOTOC~~ |
| |
====== Corpus SYN ====== | ====== SYN corpus ====== |
| |
The **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up to the given version of the SYN corpus (for example, [[en:cnk:syn:verze3|SYN version 3]] from 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). All texts are processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. | **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni|synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up to the given version of the SYN corpus (for example, [[en:cnk:syn:verze3|SYN version 3]] from 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). All texts are processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. |
| |
The SYN corpus is not representative, as the vast majority of the texts belong to the category of newspapers and magazines, which is due to their easy accessibility. | The SYN corpus is not representative, as the vast majority of the texts belong to the category of newspapers and magazines, which is due to their easy accessibility. |
^ <fs medium> SYN corpus versions</fs> ^^^^ | ^ <fs medium> SYN corpus versions</fs> ^^^^ |
^ version ^ year of publication ^ size (no. of words) ^ content ^ | ^ version ^ year of publication ^ size (no. of words) ^ content ^ |
^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | FIXME 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze12|SYN version 12]] | 2023 | 5.175G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze11|SYN version 11]] | 2022 | 5.032G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze10|SYN version 10]] | 2022 | 4.882G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze9|SYN version 9]] | 2021 | 4.719G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze8|SYN version 8]] | 2019 | 4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze7|SYN version 7]] | 2018 | 4.255G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze6|SYN version 6]] | 2017 | 4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | 3.836G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
^ [[en:cnk:syn:verze4|SYN version 4]] | 2016 | 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze4|SYN version 4]] | 2016 | 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
^ [[en:cnk:syn:verze3|SYN version 3]] | 2014 | 2.232G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]] | | ^ [[en:cnk:syn:verze3|SYN version 3]] | 2014 | 2.232G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]] | |
====== Advantages of the SYN corpus ====== | ====== Advantages of the SYN corpus ====== |
| |
* access to extensive language data (more than 4.5 billion words) | * access to extensive language data (more than 5 billion words) |
* possibility to search all the SYN-series corpora at the same time | * possibility to search all the SYN-series corpora at the same time |
* possibility to create subcorpora that correspond to the original corpora | * possibility to create subcorpora that correspond to the original corpora |
* re-processing of the original corpora by continuously improved tools | * re-processing of the original corpora by continuously improved tools |
* referentiality, i.e. every version is an invariable entity ensuring that identical queries always give identical results | * referentiality, i.e. its individual versions are invariable entities that remain unchanged once published |
| |
====== How to cite SYN ====== | ====== How to cite SYN ====== |
| |
| |
--- //Michal Křen, Olga Richterová// | --- //Michal Křen, Olga Richterová, Michal Škrabal// |
| |
====== Related links ====== | ====== Related links ====== |