| en:cnk:syn [2017/12/18 18:53] – [Corpus SYN] michalkren | en:cnk:syn [2026/01/23 10:07] (current) – michalkren |
|---|
| ~~NOTOC~~ | ~~NOTOC~~ |
| |
| ====== Corpus SYN ====== | ====== SYN corpus ====== |
| |
| The **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up to the given version of the SYN corpus (for example, [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). The texts have been processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. | **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up to the given version of the SYN corpus (for example, [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). The texts have been processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. |
| |
| The SYN corpus is not representative, as the vast majority of the texts belong to the category of newspapers and magazines, which is due to their easy accessibility. | The SYN corpus is not representative, as the vast majority of the texts belong to the category of newspapers and magazines, which is due to their easy accessibility. |
| ^ <fs medium> SYN corpus versions</fs> ^^^^ | ^ <fs medium> SYN corpus versions</fs> ^^^^ |
| ^ version ^ year of publication ^ size (no. of words) ^ content ^ | ^ version ^ year of publication ^ size (no. of words) ^ content ^ |
| | ^ [[en:cnk:syn:verze14|SYN version 14]] | 2025 | 5.489G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], [[en:cnk:syn2025|SYN2025]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze13|SYN version 13]] | 2024 | 5.310G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze12|SYN version 12]] | 2023 | 5.175G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze11|SYN version 11]] | 2022 | 5.032G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze10|SYN version 10]] | 2022 | 4.882G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze9|SYN version 9]] | 2021 | 4.719G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze8|SYN version 8]] | 2019 | 4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| | ^ [[en:cnk:syn:verze7|SYN version 7]] | 2018 | 4.255G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze6|SYN version 6]] | 2017 | 4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze6|SYN version 6]] | 2017 | 4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | 3.836G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | 3.836G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ====== Advantages of the SYN corpus ====== | ====== Advantages of the SYN corpus ====== |
| |
| * access to extensive language data (more than 4.5 billion words) | * access to extensive language data (more than 5 billion words) |
| * possibility to search all the SYN-series corpora at the same time | * possibility to search all the SYN-series corpora at the same time |
| * possibility to create subcorpora that correspond to the original corpora | * possibility to create subcorpora that correspond to the original corpora |
| * re-processing of the original corpora by continuously improved tools | * re-processing of the original corpora by continuously improved tools |
| * referentiality, i.e. its individual versions are invariable entities that remain unchanged once published | * referentiality, i.e. its individual versions are invariable entities that remain unchanged once published |
| | |
| | ====== Disadvantage of the SYN corpus ====== |
| | * its size causes some operations to be too slow |
| | |
| |
| ====== How to cite SYN ====== | ====== How to cite SYN ====== |
| |
| |
| --- //Michal Křen, Olga Richterová// | --- //Michal Křen, Olga Richterová, Michal Škrabal// |
| | |
| ====== Related links ====== | |
| <WRAP round box 50%> | |
| [[en:cnk:syn:verze3|SYN version 3]] • [[en:cnk:syn:verze4|SYN version 4]] • [[en:cnk:syn2000|SYN2000]] • [[en:cnk:syn2005|SYN2005]] • [[en:cnk:syn2006pub|SYN2006PUB]] • [[en:cnk:syn2009pub|SYN2009PUB]] • [[en:cnk:syn2010|SYN2010]] • [[en:cnk:SYN2013PUB|SYN2013PUB]] • [[en:cnk:syn2015|SYN2015]] | |
| </WRAP> | |
| | |
| |