en:cnk:syn [2017/04/25 18:55] – [The current SYN as a solution] michalkren | en:cnk:syn [2023/12/29 12:21] (current) – michalkren |
---|
~~NOTOC~~ | ~~NOTOC~~ |
| |
====== Corpus SYN ====== | ====== SYN corpus ====== |
| |
The **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). All of these texts have been re-processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. | **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table). All of these texts have been re-processed by the newest versions of the [[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools. |
| |
The SYN corpus is not representative, as the vast majority of the texts belongs to the category of newspapers and magazines, which is due to their easy accessibility. | The SYN corpus is not representative, as the vast majority of the texts belongs to the category of newspapers and magazines, which is due to their easy accessibility. |
^ <fs medium> SYN corpus versions</fs> ^^^^ | ^ <fs medium> SYN corpus versions</fs> ^^^^ |
^ version ^ year of publication ^ size (no. of words) ^ content ^ | ^ version ^ year of publication ^ size (no. of words) ^ content ^ |
^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | FIXME 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze12|SYN version 12]] | 2023 | 5.175G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze11|SYN version 11]] | 2022 | 5.032G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze10|SYN version 10]] | 2022 | 4.882G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze9|SYN version 9]] | 2021 | 4.719G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts | |
| ^ [[en:cnk:syn:verze8|SYN version 8]] | 2019 | 4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze7|SYN version 7]] | 2018 | 4.255G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze6|SYN version 6]] | 2017 | 4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
| ^ [[en:cnk:syn:verze5|SYN version 5]] | 2017 | 3.836G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
^ [[en:cnk:syn:verze4|SYN version 4]] | 2016 | 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | | ^ [[en:cnk:syn:verze4|SYN version 4]] | 2016 | 3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts | |
^ [[en:cnk:syn:verze3|SYN version 3]] | 2014 | 2.232G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]] | | ^ [[en:cnk:syn:verze3|SYN version 3]] | 2014 | 2.232G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]] | |
| |
==== Reference corpora as subcorpora in SYN ==== | ==== Reference corpora as subcorpora in SYN ==== |
The possibility of searching the revised texts of all the SYN-series corpora joined together is supplemented also by the possibility of creating subcorpora which have the same composition as the original corpora. This is enabled due to the attribute ''<doc syn>'', e.g. a subcorpus corresponding to SYN2005 can be created by applying the condition ''syn=<nowiki>"</nowiki>2005<nowiki>"</nowiki>'' on the structural attribute ''<doc>''. Of course, this condition may be further combined with other ones that specify required text type, publication date etc. More information can be found in the [[kurz:pokrocile_dotazy|manual]] (Czech only). **It is therefore possible to use corpus SYN also for work with older representative corpora re-processed by the latest corpus tools.** Naturally, there may be found differences between the original corpora and the corresponding new subcorpora caused by different processing. These changes may include not only different [[en:pojmy:lemma|lemmatization]], but also different frequency of word forms or different number of positions, as these are the results of the tokenization. | The possibility of searching the revised texts of all the SYN-series corpora joined together is supplemented also by the possibility of creating subcorpora which have the same composition as the original corpora. This is enabled due to the attribute ''<doc syn>'', e.g. a subcorpus corresponding to SYN2005 can be created by applying the condition ''syn=<nowiki>"</nowiki>2005<nowiki>"</nowiki>'' on the structural attribute ''<doc>''. Of course, this condition may be further combined with other ones that specify required text type, publication date etc. More information can be found in the [[kurz:pokrocile_dotazy|manual]] (Czech only). It is therefore possible to use corpus SYN also for **work with older representative corpora** re-processed by the latest corpus tools. Naturally, there may be found differences between the original corpora and the corresponding new subcorpora caused by different processing. These changes may include not only different [[en:pojmy:lemma|lemmatization]], but also different frequency of word forms or different number of positions, as these are the results of the tokenization. |
| |
====== Advantages of the SYN corpus ====== | ====== Advantages of the SYN corpus ====== |
| |
* access to extensive language data (more than 4.5 billion words) | * access to extensive language data (more than 5 billion words) |
* possibility to search all the SYN-series corpora at the same time | * possibility to search all the SYN-series corpora at the same time |
* possibility to create subcorpora that correspond to the original corpora | * possibility to create subcorpora that correspond to the original corpora |
* re-processing of the original corpora by continuously improved tools | * re-processing of the original corpora by continuously improved tools |
* referentiality, i.e. every version is an invariable entity ensuring that identical queries always give identical results | * referentiality, i.e. its individual versions are invariable entities that remain unchanged once published |
| |
====== How to cite SYN ====== | ====== How to cite SYN ====== |
| |
| |
--- //Michal Křen, Olga Richterová// | --- //Michal Křen, Olga Richterová, Michal Škrabal// |
| |
====== Related links ====== | ====== Related links ====== |