Differences

This shows you the differences between two versions of the page.

en:cnk:syn [2019/12/18 15:45] – [Corpus SYN] michalkren
en:cnk:syn [2023/12/29 12:21] (current) michalkren
Line 1: Line 1:
 ~~NOTOC~~

-====== Corpus SYN ======
+====== SYN corpus ======

-The **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table) and which has been processed by the newest versions of the ([[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools).
+**SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table) and which has been processed by the newest versions of the ([[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools).

 The SYN corpus is not representative, as the vast majority of the texts belongs to the category of newspapers and magazines, which is due to their easy accessibility.
Line 11: Line 11:
 ^ <fs medium> SYN corpus versions</fs> ^^^^
 ^ version ^ year of publication ^ size (no. of words) ^ content ^
-^ [[en:cnk:syn:verze7|SYN version 8]] |  2019  |  4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
+^ [[en:cnk:syn:verze12|SYN version 12]] |  2023  |  5.175G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze11|SYN version 11]] |  2022  |  5.032G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze10|SYN version 10]] |  2022  |  4.882G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze9|SYN version 9]] |  2021  |  4.719G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze8|SYN version 8]] |  2019  |  4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
 ^ [[en:cnk:syn:verze7|SYN version 7]] |  2018  |  4.255G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
 ^ [[en:cnk:syn:verze6|SYN version 6]] |  2017  |  4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
Line 31: Line 35:
 ====== Advantages of the SYN corpus ======

-  * access to extensive language data (more than billion words)
+  * access to extensive language data (more than billion words)
   * possibility to search all the SYN-series corpora at the same time
   * possibility to create subcorpora that correspond to the original corpora