
Differences

This shows you the differences between two versions of the page.

en:cnk:syn [2017/04/25 18:42] – [Corpus SYN] Michal Křen
en:cnk:syn [2023/12/29 12:21] (current) Michal Křen
Line 1: Line 1:
 ~~NOTOC~~
  
-====== Corpus SYN ======
+====== SYN corpus ======
  
-The **SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table) and which has been processed by the newest versions of the ([[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools).
+**SYN** is a non-reference corpus consisting of texts from all reference [[en:pojmy:synchronni| synchronic]] [[en:pojmy:psany|written]] corpora of the SYN series published up until the given version of the SYN corpus (for example [[en:cnk:syn:verze3|SYN version 3]] from the year 2014 includes the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006pub|SYN2006PUB]], [[en:cnk:syn2009pub|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] and [[en:cnk:syn2013pub|SYN2013PUB]], as can be seen in the following table) and which has been processed by the newest versions of the ([[en:pojmy:token|tokenization]], [[en:pojmy:segmentace|segmentation]], [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:desambiguace|disambiguation]] tools).
  
 The SYN corpus is not representative, as the vast majority of the texts belongs to the category of newspapers and magazines, which is due to their easy accessibility.
Line 11: Line 11:
 ^ <fs medium> SYN corpus versions</fs> ^^^^
 ^ version ^ year of publication ^ size (no. of words) ^ content ^
-^ [[en:cnk:syn:verze5|SYN version 5]] |  2017  | FIXME  3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
-^ [[en:cnk:syn:verze4|SYN version 4]] |  2016  |  3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other other journalistic texts |
+^ [[en:cnk:syn:verze12|SYN version 12]] |  2023  |  5.175G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze11|SYN version 11]] |  2022  |  5.032G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze10|SYN version 10]] |  2022  |  4.882G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze9|SYN version 9]] |  2021  |  4.719G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], [[en:cnk:syn2020|SYN2020]], other journalistic texts |
+^ [[en:cnk:syn:verze8|SYN version 8]] |  2019  |  4.499G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
+^ [[en:cnk:syn:verze7|SYN version 7]] |  2018  |  4.255G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
+^ [[en:cnk:syn:verze6|SYN version 6]] |  2017  |  4.033G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
+^ [[en:cnk:syn:verze5|SYN version 5]] |  2017  |  3.836G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
+^ [[en:cnk:syn:verze4|SYN version 4]] |  2016  |  3.626G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]], [[en:cnk:syn2015|SYN2015]], other journalistic texts |
 ^ [[en:cnk:syn:verze3|SYN version 3]] |  2014  |  2.232G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]], [[en:cnk:syn2013PUB|SYN2013PUB]] |
 ^ SYN version 2 |  2010  |  1.3G | [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]], [[en:cnk:syn2006PUB|SYN2006PUB]], [[en:cnk:syn2009PUB|SYN2009PUB]], [[en:cnk:syn2010|SYN2010]] |
Line 21: Line 28:
  
 ==== The current SYN as a solution ====
-That is why the SYN corpus was introduced and can be thought of as a sort of "pie", including the slices which are all the synchronic written corpora, re-processed by state-of-the-art versions of the tools including tokenization, segmentation, morphological analysis and disambiguation (which is on the same level as in [[en:cnk:SYN2013PUB]]).
+That is why the SYN corpus was introduced and can be thought of as a sort of "pie", including the slices which are all the synchronic written corpora, **re-processed by state-of-the-art versions of the tools** including tokenization, segmentation, morphological analysis and disambiguation.
  
 ==== Reference corpora as subcorpora in SYN ====
-The possibility of searching the revised texts of all the SYN-series corpora joined together is supplemented also by the possibility of creating subcorpora which have the same composition as the original corpora. This is enabled due to the attribute ''<doc syn>'', e.g. a subcorpus corresponding to SYN2005 can be created by applying the condition ''syn=<nowiki>"</nowiki>2005<nowiki>"</nowiki>'' on the structural attribute ''<doc>''. Of course, this condition may be further combined with other ones that specify required text type, publication date etc. More information can be found in the [[kurz:pokrocile_dotazy|manual]] (Czech only). **It is therefore possible to use corpus SYN also for work with older representative corpora re-processed by the latest corpus tools.** Naturally, there may be found differences between the original corpora and the corresponding new subcorpora caused by different processing. These changes may include not only different [[en:pojmy:lemma|lemmatization]], but also different frequency of word forms or different number of positions, as these are the results of the tokenization.
+The possibility of searching the revised texts of all the SYN-series corpora joined together is supplemented also by the possibility of creating subcorpora which have the same composition as the original corpora. This is enabled due to the attribute ''<doc syn>'', e.g. a subcorpus corresponding to SYN2005 can be created by applying the condition ''syn=<nowiki>"</nowiki>2005<nowiki>"</nowiki>'' on the structural attribute ''<doc>''. Of course, this condition may be further combined with other ones that specify required text type, publication date etc. More information can be found in the [[kurz:pokrocile_dotazy|manual]] (Czech only). It is therefore possible to use corpus SYN also for **work with older representative corpora** re-processed by the latest corpus tools.
+Naturally, there may be found differences between the original corpora and the corresponding new subcorpora caused by different processing. These changes may include not only different [[en:pojmy:lemma|lemmatization]], but also different frequency of word forms or different number of positions, as these are the results of the tokenization.
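
As an illustration of the ''syn'' condition described above, the following minimal sketch builds a query string restricted to documents carrying a given ''<doc syn>'' value. It rests on assumptions not stated on this page: that the query interface accepts the CQL form ''within <doc ... />'' for structural restrictions, and that ''[lemma="pes"]'' is a sensible sample query. The function name is hypothetical; only the attribute ''syn'' and the value ''2005'' come from the text.

```python
def restrict_to_syn_subcorpus(query: str, syn_value: str) -> str:
    """Append a restriction on the <doc syn="..."> attribute to a CQL query.

    Assumption: the target query engine understands the CQL
    `within <doc attr="value" />` construct. This helper only builds
    the string; it does not contact any corpus server.
    """
    if '"' in syn_value:
        # A double quote would break out of the attribute value.
        raise ValueError("syn_value must not contain a double quote")
    return f'{query} within <doc syn="{syn_value}" />'


# Hypothetical sample query: the lemma "pes", limited to the texts of SYN2005.
example = restrict_to_syn_subcorpus('[lemma="pes"]', "2005")
print(example)
```

The same condition can be entered by hand in a subcorpus-creation form; the helper is only convenient when many queries need the same restriction.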
  
-As a **non-reference** corpus, the SYN corpus may be modified in the future for various reasons, e.g. correction of errors, significant improvement of morphological analysis and/or disambiguation, or inclusion of future (so far only planned) synchronic written corpora. Such an update will therefore be irregular; however, **it will not happen more often than once a year**. The SYN corpus will thus still retain its character as a //non-reference unification of all the SYN-series corpora consistently re-processed with state-of-the-art versions of available tools.// 
 ====== Advantages of the SYN corpus ======
  
-  * access to extensive language data (more than 4.5 billion words)
-  * it is possible to search all the SYN-series corpora at the same time
-  * it is possible to create subcorpora that correspond to the original corpora
+  * access to extensive language data (more than 5 billion words)
+  * possibility to search all the SYN-series corpora at the same time
+  * possibility to create subcorpora that correspond to the original corpora
   * re-processing of the original corpora by continuously improved tools
+  * referentiality, i.e. its individual versions are invariable entities that remain unchanged once published
  
 ====== How to cite SYN ======
Line 41: Line 48:
  
  
- --- //Michal Křen, Olga Richterová//
+ --- //Michal Křen, Olga Richterová, Michal Škrabal//
  
 ====== Related links ======