
Corpus SYN

SYN is a non-reference corpus consisting of texts from all reference synchronic written corpora of the SYN series published up to the given version of SYN. For example, SYN version 3 from 2014 includes the corpora SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010 and SYN2013PUB, as can be seen in the table below. All of these texts have been re-processed with the newest versions of the tools for tokenization, segmentation, morphological analysis and disambiguation.

The SYN corpus is not representative: the vast majority of its texts are newspapers and magazines, which is due to their easy accessibility.

The SYN corpus is versioned, which means that each of its individual versions is a reference entity. Beginning with version 3, all versions remain accessible to users (please keep in mind that, because a version does not change once published, its linguistic annotation gradually becomes obsolete). Starting with version 4, the individual versions of the SYN corpus will be published annually, each adding current journalistic data. Every such addition will be assigned a value of the attribute <doc syn>, equal to the name of the SYN version in which the given text first appeared.

SYN corpus versions
version       | year of publication | size (no. of words) | content
SYN version 5 | 2017                | 3.836G              | SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010, SYN2013PUB, SYN2015, other journalistic texts
SYN version 4 | 2016                | 3.626G              | SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010, SYN2013PUB, SYN2015, other journalistic texts
SYN version 3 | 2014                | 2.232G              | SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010, SYN2013PUB
SYN version 2 | 2010                | 1.3G                | SYN2000, SYN2005, SYN2006PUB, SYN2009PUB, SYN2010
SYN version 1 | 2007                | 500M                | SYN2000, SYN2005, SYN2006PUB
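
The sizes listed above can be compared numerically. The following Python sketch converts them to word counts and computes how many words each version added over the previous one. It assumes that the suffixes M and G stand for millions and billions of words; the names SIZES, parse_size and growth are illustrative and not part of any corpus tool.

```python
from decimal import Decimal

# Sizes copied from the table above, oldest version first.
SIZES = {
    "SYN version 1": "500M",
    "SYN version 2": "1.3G",
    "SYN version 3": "2.232G",
    "SYN version 4": "3.626G",
    "SYN version 5": "3.836G",
}

# Assumption: M = 10^6 words, G = 10^9 words.
MULTIPLIERS = {"M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size such as '1.3G' into a number of words."""
    number, suffix = text[:-1], text[-1]
    # Decimal avoids floating-point rounding in the multiplication.
    return int(Decimal(number) * MULTIPLIERS[suffix])


def growth(sizes: dict[str, str]) -> dict[str, int]:
    """Return the number of words each version added over its predecessor."""
    names = list(sizes)
    return {
        current: parse_size(sizes[current]) - parse_size(sizes[previous])
        for previous, current in zip(names, names[1:])
    }


if __name__ == "__main__":
    for version, added in growth(SIZES).items():
        print(f"{version}: +{added:,} words")
```

Under these assumptions, the step from version 4 to version 5 amounts to 210 million words.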

Comparing the SYN series corpora

Before its publication, each of the SYN-series corpora was processed using the newest versions of the tools available at the time of its compilation: tokenization (division of the corpus into tokens), segmentation (sentence boundary detection), morphological analysis and disambiguation. At the same time, all the SYN-series corpora were designed as reference corpora, i.e. invariable entities that remain unchanged once published. As a consequence, these corpora preserve the results of processing the texts with older versions of the tools, which makes their markup gradually more obsolete. It also makes the markup of the individual corpora mutually incompatible, which complicates any comparison of data based on them. The improvements in corpus processing made since 2000 are significant: many newly recognized word forms, a different approach to certain language phenomena, more reliable disambiguation with the rule-based component, completed and unified bibliographical information, etc. However, these improvements could not be incorporated into the already published corpora without either violating their reference status or introducing revision control, which would be confusing for most users.

The current SYN as a solution

That is why the SYN corpus was introduced. It can be thought of as a sort of “pie” whose slices are all the synchronic written corpora, re-processed by state-of-the-art versions of the tools for tokenization, segmentation, morphological analysis and disambiguation.

Reference corpora as subcorpora in SYN

In addition to searching the revised texts of all the SYN-series corpora together, it is possible to create subcorpora that have the same composition as the original corpora. This is made possible by the attribute <doc syn>: for example, a subcorpus corresponding to SYN2005 can be created by applying the condition syn="2005" to the structural attribute <doc>. This condition may of course be combined with others that specify the required text type, publication date, etc. More information can be found in the manual (Czech only). The SYN corpus can therefore also be used to work with the older representative corpora as re-processed by the latest corpus tools. Naturally, the different processing may cause differences between the original corpora and the corresponding new subcorpora. These differences may include not only different lemmatization, but also different frequencies of word forms or a different number of positions, as these result from the tokenization.
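
As a rough illustration of the condition described above, the following Python sketch builds a query string that restricts a search to documents with a given <doc> attribute value. It assumes that the corpus is queried in CQL and that the restriction is written in the form within <doc ... />, with several conditions joined by &. The helper within_doc and the example query are hypothetical, and any attribute other than syn would have to be checked against the corpus documentation.

```python
def within_doc(query: str, **doc_attrs: str) -> str:
    """Restrict a CQL query to <doc> structures matching the given attributes.

    Assumption: the query engine accepts the form
    `<query> within <doc name="value" & other="value" />`.
    """
    if not doc_attrs:
        # Nothing to restrict; return the query unchanged.
        return query
    conditions = " & ".join(
        f'{name}="{value}"' for name, value in doc_attrs.items()
    )
    return f"{query} within <doc {conditions} />"


if __name__ == "__main__":
    # Search for a lemma only in texts that correspond to SYN2005.
    print(within_doc('[lemma="pes"]', syn="2005"))
```

Keeping the restriction in a small helper makes it easy to run the same query against subcorpora corresponding to different original corpora by changing only the syn value.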

Advantages of the SYN corpus

  • access to extensive language data (more than 4.5 billion words)
  • possibility to search all the SYN-series corpora at the same time
  • possibility to create subcorpora that correspond to the original corpora
  • re-processing of the original corpora by continuously improved tools
  • referentiality of individual versions, i.e. each version is an invariable entity that remains unchanged once published

How to cite SYN

Hnátková, M. – Křen, M. – Procházka, P. – Skoumalová, H. (2014): The SYN-series corpora of written Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 160–164. Reykjavík: ELRA. ISBN 978-2-9517408-8-4.

Michal Křen, Olga Richterová

Related links