en:cnk:syn2010 [2016/12/11 11:35] – [Composition of SYN2010] veronikapojarova | en:cnk:syn2010 [2016/12/11 16:27] (current) – veronikapojarova |

SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It is a sequel to the corpora [[en:cnk:SYN2000]] and [[en:cnk:SYN2005]] and together with them forms a series of synchronic representative corpora that cover three successive periods.
**All three corpora contain different texts and are therefore disjoint**. The basic characteristics of SYN2010 are identical to those of [[en:cnk:SYN2005|SYN2005]], mainly because both corpora share the same conception of [[en:pojmy:reprezentativnost|representativeness]], which is based on the reception of written language and determines the composition of the corpus. The SYN2010 corpus is [[en:pojmy:lemma|lemmatized]] and [[en:pojmy:tag|morphologically tagged]].

====== Changes compared to the SYN2005 corpus ======

Compared to [[en:cnk:SYN2005|SYN2005]], the SYN2010 corpus saw **significant improvements in lemmatization** and **[[en:pojmy:tag|morphological tagging]]**; both are essentially identical to the processing used for the [[en:cnk:SYN2009PUB|SYN2009PUB]] corpus. Therefore, although [[en:cnk:SYN2005|SYN2005]] and SYN2010 do not differ in their understanding of [[en:pojmy:reprezentativnost|representativeness]], **these differences in processing should be taken into account** when comparing their lexical frequencies.

====== Composition of SYN2010 ======