Differences
This shows you the differences between two versions of the page.
en:cnk:syn2010 [2016/12/11 11:34] – old revision restored (2015/10/22 21:01) veronikapojarova | en:cnk:syn2010 [2016/12/11 16:26] – [Corpus SYN2010] veronikapojarova | ||
---|---|---|---|
Line 3: | Line 3: | ||
SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It is a sequel to the corpora [[en: | SYN2010 is a synchronic representative corpus of written Czech comprising 100 million tokens. It is a sequel to the corpora [[en: | ||
- | **All corpora contain different texts and are therefore | + | **All corpora contain different texts and are therefore |
Line 28: | Line 28: | ||
Some of the fiction texts may have been published earlier, but as a general rule the corpus consists mainly of newer texts, whereas the proportion of older texts is decreasing. Compared to the SYN2005 corpus, the lemmatization and morphological tagging of the SYN2010 corpus have been significantly improved; both of them correspond to the processing of the [[en: | Some of the fiction texts may have been published earlier, but as a general rule the corpus consists mainly of newer texts, whereas the proportion of older texts is decreasing. Compared to the SYN2005 corpus, the lemmatization and morphological tagging of the SYN2010 corpus have been significantly improved; both of them correspond to the processing of the [[en: | ||
- | <WRAP clear></ | + | ===== The general composition of SYN2010 ===== |
[{{: | [{{: |