Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
en:cnk:syn2006pub [2015/10/21 18:57] – created vaclavhorky | en:cnk:syn2006pub [2021/03/16 11:25] (current) – [Structure of the SYN2006PUB corpus] jankocek | ||
---|---|---|---|
Line 1: | Line 1: | ||
~~NOTOC~~ | ~~NOTOC~~ | ||
- | ====== Corpus SYN2006PUB ======= | + | ====== Corpus SYN2006PUB ====== |
+ | |||
+ | SYN2006PUB is a synchronic corpus of written journalism comprising 300 million words (tokens). It contains exclusively journalistic texts from November 1989 to the end of 2004, that is, the time period covered by corpora [[en: | ||
<WRAP right 35%> | <WRAP right 35%> | ||
Line 7: | Line 9: | ||
^ ::: ^ Number of positions (tokens) without punctuation | 305 785 705 | | ^ ::: ^ Number of positions (tokens) without punctuation | 305 785 705 | | ||
^ ::: ^ Number of word forms (words) | 2 554 069 | | ^ ::: ^ Number of word forms (words) | 2 554 069 | | ||
- | ^ ::: ^ Number of lemmata | + | ^ ::: ^ Number of lemmas |
^ Structural attributes ^ Number of opera | 8 922 | | ^ Structural attributes ^ Number of opera | 8 922 | | ||
^ ::: ^ Number of documents | 1 218 300 | | ^ ::: ^ Number of documents | 1 218 300 | | ||
Line 15: | Line 17: | ||
^ ::: ^ Publication date | 2006 | | ^ ::: ^ Publication date | 2006 | | ||
</ | </ | ||
- | + | ===== Changes compared | |
- | SYN2006PUB is a synchronic corpus of written journalism comprising 300 million words (tokens). It contains exclusively journalistic texts from November 1989 to the end of 2004, that is, the time period covered by corpora [[en: | + | ||
The lemmatization and morphological tagging of the SYN2006PUB corpus have been improved in comparison with the [[en: | The lemmatization and morphological tagging of the SYN2006PUB corpus have been improved in comparison with the [[en: | ||
- | It should be stressed that the SYN2006PUB corpus does not claim to be representative in any way. It is clear from the charts below that the corpus composition is balanced neither by year of issue nor by title. The SYN2006PUB corpus | + | ===== Composition | ||
+ | It should be stressed that the SYN2006PUB corpus does not claim to be representative in any way. It is clear from the charts below that the corpus composition is balanced neither by year of issue nor by title. The SYN2006PUB corpus will thus be appreciated mainly by users who need to work with large amounts of data. | ||
<WRAP clear></ | <WRAP clear></ | ||
- | |||
[{{: | [{{: | ||
- | [{{: | + | [{{: |
+ | ====== Structure of the SYN2006PUB corpus ====== | ||
+ | Among the [[en: | ||
+ | They can be displayed using the menu item [[en: | ||
- | ====== | + | [{{: |
+ | |||
+ | ====== | ||
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
Line 36: | Line 42: | ||
</ | </ | ||
- | + | ====== | |
- | ====== | + | |
<WRAP round box 49%> | <WRAP round box 49%> | ||
[[en: | [[en: | ||
</ | </ | ||
+ |