Differences
This shows you the differences between two versions of the page.
en:pojmy:lexikalni_bohatost [2024/08/28 23:32] – alexandrrosen | en:pojmy:lexikalni_bohatost [2024/09/08 14:25] (current) – alexandrrosen | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Lexical Diversity ====== | ====== Lexical Diversity ====== | ||
- | InterCorp release 16ud is annotated by two measures of lexical diversity. They are specified as metadata for each text of sufficient length, for each linguistically annotated language. For languages which are not linguistically annotated, only one of the two measures is available: lexDivWord.((Note that for such languages the calculation of the lexDivWord measure is based on tokens rather than words, i.e. punctuation is not ignored. This is why the lexDivWord values may be lower than expected in comparison to other texts in linguistically annotated languages.)) See also [[measures of syntactic complexity|en: | + | InterCorp release 16ud is annotated by two measures of lexical diversity. They are specified as metadata for each text of sufficient length, for each linguistically annotated language. For languages which are not linguistically annotated, only one of the two measures is available: lexDivWord.((Note that for such languages the calculation of the lexDivWord measure is based on tokens rather than words, i.e. punctuation is not ignored. This is why the lexDivWord values may be lower than expected in comparison to other texts in linguistically annotated languages.)) See also [[en: |
Line 12: | Line 12: | ||
The measures are based on the type-token ratio. They show the average number of different types (word forms or lemmas) in a moving window of 1000 tokens. If the text has fewer than 1000 tokens, the measures are not defined and the value of both attributes equals the underscore character ('' | The measures are based on the type-token ratio. They show the average number of different types (word forms or lemmas) in a moving window of 1000 tokens. If the text has fewer than 1000 tokens, the measures are not defined and the value of both attributes equals the underscore character ('' | ||
| | ||
+ | ===== References ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
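
The moving-window measure described in the page text can be illustrated with a minimal Python sketch. It is not the InterCorp implementation: the function name `moving_window_diversity`, the handling of overlapping windows (shifted by one token at a time), and the lack of any scaling of the result are assumptions made for illustration. The input is a list of word forms or lemmas; texts shorter than the window return the underscore, as the page describes.

```python
from collections import Counter


def moving_window_diversity(tokens, window=1000):
    """Mean number of distinct types over all windows of `window` tokens.

    Hypothetical helper: returns "_" when the text is shorter than the
    window, mirroring the undefined value described in the page text.
    """
    if len(tokens) < window:
        return "_"

    # Count types in the first window.
    counts = Counter(tokens[:window])
    total = len(counts)
    n_windows = 1

    # Slide the window one token at a time, updating counts incrementally.
    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total += len(counts)
        n_windows += 1

    return total / n_windows
```

Passing word forms would correspond to a measure like lexDivWord, and passing lemmas to its lemma-based counterpart. Dividing the result by `window` would give an average type-token ratio instead of an average type count.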