Lexical Diversity
InterCorp release 16ud is annotated with two measures of lexical diversity. They are specified as metadata for each text of sufficient length in each linguistically annotated language. For languages that are not linguistically annotated, only one of the two measures is available: lexDivWord. See also en:pojmy:syntakticka_komplexita.
In KonText, they can be displayed and queried like any other metadata items about a text, such as author or text ID.
- lexDivWord: average number of different word forms per 1000 tokens
- lexDivLemma: average number of different lemmas per 1000 tokens
The measures are based on the type-token ratio. They show the average number of different types (word forms or lemmas) in a moving window of 1000 tokens. If the text has fewer than 1000 tokens, the measures are not defined and the value of both attributes is the underscore character (_).
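The moving-window calculation can be illustrated with a minimal Python sketch. It is not the code used to annotate InterCorp. The function name `moving_window_diversity` is hypothetical, and the sketch assumes that the window advances one token at a time and that every full window counts equally towards the average. The page does not specify the step size, case handling, or treatment of punctuation.

```python
from collections import Counter


def moving_window_diversity(tokens, window=1000):
    """Average number of distinct types over all full windows of `window` tokens.

    Assumption: the window slides by one token. Returns "_" when the text
    is shorter than the window, mirroring the undefined value described above.
    """
    n = len(tokens)
    if n < window:
        return "_"

    # Type counts for the first window.
    counts = Counter(tokens[:window])
    total = len(counts)

    # Slide the window: drop the outgoing token, add the incoming one.
    for i in range(window, n):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total += len(counts)

    # Number of full windows is n - window + 1.
    return total / (n - window + 1)


if __name__ == "__main__":
    sample = "a b a c b a".split()
    # A tiny window is used only to keep the example readable.
    print(moving_window_diversity(sample, window=3))
    print(moving_window_diversity(sample))  # shorter than 1000 tokens
```

Under these assumptions, passing a list of word forms gives a value analogous to lexDivWord, and passing a list of lemmas gives one analogous to lexDivLemma.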