lexDivWord) is available. For such languages, it is calculated over tokens rather than words, i.e. punctuation is not ignored. This is why lexDivWord values may be lower than expected when compared with texts in linguistically annotated languages.

Alexandr Rosen (2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. Recording from 14 October 2024: Natural Language Processing Seminar organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences; slides.
Alexandr Rosen (2024). Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Slides from the seminar at the University of Warsaw, 10 July 2024.