
Differences

This shows you the differences between two versions of the page.

en:pojmy:lexikalni_bohatost [2024/09/30 18:17] – [Lexical Diversity] alexandrrosen
en:pojmy:lexikalni_bohatost [2024/10/18 21:07] (current) – [References] alexandrrosen
Line 2: Line 2:
  
   * InterCorp release 16ud is annotated with the following two measures of lexical diversity. They are specified as metadata for each text of sufficient length, for each linguistically annotated language:
-    * ''lexDivWord'': average number of different word forms per 1000 tokens
+    * **lexDivWord**: average number of different word forms per 1000 tokens
-    * ''lexDivLemma'': average number of different lemmas per 1000 tokens
+    * **lexDivLemma**: average number of different lemmas per 1000 tokens
-  * The measures are based on the type-token ratio. They show the average number of different types (word forms or lemmas) in a moving window of 1000 tokens. Punctuation is ignored.
+  * The measures are based on the type-token ratio metrics. They show the average number of different types (word forms or lemmas) in a moving window of 1000 tokens. Punctuation is ignored.
   * If the text has fewer than 1000 tokens, the measures are not defined and the value of both attributes is the underscore character (''_'').
   * For languages which are not linguistically annotated, only the measure counting word forms (''lexDivWord'') is available. For such languages its calculation is based on tokens rather than words, i.e. punctuation is not ignored. Because frequently repeated punctuation marks occupy positions in the window while adding few new types, the ''lexDivWord'' values for such languages may be lower than those of comparable texts in linguistically annotated languages.
   * In KonText, both measures can be displayed and queried like any other metadata item about a text, such as author or text ID.
-  *
+  * Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table on [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud#detailed_statistics|Detailed statistics]].
   * See also [[en:pojmy:syntakticka_komplexita|measures of syntactic complexity]].
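
The moving-window calculation described in the list above can be sketched as follows. This is a minimal illustration, not the code used to annotate InterCorp: the function names (''is_punctuation'', ''moving_window_diversity''), the one-token step between windows, the case-sensitive comparison of types, and the detection of punctuation by Unicode character category are all assumptions made for the example.

```python
import unicodedata
from collections import Counter
from typing import Optional, Sequence


def is_punctuation(token: str) -> bool:
    """Assumption: a token counts as punctuation if it is non-empty and
    every character belongs to a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def moving_window_diversity(
    tokens: Sequence[str],
    window: int = 1000,
    ignore_punctuation: bool = True,
) -> Optional[float]:
    """Average number of distinct types per window of `window` tokens.

    `tokens` holds word forms (for a lexDivWord-like value) or lemmas
    (for a lexDivLemma-like value). Returns None when the text is
    shorter than one window, i.e. when the measure is undefined.
    """
    if ignore_punctuation:
        tokens = [t for t in tokens if not is_punctuation(t)]
    if len(tokens) < window:
        return None

    # Type counts for the first window.
    counts = Counter(tokens[:window])
    total_types = len(counts)
    n_windows = 1

    # Slide the window by one token (assumed step size), updating the
    # counts incrementally instead of recounting every window.
    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total_types += len(counts)
        n_windows += 1

    return total_types / n_windows
```

Passing a list of word forms gives a value analogous to ''lexDivWord'', and passing lemmas gives one analogous to ''lexDivLemma''. A text shorter than the window yields ''None'', which corresponds to the underscore value in the metadata. Setting ''ignore_punctuation=False'' mimics the token-based calculation described for languages without linguistic annotation.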
  
Line 18: Line 18:
 ===== References =====
  
-[[https://docs.google.com/document/d/1nSPzyhT6oHKUDN8A_uYmWrZH6tAmxTH_pUMOdjg01Eg/edit?usp=sharing|InterCorp a Universal Dependencies – nové možnosti výzkumu]] (workshop on 20 and 27 March 2024, part of the Theoretical-Methodological Seminar of the Institute of Czech Language and Theory of Communication)
+Alexandr Rosen (2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. [[https://www.youtube.com/watch?v=E2ujmqt7Q2E|Recording]] from 14 October 2024: [[https://zil.ipipan.waw.pl/|Natural Language Processing Seminar]] organised by the [[https://zil.ipipan.waw.pl|Linguistic Engineering Group]] at the [[https://ipipan.waw.pl|Institute of Computer Science]], [[https://pan.pl|Polish Academy of Sciences]], [[https://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2024-10-14.pdf|slides]].
  
-[[https://drive.google.com/file/d/1L9yTjj0bTrGgf8lDcOAsJoJOoeYEoPEm/view?usp=sharing|Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics]] (slides from the seminar at the University of Warsaw, 10 July 2024)
+Alexandr Rosen (2024). [[https://drive.google.com/file/d/1L9yTjj0bTrGgf8lDcOAsJoJOoeYEoPEm/view?usp=sharing|Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics]] (slides from the seminar at the University of Warsaw, 10 July 2024)