en:pojmy:din [2019/09/27 10:29] – [How it works] Václav Cvrček
en:pojmy:din [2019/10/15 20:59] (current) Václav Cvrček
====== DIN ======
  
The DIN (Difference index) is a so-called effect-size metric, i.e. a measure designed((see Fidler, M. - Cvrček, V.: {{:pojmy:josl-separat.pdf|A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis}})) for the purpose of quantifying the relevance of differences between values. The DIN is implemented for extracting prominent units from a text (keywords) in the [[en:manualy:kwords|KWords]] tool.
  
===== Significance and relevance =====
  
When comparing values (e.g. frequencies of words), we should be aware not only of statistical significance but also of whether the difference under consideration is actually relevant for the description. Statistical significance can be assessed with several tests (e.g. the chi2 test, Fisher's test or the log-likelihood test).((It is not important here that these tests can also be employed as association measures for the extraction of collocations)) Significance is usually expressed as a p-value, i.e. the probability of obtaining a difference at least this large if only chance or variation within the data were at work.
  
Even if the difference is significant, this does not necessarily mean that it is also relevant for the description. Even a small difference can be significant when a large number of measurements is available. That is why statistical significance is often combined with an effect-size estimate.
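
The effect of sample size on significance can be illustrated with a minimal Python sketch. It assumes the two-cell log-likelihood (G2) statistic commonly used in keyword analysis and the usual critical value of 3.84 (p < 0.05, one degree of freedom); the function name and all counts are hypothetical and are not taken from KWords.

```python
import math

CRITICAL_VALUE = 3.84  # chi-square critical value for p < 0.05, 1 degree of freedom


def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Two-cell log-likelihood (G2) for one item in a text and a reference corpus."""
    total = freq_text + freq_ref
    # Expected frequencies if the item were equally common in both samples.
    expected_text = size_text * total / (size_text + size_ref)
    expected_ref = size_ref * total / (size_text + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_text, expected_text), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Same relative frequencies in both scenarios (105 vs. 100 per 100,000 words),
# only the amount of data differs (hypothetical counts).
small = log_likelihood(105, 100_000, 100, 100_000)
large = log_likelihood(105_000, 100_000_000, 100_000, 100_000_000)

print(small > CRITICAL_VALUE)  # False: the 5% difference is not significant
print(large > CRITICAL_VALUE)  # True: the same 5% difference is now significant
```

The relative difference between the two frequencies is identical in both scenarios, which is why a significance test alone says little about how relevant the difference is.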
  
===== How it works =====
  
In the model example of extracting prominent words (keywords) from a text, we proceed as follows. For units which display a statistically significant difference, the **DIN** value is then calculated:
  
$$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$
  
where //RelFq(Ttxt)// is the relative frequency of the phenomenon in the analyzed text (target text) and //RelFq(RefC)// is the relative frequency of the same phenomenon in the reference corpus.
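
The formula can be sketched in a few lines of Python. This is only an illustration of the calculation, not the KWords implementation; the function names and the example counts are hypothetical.

```python
def relative_frequency(count, corpus_size):
    """Relative frequency of an item: occurrences divided by corpus size."""
    return count / corpus_size


def din(rel_fq_text, rel_fq_ref):
    """DIN = 100 * (RelFq(Ttxt) - RelFq(RefC)) / (RelFq(Ttxt) + RelFq(RefC))."""
    total = rel_fq_text + rel_fq_ref
    if total == 0:
        # The index is undefined for an item that occurs in neither sample.
        raise ValueError("item occurs in neither the text nor the reference corpus")
    return 100 * (rel_fq_text - rel_fq_ref) / total


# Hypothetical counts: 30 occurrences in a 10,000-word target text,
# 500 occurrences in a 1,000,000-word reference corpus.
rel_text = relative_frequency(30, 10_000)
rel_ref = relative_frequency(500, 1_000_000)
print(round(din(rel_text, rel_ref), 2))  # roughly 71.43
```

Because both arguments are relative frequencies, the target text and the reference corpus may differ in size without distorting the result.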
  
The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison (denominator). This reference frequency level can be represented by their average value, as can be seen in the following formula, which is equivalent to the above (the coefficient has been changed from 100 to 50 in order to keep the DIN values within the desired range):
  
{{:pojmy:vzorecdin2.png?nolink&350|}}
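
Assuming the image shows the averaged form described in the text, the equivalence of the two formulas can be written out as follows (a reconstruction from the description, not a transcription of the image):

$$DIN = 50 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{\frac{RelFq(Ttxt) + RelFq(RefC)}{2}} = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$

Dividing by the average halves the denominator, so the coefficient must be halved as well for the result to stay between -100 and 100.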
===== DIN Values =====
  
The DIN is designed to take values from -100 to 100, where:
  * a value of -100 means that the given phenomenon does not occur in the analyzed text and occurs only in the reference corpus (therefore the word is not prominent in the analyzed text)
  * a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text)
  * a value of 100 means that the word occurs only in the analyzed text (it can therefore be a very prominent word((In such cases it is important to keep in mind that the complete absence of a word from a reference corpus is unusual and special attention should be paid to it; a word might not occur in a reference corpus e.g. because it is very rare, cited from another language, etc.)))
  
In texts of up to 20 thousand words in length and for the analysis of [[en:pojmy:word|word forms]], DIN values in the 75-100 range may be considered of interest, as they indicate that the unit is probably prominent and can be used as a basis for the interpretation of the entire text.
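
Selecting units in this range can be sketched as a simple filter. The scores below are invented for illustration, and the threshold of 75 applies only under the conditions stated above (word forms, texts of up to 20 thousand words); the function name is hypothetical.

```python
# Hypothetical DIN scores for a few word forms (illustrative values only).
din_scores = {"corpus": 92.4, "keyword": 81.0, "the": 1.3, "however": -12.7}


def prominent_units(scores, threshold=75.0):
    """Return (unit, DIN) pairs at or above the threshold, highest DIN first."""
    selected = [(unit, value) for unit, value in scores.items() if value >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


print(prominent_units(din_scores))  # [('corpus', 92.4), ('keyword', 81.0)]
```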
  
Furthermore, the KWords application offers a whole range of additional information for working with keywords. Apart from the list of keywords and their values, it also features a data dispersion graph (showing the status of the individual keywords in the text), a graph of so-called keyword links, i.e. relations between keywords in the text, and a concordance of keywords, which enables the analysis of their immediate contexts.
  
The KWords application was also designed for analyzing temporal (or other) data series. If the user inputs more than one text into the application (the maximum is 20), the so-called **multi-analysis** mode is activated. This mode analyzes all the inserted texts, and the results of the individual analyses are compared based on the DIN.