
Differences

This shows you the differences between two versions of the page.

en:pojmy:din [2019/09/27 10:22] – [How it works] vaclavcvrceken:pojmy:din [2019/09/27 10:28] vaclavcvrcek
Line 1: Line 1:
 ====== DIN ======
  
-DIN (Difference index) is a so called effect-size metric, i.e. measure designed((see Fidler, M. - Cvrček, V.: {{:pojmy:josl-separat.pdf|A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis}})) for the purpose of quantifying the relevance of a difference between values. DIN is implemented for extracting prominent units from a text (keywords) in the [[en:manualy:kwords|KWords]] tool.
+DIN (Difference index) is a so called effect-size metric, i.e. measure designed((see Fidler, M. - Cvrček, V.: {{:pojmy:josl-separat.pdf|A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis}})) for the purpose of quantifying the relevance of a difference between values. DIN is implemented for extracting prominent units from a text (keywords) in the [[en:manualy:kwords|KWords]] tool.
  
 ===== Significance and relevance =====
  
-When comparing values (e.g. frequencies of words) we should be interested not only in the statistical significance but also whether the difference under consideration is actualy relevant for the description. Statistical significance can be obtained by several tests (e.g. chi2 test, Fisher's test or log-likelihood test).((It is unimportant for the time being that these test can also be employed as association measures for the extraction of collocastions)) Significance is usualy expressed as a p-value, i.e. the probability that the difference is caused by chance or variation within the data.
+When comparing values (e.g. frequencies of words) we should be aware not only of the statistical significance but also whether the difference under consideration is actually relevant for the description. Statistical significance can be obtained by several tests (e.g. chi2 test, Fisher's test or log-likelihood test).((It is unimportant here that these tests can also be employed as association measures for the extraction of collocations)) Significance is usually expressed as a p-value, i.e. the probability that the difference is caused by chance or variation within the data
+
+Even if the difference is significant it does not necessarily entails that it is also relevant for the description. Even a small difference can be significant when there is a lot of measurements available. That is why the statistical significance information is often combined with the effect-size estimation.
  
-Even if the difference is signifiacnt it does not necesarily entails that it is relevant for the description. Even a small difference can be significant when there is a lot of results available. That is why the statistical significance information is often combined with the effect-size. 
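
The point that a small difference becomes significant once enough data is available can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the invented counts, and the choice of a plain Pearson chi2 test on a 2x2 table (the page also mentions Fisher's test and log-likelihood). The value 3.841 is the standard chi2 critical value for p < 0.05 with one degree of freedom.

```python
def din(rel_target, rel_ref):
    """Effect size: DIN computed from two relative frequencies."""
    return 100 * ((rel_target - rel_ref) / (rel_target + rel_ref))


def chi2_2x2(count_target, size_target, count_ref, size_ref):
    """Pearson chi2 statistic for a 2x2 table (word vs. all other tokens)."""
    a, b = count_target, count_ref            # occurrences of the word
    c, d = size_target - a, size_ref - b      # all remaining tokens
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_005 = 3.841  # chi2 critical value, df = 1, p < 0.05

# Hypothetical counts: the same relative frequencies, 100 times more data.
small = (105, 100_000, 100, 100_000)
large = (10_500, 10_000_000, 10_000, 10_000_000)

for label, (ft, nt, fr, nr) in (("small", small), ("large", large)):
    chi2 = chi2_2x2(ft, nt, fr, nr)
    effect = din(ft / nt, fr / nr)
    verdict = "significant" if chi2 > CRITICAL_005 else "not significant"
    print(f"{label}: chi2 = {chi2:.2f} ({verdict}), DIN = {effect:.2f}")
```

In both cases the DIN value is the same and small, while only the larger sample crosses the significance threshold. This is why significance alone does not say whether a difference is relevant.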
 ===== How it works =====
  
-In the model example of extracting prominent words (keywords) from a text we proceed in the following way. For units which display a statistically significant difference, the **DIN** value is subsequently calculated (difference index), which is indicative of how relevant the difference is:
+In the model example of extracting prominent words (keywords) from a text we proceed in the following way. For units which display a statistically significant difference, the **DIN** value is subsequently calculated:
  
 $$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$
Line 16: Line 17:
 where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus.
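
The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the KWords implementation: the function name, the parameter names and the sample counts are hypothetical, and relative frequency is taken here as the raw count divided by the corpus size.

```python
def din(freq_target, size_target, freq_ref, size_ref):
    """DIN from absolute frequencies and corpus sizes (in tokens)."""
    rel_target = freq_target / size_target   # RelFq(Ttxt)
    rel_ref = freq_ref / size_ref            # RelFq(RefC)
    if rel_target + rel_ref == 0:
        # The unit occurs in neither text, so DIN is undefined.
        raise ValueError("unit is absent from both target and reference")
    return 100 * ((rel_target - rel_ref) / (rel_target + rel_ref))


# Hypothetical counts in a 1,000-token text and a 1,000,000-token corpus.
print(din(10, 1_000, 0, 1_000_000))      # unit found only in the target text
print(din(5, 1_000, 5_000, 1_000_000))   # same relative frequency in both
print(din(0, 1_000, 20, 1_000_000))      # unit found only in the reference
```

The three calls show the boundary behaviour that follows from the formula: 100 when the unit is missing from the reference corpus, 0 when both relative frequencies are equal, and -100 when the unit is missing from the target text.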
  
-The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison. This reference frequency level can be represented by their average value, as can be seen in this formula which is equivalent with the above (the coefficient changed from 100 to 50):
+The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison (denominator). This reference frequency level can be represented by their average value, as can be seen in following formula which is equivalent with the above (the coefficient changed from 100 to 50 in order to keep the DIN values within the desired range):
  
 {{:pojmy:vzorecdin2.png?nolink&350|}}
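
The formula in the image is not reproduced in this text. Assuming it divides the difference by the average of the two relative frequencies, as the paragraph describes, the equivalence with the first formula can be written out as follows:

$$DIN = 50 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{\frac{RelFq(Ttxt) + RelFq(RefC)}{2}} = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$

Dividing by the average doubles the fraction, so halving the coefficient from 100 to 50 leaves the value unchanged and keeps DIN between -100 and 100.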