Differences

This shows you the differences between two versions of the page.

--- en:pojmy:din [2019/09/27 10:07] – vaclavcvrcek
+++ en:pojmy:din [2019/09/27 10:29] – [How it works] vaclavcvrcek
@@ Line 1: / Line 1: @@
 ====== DIN ======
-DIN (Difference index) is a so called effect-size metric, i.e. measure designed((see Fidler, M. - Cvrček, V.: {{:pojmy:josl-separat.pdf|A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis}})) for the purpose of quantifying the relevance of a difference between values. DIN is implemented in extracting prominent units from a text (keywords) in the [[en:manualy:kwords|KWords]] tool.
+DIN (Difference index) is a so-called effect-size metric, i.e. a measure designed((see Fidler, M. - Cvrček, V.: {{:pojmy:josl-separat.pdf|A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis}})) for the purpose of quantifying the relevance of a difference between values. DIN is implemented for extracting prominent units from a text (keywords) in the [[en:manualy:kwords|KWords]] tool.
+===== Significance and relevance =====
+When comparing values (e.g. frequencies of words), we should consider not only the statistical significance of a difference but also whether the difference is actually relevant for the description. Statistical significance can be assessed with several tests (e.g. the chi2 test, Fisher's test or the log-likelihood test).((It is unimportant here that these tests can also be employed as association measures for the extraction of collocations)) Significance is usually expressed as a p-value, i.e. the probability of observing a difference at least as large as the one measured if it were caused only by chance or variation within the data.
+Even if the difference is significant, this does not necessarily entail that it is also relevant for the description. Even a small difference can be significant when many measurements are available. That is why information about statistical significance is often combined with an effect-size estimate.
 ===== How it works =====
-The text inserted by the user is first [[en:pojmy:token|tokenized]] in a way that is identical to the tokenization of the corpus data. In the second step, the frequencies of all the words in the analyzed text are calculated (except for those which the user has excluded from the analysis with the help of a so-called stop-list, e.g. prepositions, conjunctions, numerals etc.). What follows is a comparison of the frequencies in the text and in the reference corpus. For units which display a statistically significant difference (according to the selected statistical test -- [[en:pojmy:chi2|chi2]] or [[en:pojmy:loglikelihood|log-likelihood]]), the **DIN** value is subsequently calculated (difference index), which is indicative of how relevant the difference is:
+In the model example of extracting prominent words (keywords) from a text we proceed in the following way. For units which display a statistically significant difference, the **DIN** value is subsequently calculated:
 $$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$
-where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus. The DIN values, which determine the order of the keywords in the program's output, can reach values from -100 to 100, it being understood that:
+where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus.
+The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison (denominator). This reference frequency level can be represented by their average value, as can be seen in the following formula, which is equivalent to the one above (the coefficient changes from 100 to 50 in order to keep the DIN values within the desired range):
+{{:pojmy:vzorecdin2.png?nolink&350|}}
+===== DIN Values =====
+DIN values are designed to range from -100 to 100, where:
   * a value of -100 means that the given phenomenon does not occur in the analyzed text and is only in the reference corpus (therefore the word is not prominent in the analyzed text)
   * a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text)
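
The DIN formula quoted in the diff can be tried out with a short sketch. This is only an illustration, not the KWords implementation: the function names `relative_frequency` and `din`, the per-million normalization, and the handling of a unit that is absent from both sources are assumptions made for the example.

```python
def relative_frequency(count: int, size: int, per: int = 1_000_000) -> float:
    """Relative frequency of a unit, normalized per `per` tokens (assumed: per million)."""
    if size <= 0:
        raise ValueError("corpus or text size must be positive")
    return count * per / size


def din(rel_fq_target: float, rel_fq_ref: float) -> float:
    """DIN = 100 * (RelFq(Ttxt) - RelFq(RefC)) / (RelFq(Ttxt) + RelFq(RefC))."""
    total = rel_fq_target + rel_fq_ref
    if total == 0:
        # Assumption: DIN is undefined when the unit occurs in neither source.
        raise ValueError("unit occurs in neither the target text nor the reference corpus")
    return 100 * (rel_fq_target - rel_fq_ref) / total


def din_average_form(rel_fq_target: float, rel_fq_ref: float) -> float:
    """Equivalent form: coefficient 50, denominator is the average of the two frequencies."""
    average = (rel_fq_target + rel_fq_ref) / 2
    if average == 0:
        raise ValueError("unit occurs in neither the target text nor the reference corpus")
    return 50 * (rel_fq_target - rel_fq_ref) / average


if __name__ == "__main__":
    # Hypothetical counts: 30 hits in a 1,000-token text, 10,000 hits in a 1,000,000-token corpus.
    target = relative_frequency(30, 1_000)
    reference = relative_frequency(10_000, 1_000_000)
    print(din(target, reference))
    print(din_average_form(target, reference))
```

Because the two frequencies enter the formula only as a ratio of their difference to their sum, the normalization base cancels out, so any base works as long as it is the same for both sources. A unit found only in the reference corpus gives -100 and equal relative frequencies give 0, matching the two cases listed above; by the same formula, a unit found only in the analyzed text gives 100.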
