Differences
This shows you the differences between two versions of the page.
en:pojmy:din [2019/09/27 10:22] – [How it works] vaclavcvrcek | en:pojmy:din [2019/10/15 20:59] (current) – vaclavcvrcek | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== DIN ====== | ====== DIN ====== | ||
- | DIN (Difference index) is a so called effect-size metric, i.e. measure designed((see Fidler, M. - Cvrček, V.: {{: | + | The DIN (Difference index) is a so called effect-size metric, i.e. a measure designed((see Fidler, M. - Cvrček, V.: {{: |
===== Significance and relevance ===== | ===== Significance and relevance ===== | ||
- | When comparing values (e.g. frequencies of words) we should be interested | + | When comparing values (e.g. frequencies of words) we should be aware not only of the statistical significance, but also whether the difference under consideration is actually |
+ | |||
+ | Even if the difference is significant, | ||
- | Even if the difference is signifiacnt it does not necesarily entails that it is relevant for the description. Even a small difference can be significant when there is a lot of results available. That is why the statistical significance information is often combined with the effect-size. | ||
===== How it works ===== | ===== How it works ===== | ||
- | In the model example of extracting | + | In this model example |
$$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$ | $$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$ | ||
- | where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus. | + | where //RelFq(Ttxt)// is the relative frequency of the phenomenon in the analyzed text (target text) and //RelFq(RefC)// is the relative frequency of the same phenomenon in the reference corpus. |
- | The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison. This reference frequency level can be represented by their average value, as can be seen in this formula which is equivalent | + | The formula takes into account the difference between relative frequencies (numerator) in relation to the frequency level of the items under comparison |
Line 22: | Line 23: | ||
===== DIN Values ===== | ===== DIN Values ===== | ||
- | The DIN values are designed to reach values from -100 to 100, it being understood that: | + | The DIN is designed to reach values from -100 to 100, it being understood that: |
* a value of -100 means that the given phenomenon does not occur in the analyzed text and is only in the reference corpus (therefore the word is not prominent in the analyzed text) | * a value of -100 means that the given phenomenon does not occur in the analyzed text and is only in the reference corpus (therefore the word is not prominent in the analyzed text) | ||
* a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text) | * a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text) | ||
- | * a value of 100 means that the word occurs only in the analyzed text (it can therefore be a very prominent word ((In such cases it is important to keep in mind that the complete absence of a word from a reference corpus is unusual and special attention should be paid to it; a word might not occur in a reference corpus e.g. because it is very rare, cited from another language etc.))) | + | * a value of 100 means that the word occurs only in the analyzed text (it can therefore be a very prominent word ((In such cases it is important to keep in mind that the complete absence of a word from a reference corpus is unusual and special attention should be paid to it; a word might not occur in a reference corpus e.g. because it is very rare, cited from another language, etc.))) |
- | In texts of up to 20 thousand words in length and for analysis of [[en: | + | In texts of up to 20 thousand words in length and for the analysis of [[en: |
- | Furthermore, | + | Furthermore, |
The KWords application was also designed for analyzing temporal (or other) data series. If the user inputs multiple texts into the application (the maximum number is 20), the so-called **multi-analysis** mode is activated. This mode analyzes all the inserted texts, and the results of the individual analyses are compared based on the DIN. | The KWords application was also designed for analyzing temporal (or other) data series. If the user inputs multiple texts into the application (the maximum number is 20), the so-called **multi-analysis** mode is activated. This mode analyzes all the inserted texts, and the results of the individual analyses are compared based on the DIN. |
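The DIN formula and its boundary values from the sections above can be illustrated with a minimal Python sketch. This is not the KWords implementation: the function names `rel_fq` and `din`, the per-million normalization, and all counts in the usage part are assumptions made for illustration only.

```python
def rel_fq(count: int, size: int, per: int = 1_000_000) -> float:
    """Relative frequency of a phenomenon per `per` tokens.

    The per-million unit is an assumption; any unit works as long as
    the target text and the reference corpus use the same one.
    """
    if size <= 0:
        raise ValueError("size must be a positive number of tokens")
    return count * per / size


def din(rel_fq_ttxt: float, rel_fq_refc: float) -> float:
    """DIN = 100 * (RelFq(Ttxt) - RelFq(RefC)) / (RelFq(Ttxt) + RelFq(RefC))."""
    total = rel_fq_ttxt + rel_fq_refc
    if total == 0:
        # The phenomenon occurs in neither source, so the ratio is undefined.
        raise ValueError("DIN is undefined when both relative frequencies are 0")
    return 100 * (rel_fq_ttxt - rel_fq_refc) / total


if __name__ == "__main__":
    # Hypothetical counts, invented purely for illustration.
    reference = rel_fq(count=200, size=1_000_000)
    texts = {
        "text_a": (6, 10_000),   # (occurrences, text length in tokens)
        "text_b": (2, 10_000),
        "text_c": (0, 10_000),
    }
    # Comparing several texts against one reference corpus,
    # in the spirit of the multi-analysis described above.
    for name, (count, size) in texts.items():
        print(name, round(din(rel_fq(count, size), reference), 1))
```

Because the same normalization appears in the numerator and the denominator, the result does not depend on the chosen unit (per million, per thousand, and so on). A word that occurs only in the target text yields 100, a word that occurs only in the reference corpus yields -100, and equal relative frequencies yield 0.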