
DIN

DIN (Difference Index) is a so-called effect-size metric, i.e. a measure designed((see Fidler, M. - Cvrček, V.: A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis)) to quantify the relevance of a difference between values. In the KWords tool, DIN is used to extract prominent units (keywords) from a text.

===== How it works =====

The text inserted by the user is first tokenized in the same way as the corpus data. In the second step, the frequencies of all the words in the analyzed text are calculated (except for those the user has excluded from the analysis by means of a so-called stop-list, e.g. prepositions, conjunctions, numerals etc.). The frequencies in the text are then compared with those in the reference corpus. For units that display a statistically significant difference (according to the selected statistical test, chi2 or log-likelihood), the DIN (difference index) value is subsequently calculated, which indicates how relevant the difference is:

$$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$

where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus.

The DIN values, which determine the order of the keywords in the program's output, range from -100 to 100, where:

  * a value of -100 means that the given phenomenon does not occur in the analyzed text and is found only in the reference corpus (therefore the word is not prominent in the analyzed text)
  * a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text)
  * a value of 100 means that the word occurs only in the analyzed text (it can therefore be a very prominent word)((In such cases it is important to keep in mind that the complete absence of a word from a reference corpus is unusual and special attention should be paid to it; a word might not occur in a reference corpus e.g. because it is very rare, cited from another language etc.))

In texts of up to 20 thousand words in length and for the analysis of word forms, DIN values in the 75-100 range can be considered of interest: they indicate that the unit is probably prominent and can serve as a basis for interpreting the entire text.

Furthermore, the KWords application offers a whole range of additional information for working with keywords. Apart from the list of keywords and their values, this includes a dispersion graph (showing the distribution of the individual keywords in the text), a graph of so-called keyword links, i.e. relations between keywords in the text, and a concordance of keywords for analyzing their immediate context.

The KWords application was also designed for analyzing temporal (or other) data series. If the user inputs more than one text (the maximum is 20), the so-called multi-analysis mode is activated. This mode analyzes all the inserted texts, and the results of the individual analyses are compared on the basis of DIN.
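
The procedure described under "How it works" can be illustrated with a minimal Python sketch. It is not the KWords implementation. The function names `din`, `log_likelihood` and `keywords` are hypothetical, and the input is assumed to be already tokenized. The significance test uses the common two-corpus log-likelihood (G2) formula with an assumed critical value of 3.84 (p < 0.05 at one degree of freedom). Taking the text size as the total token count, including stop-listed words, is likewise an assumption.

```python
import math
from collections import Counter


def din(rel_fq_target, rel_fq_ref):
    """Difference index: 100 * (t - r) / (t + r), ranging from -100 to 100."""
    total = rel_fq_target + rel_fq_ref
    if total == 0:
        raise ValueError("unit occurs in neither the text nor the reference corpus")
    return 100 * (rel_fq_target - rel_fq_ref) / total


def log_likelihood(a, b, n_target, n_ref):
    """Two-corpus log-likelihood (G2) for counts a (text) and b (reference corpus)."""
    expected_a = n_target * (a + b) / (n_target + n_ref)
    expected_b = n_ref * (a + b) / (n_target + n_ref)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def keywords(tokens, ref_counts, ref_size, stop_list=frozenset(), critical=3.84):
    """Return (word, DIN) pairs for significantly different words, highest DIN first."""
    n_target = len(tokens)  # assumption: text size includes stop-listed tokens
    counts = Counter(t for t in tokens if t not in stop_list)
    result = []
    for word, a in counts.items():
        b = ref_counts.get(word, 0)
        # Skip units whose frequency difference is not statistically significant.
        if log_likelihood(a, b, n_target, ref_size) < critical:
            continue
        result.append((word, din(a / n_target, b / ref_size)))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Toy data (invented for illustration only).
tokens = ["dragon"] * 50 + ["the"] * 100 + ["filler"] * 850
ref_counts = {"dragon": 10, "the": 100_000, "filler": 850_000}
result = keywords(tokens, ref_counts, ref_size=1_000_000, stop_list={"the"})
```

In this toy data, "filler" has the same relative frequency in the text and in the reference corpus, so it fails the significance test and is dropped. "dragon" is strongly over-represented and receives a DIN close to 100. The boundary cases follow directly from the formula: `din(0, 5)` gives -100, `din(2, 2)` gives 0 and `din(5, 0)` gives 100.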
