en:manualy:kwords [2016/11/20 17:13] – [Obrázky aplikace] veronikapojarova | en:manualy:kwords [2023/11/13 10:01] (current) – [KWords] vaclavcvrcek |
---|
====== KWords ====== | ====== KWords ====== |
| |
{{ kurz:kwords-logo.png?nolink&200|}} | {{ :manualy:kwords_logo_v2.png?nolink&|}} |
| |
The KWords application is used for the analysis of texts based on their comparison with the general usage ([[en:pojmy:referencni|reference]] corpus). Its aim is to identify so-called [[en:pojmy:keyword|keywords]], which are [[en:pojmy:word|word forms]] appearing in the inspected text with a significantly higher frequency than in the reference corpus which should reflect the common usage. These key words serve as a basis for textual analysis and interpretation. | The KWords application is used for the analysis of texts based on their comparison with the general usage ([[en:pojmy:referencni|reference]] corpus). Its aim is to identify so-called [[en:pojmy:keyword|keywords]], which are [[en:pojmy:word|word forms]] or [[en:pojmy:lemma|lemmas]] appearing in the inspected text with a significantly higher frequency than in the reference corpus which should reflect the common usage. These key words serve as a basis for textual analysis and interpretation. |
| |
KWords is an online application (the only thing we need to use it is a web browser) and it is accessible without [[en:kurz:zaciname|registration]] to all users at **[[http://kwords.korpus.cz|kwords.korpus.cz]]**. | KWords is an online application (the only thing we need to use it is a web browser) and it is accessible without [[en:kurz:zaciname|registration]] to all users at **[[http://kwords.korpus.cz|kwords.korpus.cz]]**. |
| |
The KWords application was originally created for the purpose of analyzing political speeches, and is being developed further in cooperation with [[http://www.brown.edu|Brown University]]. It is currently implemented for the analysis of Czech and English texts of up to approx. 20 thousand words. | The first version of KWords was developed for the purpose of analyzing political speeches in collaboration with [[http://www.brown.edu|Brown University]]. The second version was developed as part of the [[https://threat-defuser.org|Threat-defuser project]]. This version supports more than 30 languages and allows keyword analysis as well as keymorph analysis.((see Fidler, M. - Cvrček, V.: [[https://doi.org/10.1515/cllt-2016-0073|Keymorph analysis, or how morphosyntax informs discourse]]. Corpus Linguistics and Linguistic Theory. 15/1, p. 39–70.)) |
| |
===== Prominent units ===== | ===== Prominent units ===== |
The identification of [[en:pojmy:keyword|keywords]] takes place based on a comparison of each word's relative [[en:pojmy:frekvence|frequency]] in the given text with the same word's relative frequency in the reference corpus. Several tests are used to determine the statistical significance of the differences, two of which are implemented in KWords: [[en:pojmy:chi2|chi2]] and [[en:pojmy:loglikelihood|log-likelihood]]. Keywords in the analyzed text are marked <fc #ff0000>red</fc>. | The identification of [[en:pojmy:keyword|keywords]] takes place based on a comparison of each word's relative [[en:pojmy:frekvence|frequency]] in the given text with the same word's relative frequency in the reference corpus. Several tests are used to determine the statistical significance of the differences, two of which are implemented in KWords: [[en:pojmy:chi2|chi2]] and [[en:pojmy:loglikelihood|log-likelihood]]. Keywords in the analyzed text are marked <fc #ff0000>red</fc>. |
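As an illustration of the significance testing mentioned above, the following is a minimal sketch of the log-likelihood statistic in its common two-term form used in corpus linguistics. It is not taken from the KWords source code; the function name, the argument names and the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom) are assumptions made for this example.

```python
import math


def log_likelihood(target_count, target_size, ref_count, ref_size):
    """Two-term log-likelihood score for one word (illustrative sketch).

    target_count / target_size: occurrences of the word and total tokens in the text
    ref_count / ref_size: the same figures for the reference corpus
    """
    total = target_count + ref_count
    # Expected counts if the word were equally frequent in both sources
    expected_target = target_size * total / (target_size + ref_size)
    expected_ref = ref_size * total / (target_size + ref_size)

    score = 0.0
    for observed, expected in ((target_count, expected_target),
                               (ref_count, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Hypothetical figures: 50 hits in a 1,000-word text vs. 100 hits in 10,000 words
is_significant = log_likelihood(50, 1000, 100, 10000) > 3.84
```

A word whose relative frequency is the same in both sources scores 0; the score grows as the two relative frequencies diverge.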
| |
The results of the keyword analysis are always influenced by the choice of reference corpus, which should be seen as a neutral language background with which we compare the analyzed text. For example, when analyzing the New Year speeches of the last Communist president G. Husák, we notice that compared to current usage there is a high frequency of words such as //socialistický// (socialistic), //soudružky// (comrades) etc., but this is not the case when compared to a reference corpus from the same period. Currently, the following reference corpora can be used in the KWords application: | The results of the keyword analysis are always influenced by the choice of reference corpus, which should be seen as a neutral language background with which we compare the analyzed text. For example, when analyzing the New Year speeches of the last Communist president G. Husák, we notice that compared to current usage there is a high frequency of words such as //socialistický// (socialistic), //soudružky// (comrades) etc., but this is not the case when compared to a reference corpus from the same period. Currently, the [[en:cnk:intercorp|InterCorp]] parallel corpus is available for all languages as a reference corpus. |
* for Czech | |
* [[en:cnk:syn2015|SYN2015]] | |
* [[en:cnk:syn2010|SYN2010]] | |
* [[en:cnk:syn2005|SYN2005]] | |
* diakon19 -- ad hoc corpus created from available data in the [[en:cnk:struktura#diachronnikorpus|diachronic part of the CNC]] covering the 19th Century | |
* totalita -- a corpus of ideological texts and official journalism from the period of Communist totalitarianism | |
* Oral -- the [[en:cnk:oral2006|Oral2006]] and [[en:cnk:oral2008|Oral2008]] corpora | |
* pub -- the journalistic section of the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]] and [[en:cnk:syn2010|SYN2010]] | |
* bel -- the fiction section of the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]] and [[en:cnk:syn2010|SYN2010]] | |
* odb -- specialized literature from the corpora [[en:cnk:syn2000|SYN2000]], [[en:cnk:syn2005|SYN2005]] and [[en:cnk:syn2010|SYN2010]] | |
* for English | |
* BNC -- [[http://www.natcorp.ox.ac.uk|British National Corpus]] | |
* COCA -- [[http://www.wordfrequency.info/100k.asp|Corpus of Contemporary American English]] | |
* InterCorp-EN v8 -- the English section of the parallel corpus [[en:cnk:intercorp|InterCorp]] | |
==== Thematic concentration ==== | ==== Thematic concentration ==== |
| |
Words which are highlighted in <html><span style="background-color: yellow">yellow</span></html> in the analyzed text are those which bear thematic concentration (TC words). They are not identified through comparison with a reference corpus, but only by their placement in the frequency distribution of the units in the analyzed text: when we arrange all the words in the text from those which are most frequent and down to words which appear only once, we get a so-called [[en:pojmy:zipf|Zipf]] distribution. In this distribution we are looking for a so-called //h// point, for which we can say that rank = frequency (e.g. 32nd most frequent word has a frequency of 32 occurrences). All autosemantic words (bearing meaning independent of context) above this point (i.e. in our case with a frequency higher than 32) we label thematic concentration. More details and a specific application of this approach to literary texts can be found for example in the article of [[http://www.cechradek.cz/publ/2013_Davidova_Cech_Tematicka_koncentrace_Jehlicka_NR.pdf|R. Čech]] (2013). | Words which are highlighted in yellow in the analyzed text are those which bear thematic concentration (TC words). They are not identified through comparison with a reference corpus, but only by their placement in the frequency distribution of the units in the analyzed text: when we arrange all the words in the text from those which are most frequent and down to words which appear only once, we get a so-called [[en:pojmy:zipf|Zipf]] distribution. In this distribution we are looking for a so-called //h// point, for which we can say that rank = frequency (e.g. 32nd most frequent word has a frequency of 32 occurrences). All autosemantic words (bearing meaning independent of context) above this point (i.e. in our case with a frequency higher than 32) we label thematic concentration. 
More details and a specific application of this approach to literary texts can be found for example in the article of [[http://www.cechradek.cz/publ/2013_Davidova_Cech_Tematicka_koncentrace_Jehlicka_NR.pdf|R. Čech]] (2013). |
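The h-point procedure described above can be sketched as follows. This is a simplified illustration, not the KWords implementation: it takes the h-point to be the largest rank whose frequency is at least as high as the rank (published variants interpolate when no rank equals its frequency exactly), and the set of function words passed in as `function_words` is a hypothetical stand-in for a proper test of whether a word is autosemantic.

```python
from collections import Counter


def h_point(frequencies):
    """Return the largest 1-based rank whose frequency is >= that rank."""
    ranked = sorted(frequencies, reverse=True)
    h = 0
    for rank, freq in enumerate(ranked, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


def thematic_words(tokens, function_words):
    """Return words more frequent than the h-point that are not function words."""
    counts = Counter(tokens)
    h = h_point(counts.values())
    return sorted(word for word, freq in counts.items()
                  if freq > h and word not in function_words)


# Toy text: frequencies 5, 4, 3, 2, 1, so the h-point is 3
tokens = ["the"] * 5 + ["king"] * 4 + ["of"] * 3 + ["castle"] * 2 + ["a"]
tc_words = thematic_words(tokens, function_words={"the", "of", "a"})
```

In the toy text only //king// is both more frequent than the h-point and not a function word, so it is the only TC word.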
| |
===== Princip fungování ===== | ===== How it works ===== |
| |
Text vložený uživatelem se nejprve [[pojmy:token|roztokenizuje]] způsobem, který je identický s tokenizací korpusových dat. V druhém kroku je spočtena frekvence všech slov v analyzovaném textu (s výjimkou těch, které uživatel z analýzy vyloučí prostřednictvím tzv. stop-listu, např. předložky, spojky, čísla apod.). Následuje porovnání frekvencí v textu a v referenčním korpusu. Pro jednotky, u nichž byl zaznamenán statisticky signifikantní rozdíl (podle zvoleného statistického testu -- [[pojmy:chi2|chi2]] či [[pojmy:loglikelihood|log-likelihood]]), je dále vypočítána hodnota **DIN** (difference index) vypovídající o relevanci daného rozdílu: | The text inserted by the user is first [[en:pojmy:token|tokenized]] in a way that is identical to the tokenization of the corpus data. In the second step, the frequencies of all the words in the analyzed text are calculated (except for those which the user has excluded from the analysis with the help of a so-called stop-list, e.g. prepositions, conjunctions, numerals etc.). What follows is a comparison of the frequencies in the text and in the reference corpus. For units which display a statistically significant difference (according to the selected statistical test -- [[en:pojmy:chi2|chi2]] or [[en:pojmy:loglikelihood|log-likelihood]]), the **DIN** value is subsequently calculated (difference index), which is indicative of how relevant the difference is: |
| |
$$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$ | $$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$ |
| |
kde $RelFq(Ttxt)$ je relativní frekvence jevu ve zkoumaném textu (target text) a $RelFq(RefC)$ je relativní frekvence téhož jevu v referenčním korpusu. Hodnoty DIN, podle nichž jsou klíčová slova ve výpisu programu seřazena, mohou dosahovat hodnot od -100 do 100, přičemž platí, že: | where $RelFq(Ttxt)$ is the relative frequency of the phenomenon in the analyzed text (target text) and $RelFq(RefC)$ is the relative frequency of the same phenomenon in the reference corpus. The DIN values, which determine the order of the keywords in the program's output, range from -100 to 100, where: |
* hodnota -100 znamená, že daný jev se ve zkoumaném textu nevyskytuje, je pouze v referenčním korpusu (slovo tedy není ve zkoumaném textu prominentní) | * a value of -100 means that the given phenomenon does not occur in the analyzed text and is only in the reference corpus (therefore the word is not prominent in the analyzed text) |
* hodnota 0 znamená, že daný jev má zhruba stejnou relativní frekvenci ve zkoumaném textu i v referenčním korpusu (slovo tedy není ve zkoumaném textu prominentní) | * a value of 0 means that the given phenomenon has approximately the same relative frequency in the analyzed text and in the reference corpus (therefore the word is not prominent in the analyzed text) |
* hodnota 100 značí, že slovo se vyskytuje pouze ve zkoumaném textu (může se tedy jednat o velmi prominentní slovo((V takovýchto případech je třeba mít na paměti, že absence slova v referenčním korpusu je situace zvláštní, která je vždy hodna speciálního pozoru; slovo se v referenčním korpusu nemusí vyksytovat např. proto, že jde o velmi řídký jev, zvlaštní proprium, citátové slovo z jiného jazyka apod.))) | * a value of 100 means that the word occurs only in the analyzed text (it can therefore be a very prominent word ((In such cases it is important to keep in mind that the complete absence of a word from a reference corpus is unusual and special attention should be paid to it; a word might not occur in a reference corpus e.g. because it is very rare, cited from another language etc.))) |
| |
V textech o rozsahu do 20 tisíc slov a při analýze [[pojmy:word|slovních tvarů]] je možné považovat hodnoty DIN v rozmezí 75-100 za velmi zajímavé a značí, že se jedná pravděpodobně o prominentní jednotku, která může dobře posloužit jako východisko pro interpretaci celého textu. | In texts of up to 20 thousand words in length and for analysis of [[en:pojmy:word|word forms]], DIN values in the 75-100 range can be considered to be of interest and they indicate that the unit is probably prominent and can be used as a basis for the interpretation of the entire text. |
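The DIN formula above translates directly into code. The sketch below assumes raw counts and corpus sizes as input; the function and parameter names are chosen for this example and do not come from KWords itself.

```python
def din(target_count, target_size, ref_count, ref_size):
    """Difference index (DIN) computed from raw counts and corpus sizes.

    Relative frequencies are counts divided by the size of the respective
    source, mirroring RelFq(Ttxt) and RelFq(RefC) in the formula.
    """
    rel_target = target_count / target_size
    rel_ref = ref_count / ref_size
    if rel_target + rel_ref == 0:
        raise ValueError("the word occurs in neither the text nor the reference corpus")
    return 100 * ((rel_target - rel_ref) / (rel_target + rel_ref))


# Boundary cases described in the list above
only_in_reference = din(0, 1000, 5, 10000)   # word absent from the analyzed text
same_frequency = din(10, 1000, 100, 10000)   # equal relative frequencies
only_in_text = din(5, 1000, 0, 10000)        # word absent from the reference corpus
```

The three calls reproduce the boundary values -100, 0 and 100. Because DIN uses relative frequencies, the very different sizes of the text and the reference corpus do not distort the comparison.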
| |
Aplikace KWords dále nabízí celou řadu doplňujících informací pro práci s klíčovými slovy. Vedle seznamu klíčových slov spolu s jejich hodnotami je to především graf disperze dat (ukazující postavení jednotlivých klíčových slov v textu), graf tzv. keyword links, tj. vztahů mezi klíčovými slovy v textu a také konkordanci klíčových slov pro analýzu jejich bezprostředního okolí. | Furthermore, the KWords application offers a whole range of additional information for working with keywords. Besides the list of keywords and their values, this includes above all a data dispersion graph (showing the position of the individual keywords in the text), a graph of so-called keyword links, i.e. relations between keywords in the text, and a concordance of keywords for analyzing their immediate context. |
| |
Aplikace KWords byla navržena také pro vytváření analýz časových (nebo jiných) sérií dat. Pokud uživatel vloží na vstupu do aplikace víc textů (maximální množství je 20), aktivuje režim tzv. **multi-analýzy**. V něm jsou analyzovány všechny vložené texty a výsledky z jednotlivých analýz porovnány na základě DIN. | The KWords application was also designed for analyzing temporal (or other) data series. If the user inputs several texts into the application (the maximum is 20), the so-called **multi-analysis** mode is activated. In this mode all the inserted texts are analyzed and the results of the individual analyses are compared based on the DIN. |
===== Application images ===== | ===== Application images ===== |
| |
[{{:kurz:kwords-vstup.png?direct&300|Inputting text into KWords}}] | {{:manualy:kwords2.png?direct&400 |}} |
[{{:kurz:kwords-vystup.png?direct&300|Analyzed text with highlighted keywords}}] | {{:manualy:kwords2_nastaveni.png?direct&400 |}} |
[{{:kurz:kwords-tab.png?direct&300|List of keywords}}] | {{:manualy:kwords2_klicova_slova.png?direct&400|}} |
[{{:kurz:kwords-distrib.png?direct&300|Distribution of keywords throughout the analyzed text}}] | {{:manualy:kwords2_graf.png?direct&400 |}} |
[{{:kurz:kwords-links.png?direct&300|Mutual relations between keywords (keyword links)}}] | {{:manualy:kwords2_distribuce.png?direct&400 |}} |
[{{:kurz:kwords-comp.png?direct&300|Comparison of several speeches -- multi-analysis}}] | {{:manualy:kwords2_konkordance.png?direct&400 |}} |
| {{:manualy:kwords2_links.png?direct&400|}} |
| |
| ===== Application images (previous version)===== |
| |
| [{{:kurz:kwords-vstup.png?direct&400 |Inputting text into KWords}}] |
| [{{:kurz:kwords-vystup.png?direct&400 |Analyzed text with highlighted keywords}}] |
| [{{:kurz:kwords-tab.png?direct&400|List of keywords}}] |
| [{{:kurz:kwords-distrib.png?direct&400 |Distribution of keywords throughout the analyzed text}}] |
| [{{:kurz:kwords-links.png?direct&400 |Mutual relations between keywords (keyword links)}}] |
| [{{:kurz:kwords-comp.png?direct&400|Comparison of several speeches -- multi-analysis}}] |
| |
==== Related links ==== | ==== Related links ==== |
| |
<WRAP round box 49%> | <WRAP round box 49%> |
[[en:manualy:kontext:index|KonText interface]] • [[syd|SyD]] • [[morfio|Morfio]] • [[treq|Treq]] • [[en:pojmy:korpusovy_manazer|Corpus manager]] • [[en:pojmy:nastroje|Corpus tools]] | [[en:manualy:kontext:index|KonText interface]] • [[en:manualy:syd|SyD]] • [[en:manualy:morfio|Morfio]] • [[en:manualy:treq|Treq]] • [[en:pojmy:korpusovy_manazer|Corpus manager]] • [[en:pojmy:nastroje|Corpus tools]] |
</WRAP> | </WRAP> |