
Differences

This shows you the differences between two versions of the page.

en:manualy:kontext:frekvence [2022/05/19 11:37] Jan Kocek
en:manualy:kontext:frekvence [2023/03/13 14:17] (current) – [Custom settings of frequency distribution] David Lukeš
Line 3: Line 3:
 In the [[en:manualy:kontext:index|KonText interface ]] menu the item //Frequency// includes the function for creating **frequency distribution**. With this function it is possible to get an overview of the [[en:pojmy:typ|types]] (e.g. of different words) in the search results, along with their frequency. If we wish to find all of the nouns in the genitive case and in the plural form, with this function we can determine which [[en:pojmy:word|words]] occur in this particular case and number and how frequently. It is also possible to use frequency distribution to determine the frequency of both the previous and the following units, calculate [[en:pojmy:lemma|lemmas]] in the [[en:pojmy:konkordance|concordance]] or determine the distribution of the wanted phenomenon across different text types and their groups (according to the [[en:pojmy:genre|genre]], [[en:pojmy:txtype|txtype]] etc.). In the [[en:manualy:kontext:index|KonText interface ]] menu the item //Frequency// includes the function for creating **frequency distribution**. With this function it is possible to get an overview of the [[en:pojmy:typ|types]] (e.g. of different words) in the search results, along with their frequency. If we wish to find all of the nouns in the genitive case and in the plural form, with this function we can determine which [[en:pojmy:word|words]] occur in this particular case and number and how frequently. It is also possible to use frequency distribution to determine the frequency of both the previous and the following units, calculate [[en:pojmy:lemma|lemmas]] in the [[en:pojmy:konkordance|concordance]] or determine the distribution of the wanted phenomenon across different text types and their groups (according to the [[en:pojmy:genre|genre]], [[en:pojmy:txtype|txtype]] etc.).
  
-Frequency distribution includes both custom (general) settings and **quick selection** (both are available at the second level of menu)+Frequency distribution includes both custom (general) settings and **quick selection** (both are available at the second level of menu).
-  - **Lemmas** - assesses the query ([[en:pojmy:kwic|KWIC]]) and lists all of the different types of lemmas (attribute [[en:pojmy:lemma|lemma]]), along with their frequency  ((This option is available only for the corpora that have been lemmatized)) +
-  - **Node forms [A=a]** - assesses the query ([[en:pojmy:kwic|KWIC]]) and lists all of the different forms (attribute [[en:pojmy:word|word]] case insensitive), along with their frequency  +
-  - **Doc IDs** - assesses the whole [[en:pojmy:konkordance|concordance]] and lists the text names ([[en:pojmy:atributy_strukturni|structural attributes]] ''name'') in which the wanted phenomenon occurs, along with the frequency of this phenomenon in the individual texts  +
-  - **Text types** - assesses the whole [[en:pojmy:konkordance|concordance]] and lists an overview of  the structural attributes ((The list of the structural attributes might vary depending on the type of corpus usedAccordingly, the result generated by this option might vary.)) which apply to the text type ([[en:pojmy:atributy_strukturni|structural attributes]] ''[[en:pojmy:txtype_group|txtype_group]]'', ''[[en:pojmy:txtype|txtype]]'', ''[[en:pojmy:medium|med]]'', ''[[en:pojmy:srclang|srclang]]''), along with their frequency (the meaning of individual abbreviations is available at [[en:seznamy:index#zkratky_a_kody|the list of abbreviations and codes]])+
  
 The function **[[en:manualy:kontext:novy_dotaz#seznam_slov|New query → Word list]]** which generally applies to the entire corpus (not only to the specific concordance) allows for similar functionality. The function **[[en:manualy:kontext:novy_dotaz#seznam_slov|New query → Word list]]** which generally applies to the entire corpus (not only to the specific concordance) allows for similar functionality.
  
-===== Custom settings of frequency distribution  ===== 
  
-The form which appears after clicking on the option **Frequency distribution → Custom** consists of two sections:+===== Quick selection of frequency distributions =====
  
 +==== Lemmas ====
 +
 +Assesses the query ([[en:pojmy:kwic|KWIC]]) and lists all of the different types of lemmas (attribute [[en:pojmy:lemma|lemma]]), along with their frequency  ((This option is available only for the corpora that have been lemmatized)).
 +
 +==== Node forms [A=a] ====
 +
 +Assesses the query ([[en:pojmy:kwic|KWIC]]) and lists all of the different forms (attribute [[en:pojmy:word|word]] case insensitive), along with their frequency.
 +
 +==== Doc IDs ====
 +
 +Assesses the whole [[en:pojmy:konkordance|concordance]] and lists the text names ([[en:pojmy:atributy_strukturni|structural attributes]] ''name'') in which the wanted phenomenon occurs, along with the frequency of this phenomenon in the individual texts.
 +
 +==== Text types ====
 + 
 +Assesses the whole [[en:pojmy:konkordance|concordance]] and lists an overview of  the structural attributes ((The list of the structural attributes might vary depending on the type of corpus used. Accordingly, the result generated by this option might vary.)) which apply to the text type ([[en:pojmy:atributy_strukturni|structural attributes]] ''[[en:pojmy:txtype_group|txtype_group]]'', ''[[en:pojmy:txtype|txtype]]'', ''[[en:pojmy:medium|med]]'', ''[[en:pojmy:srclang|srclang]]''), along with their frequency (the meaning of individual abbreviations is available at [[en:seznamy:index#zkratky_a_kody|the list of abbreviations and codes]]).
 +
 +===== Frequency list =====
 +
 +The following example shows how to use a frequency list with the [[en:cnk:syn2020|SYN2020]] corpus and a query for the [[en:pojmy:lemma|lemma]] //dřevo// (''[lemma=%%"%%dřevo%%"%%]''): a frequency list of the word forms of the lemma //dřevo//, regardless of case and with a zero frequency limit (using the preset option **Node forms [A=a]**).
 +
 +==== Table view ====
 +
 +[{{ :en::manualy:kontext:fqdist-word-drevo_tab_en.png?direct&400|Frequency list of the words of lemma //dřevo// (including representations of confidence intervals) }}]
 +
 +The default display is a table showing the absolute and relative frequencies for each item (including the option to display confidence intervals).  
 +
 +Several kinds of information appear next to every [[en:pojmy:word|word]] (attribute) displayed in the frequency list of the lemma //dřevo//. The basic information is located in the frequency column, which displays the absolute frequency of a given item in the searched concordance (if the concordance was altered in some way before the frequency list was submitted -- e.g. with filters -- the frequency list will be altered accordingly). All items with at least one occurrence will be displayed in the list. If we want to narrow down the list, we can set **Minimum Frequency** to a value that suits the specific situation.
 +
 +The [[en:pojmy:ipm|i.p.m.]] column next to the absolute frequency column expresses the **relative frequency** of the studied phenomena relative to the total size of the corpus. In our case, the form //dřeva// appears in the corpus [[en:cnk:syn2020|SYN2020]] with an absolute frequency of 5,712, which represents 46.89 occurrences per million words (i.p.m.).
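
The relationship between the two columns can be sketched in a few lines of Python. The function names ''ipm'' and ''implied_corpus_size'' are illustrative assumptions and not part of KonText; only the figures 5,712 and 46.89 are taken from the example above.

```python
def ipm(absolute_frequency: int, corpus_size: int) -> float:
    """Relative frequency in instances per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return absolute_frequency / corpus_size * 1_000_000


def implied_corpus_size(absolute_frequency: int, ipm_value: float) -> float:
    """Corpus size implied by a pair of absolute frequency and i.p.m. values."""
    if ipm_value <= 0:
        raise ValueError("ipm_value must be positive")
    return absolute_frequency / ipm_value * 1_000_000


# Figures from the example above: the form "dřeva" in SYN2020.
size = implied_corpus_size(5712, 46.89)
print(f"implied corpus size: {size:,.0f} tokens")
print(f"round trip: {ipm(5712, round(size)):.2f} i.p.m.")
```

Because the i.p.m. value is rounded to two decimal places, the corpus size computed this way is only an approximation of the real one.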
 +
 +For both absolute and relative frequency values, an additional option can be used to display the values of **confidence intervals**, i.e. the ranges within which the given frequencies (with probability at a specified **confidence level**) would occur in other, similarly constructed corpora of comparable size. The confidence level is set at 95% and can be changed to 99% or 90%.
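
One common way to obtain such an interval for a frequency is the Wilson score interval for a binomial proportion. The sketch below is an assumption made for illustration: this manual does not state which method KonText uses, and the corpus size passed in the example is a hypothetical round number.

```python
import math

# Two-sided standard normal quantiles for the confidence levels offered in the form.
Z_SCORES = {0.90: 1.6449, 0.95: 1.9600, 0.99: 2.5758}


def wilson_interval(hits: int, corpus_size: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for the proportion hits / corpus_size."""
    if corpus_size <= 0 or not 0 <= hits <= corpus_size:
        raise ValueError("need corpus_size > 0 and 0 <= hits <= corpus_size")
    z = Z_SCORES[level]
    n = corpus_size
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical corpus of 100 million tokens; 5,712 hits as in the example above.
low, high = wilson_interval(5712, 100_000_000, 0.95)
print(f"{low * 1e6:.2f} to {high * 1e6:.2f} i.p.m.")
```

A higher confidence level produces a wider interval, which is why switching from 95% to 99% makes the displayed ranges longer.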
 +
 +To the left of each word in the list there are **p/n** links, which quickly apply a positive or negative [[en:manualy:kontext:filtr|filter]]. Clicking on **p** in the line showing the frequency of the word //dřeva// keeps only the occurrences of this form in the current concordance; clicking on **n** removes all occurrences of the given form from the current concordance.
 +
 +After clicking on the heading of the column, the table will automatically be rearranged according to the selected column. This way, it is possible to create a list that is arranged alphabetically (in addition to the usual list arranged according to the frequency).
 +
 +The **Share the table** function (the link is placed in the row above the table) generates a permanent link to the table, which can be sent directly from the form window to the specified e-mail address or later mentioned in an article, study, etc.
 +
 +==== Chart view ====
 +
 +The graphical display allows you to visualize the information presented in the previous section (absolute and relative frequencies of items with their confidence intervals) in the form of two types of graphs: either a horizontal **bar chart** or a "**word cloud**" graph.
 +
 +[{{:en::manualy:kontext:fqdist-word-drevo_en.png?direct&350|Visualization type: bar  }}]
 +\\
 +By default, a bar chart with relative frequencies including 95% confidence intervals is displayed. 
 +
 +By clicking on the options above the graph using **(+)** you can modify the properties of the graph. You can display absolute frequencies instead of relative frequency values, limit the number of items in the graph, sort items alphabetically instead of frequency sorting, and export the graph as an image.
 +
 +Finally, the graph can be switched to a "word cloud," which displays a group of examined items (in our example, word forms) in sizes corresponding relatively to their frequencies. For this type of graph, only the option to export the graph and limit the number of items in the graph are relevant in the user settings.
 +
 +[{{:en::manualy:kontext:fqdist-word-cloud_en.png?direct&350|Visualization type: Word cloud  }}]
 +\\
 +
 +===== Custom settings of frequency distribution  =====
  
-  - form for multilevel frequency distribution (which can be used to analyze [[en:pojmy:atributy_pozicni|positional attributes]]) such as word, lemma, sublemma, tag, verbtag, etc.)  +The form which appears after clicking on the menu item **Frequency → Custom** offers four options:
-  - form for frequency distribution according to the [[en:pojmy:atributy_strukturni|structure attributes]] (such as ''[[en:pojmy:txtype|txtype]]'', ''[[en:pojmy:medium|med]]'' or ''[[en:pojmy:srclang|srclang]]''+
-  - form for frequency distribution reflecting the two-attribute interrelationship (both positional and structure attributes)+
  
 +  - multilevel frequency distribution (which can be used to analyze [[en:pojmy:atributy_pozicni|positional attributes]] such as word, lemma, sublemma, tag, verbtag, etc.)
 +  - frequency distribution according to the [[en:pojmy:atributy_strukturni|structural attributes]] (such as ''[[en:pojmy:txtype|txtype]]'', ''[[en:pojmy:medium|med]]'' or ''[[en:pojmy:srclang|srclang]]'')
 +  - dispersion plot showing the distribution of the searched concordance across the entire corpus
 +  - 2-dimensional frequency distribution reflecting the relationship between two attributes (both positional and structural attributes)
  
 [{{ :en:manualy:kontext:fqdist-pozice_en.png?direct&300|Form for multilevel frequency distribution ([[en:pojmy:atributy_pozicni|positional attributes]]) }}] [{{ :en:manualy:kontext:fqdist-pozice_en.png?direct&300|Form for multilevel frequency distribution ([[en:pojmy:atributy_pozicni|positional attributes]]) }}]
Line 28: Line 79:
  
 Afterwards, it is necessary to select whether frequency distribution should be calculated regardless of the letter case. Selection of the option [[wp>Case_sensitivity|case-insensitive]] causes that all of the items are interpreted as having lower case, regardless of what type of case they actually have in the corpus.   Afterwards, it is necessary to select whether frequency distribution should be calculated regardless of the letter case. Selection of the option [[wp>Case_sensitivity|case-insensitive]] causes that all of the items are interpreted as having lower case, regardless of what type of case they actually have in the corpus.  
 +
 +[{{ :en:manualy:kontext:fqdist-reference_en.png?direct&300|Form for frequency distribution according to [[en:pojmy:atributy_strukturni|structural attributes]] }}]
  
 In case of custom settings of frequency distribution, we do not need to restrict ourselves to KWIC only (unlike when working with quick selection). It can be calculated from any context position to the right or left from the wanted word. The item //position// in the form enables us to select not only positions from the left (the preceding) context (6L-1L), but also KWIC itself and positions to the right (the following) context (1R-6R). The numbering of the positions (according to both current and older notation) is summed up in the following table: In case of custom settings of frequency distribution, we do not need to restrict ourselves to KWIC only (unlike when working with quick selection). It can be calculated from any context position to the right or left from the wanted word. The item //position// in the form enables us to select not only positions from the left (the preceding) context (6L-1L), but also KWIC itself and positions to the right (the following) context (1R-6R). The numbering of the positions (according to both current and older notation) is summed up in the following table:
Line 42: Line 95:
  
 If we wish to create a frequency distribution of not only individual units but also pairs of words ([[en:pojmy:bigram|bigrams]]) or even longer phrases, we have to add another level of frequency distribution. Another line will be added to the form with identical setting options. The quick option **Node forms** is an easier alternative: if we apply it to a multi-word KWIC (e.g. when searching for two consecutive adverbs such as //pomalu a opatrně// [tag=<nowiki>"</nowiki>D.*<nowiki>"</nowiki>][word=<nowiki>"</nowiki>a<nowiki>"</nowiki>][tag=<nowiki>"</nowiki>D.*<nowiki>"</nowiki>]), the multi-word expressions we are looking for will appear ordered by frequency without any complicated settings. If we wish to create a frequency distribution of not only individual units but also pairs of words ([[en:pojmy:bigram|bigrams]]) or even longer phrases, we have to add another level of frequency distribution. Another line will be added to the form with identical setting options. The quick option **Node forms** is an easier alternative: if we apply it to a multi-word KWIC (e.g. when searching for two consecutive adverbs such as //pomalu a opatrně// [tag=<nowiki>"</nowiki>D.*<nowiki>"</nowiki>][word=<nowiki>"</nowiki>a<nowiki>"</nowiki>][tag=<nowiki>"</nowiki>D.*<nowiki>"</nowiki>]), the multi-word expressions we are looking for will appear ordered by frequency without any complicated settings.
- 
-[{{ :en:manualy:kontext:fqdist-reference.png?direct&300|Form for frequency distribution according to [[en:pojmy:atributy_strukturni|structural attributes]] }}] 
  
 Provided that we are satisfied with the specification, we may begin the calculation by clicking on the **Make frequency list** button. All of the items with at least one occurrence  will appear in the basic settings. If we wish to narrow the list down, we may set **Frequency limit** to the value which satisfies the situation. Provided that we are satisfied with the specification, we may begin the calculation by clicking on the **Make frequency list** button. All of the items with at least one occurrence  will appear in the basic settings. If we wish to narrow the list down, we may set **Frequency limit** to the value which satisfies the situation.
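
The combined effect of the case-insensitive option and the frequency limit can be sketched as follows. This is a minimal illustration and not KonText's implementation; the function ''frequency_list'' and the sample word forms are hypothetical.

```python
from collections import Counter


def frequency_list(forms, case_insensitive=True, frequency_limit=0):
    """Count word forms and keep those that reach the frequency limit."""
    if case_insensitive:
        # Treat every item as lower case, whatever its case in the corpus.
        forms = (form.lower() for form in forms)
    counts = Counter(forms)
    kept = [(form, n) for form, n in counts.items() if n >= frequency_limit]
    # Sort by descending frequency, then by code point order for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical KWIC forms, invented for illustration only.
kwic_forms = ["dřeva", "Dřevo", "dřevo", "dřeva", "dřevem", "dřeva"]
for form, n in frequency_list(kwic_forms, frequency_limit=2):
    print(form, n)
```

With the case-insensitive option switched off, //Dřevo// and //dřevo// would be counted as two separate items.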
- 
  
 ==== Text Type frequency distribution ==== ==== Text Type frequency distribution ====
Line 56: Line 106:
 Even in this form we may set the frequency limit, if we wish to restrict the number of results in the list. With the option **Include categories with no hits** it is also possible to display those attributes in the list which did not appear in the concordance. Lemma //dřevo// has not once appeared in the songs (txtype [[en:seznamy:txtype|SON]]). Provided that this option is ticked, txtype SON will appear in the frequency distribution even with a zero frequency. Even in this form we may set the frequency limit, if we wish to restrict the number of results in the list. With the option **Include categories with no hits** it is also possible to display those attributes in the list which did not appear in the concordance. Lemma //dřevo// has not once appeared in the songs (txtype [[en:seznamy:txtype|SON]]). Provided that this option is ticked, txtype SON will appear in the frequency distribution even with a zero frequency.
  
-==== Two-attribute interrelationship frequency distribution ==== +=== Usage example: frequency list according to text types ===
-[{{ :en::manualy:kontext:2d-fqdist_en.png?direct&350|Result of a 2D frequency distribution}}] +
- +
-The last type of frequency distribution reflects the interrelationship of two selected attributes (positional as well as structural). As an example, we can look at which nominal adjectives (the so-called short forms, such as rád or schopen) are prominent in three basic text type groups. First, choose the **Two-attribute interrelationship** in the **Frequency** option in the menu (under **Custom**) and select two attributesfirst, choose **lemma** (displayed as rows in the table of results), and second, choose **doc.txtype_group** (among Text types, displayed as columns in the table). You can also adjust the minimal value or percentile of [[en:pojmy:frekvence|absolute or relative frequency]]. +
- +
- +
-After clicking on **Make frequency list**, a table of results is displayed summarizing the number of occurrences of the adjectives in three selected text type groups (fiction, non-fiction and journalistic texts), sorted by frequency. This default setting can be changed: you can re-sort the table by [[en:pojmy:ipm|ipm]], switch the orientation of rows and columns or opt for a list of attribute pairs. If you are an advanced user, you can also try to sort the rows based on three criteria (attribute value, the total of absolute/relative frequency in a row or in a column), set the confidence interval (CI) or temper with the color mapping (for further information, see the help question mark next to the **Color mapping** choice). If you choose the relative frequency display, you can also look at a graph with confidence intervals by clicking on the chart icon next to each variable. +
- +
- +
-===== Frequency list (summary) ===== +
- +
-[{{ :en:manualy:kontext:fqdist-word-drevo.png?direct&300|Frequency list of the words of lemma //dřevo// }}] +
- +
-The following examples show how to use frequency list when  working with the [[en:cnk:syn2010|SYN2010]] corpus to search for a query of [[en:pojmy:lemma|lemma]] //dřevo//+
-(''[lemma=%%"%%dřevo%%"%%]'').  +
-  - Frequency list of the words of lemma //dřevo// regardless of case and with a zero frequency limit. +
-  - Frequency distribution of the values of structural attributes ''txtype'' and ''txtype_group''  of lemma //dřevo// (including the values with zero frequency) +
-   +
-Different kinds of information will appear by every [[en:pojmy:word|word]] (attribute) displayed in the frequency list of lemma //dřevo//. The basic information is located in the frequency column and displays absolute frequency of a given item in the searched concordance (if the concordance was altered in some way before the frequency list was submitted - e.g. with filters - the frequency list will be altered accordingly). In the list to the left from the word, there are located links **p/n** which can be used for a quick display of positive or negative [[en:manualy:kontext:filtr|filter]]. By clicking on the **p** in the line displaying frequency for the word //dřevem//, we filter out this form from the current concordance,in the same way when **n** is activated, all of the occurrence of the given form will be eliminated from the current concordance. +
- +
-The last column of the frequency list contains a horizontal bar chart. It is used for completing the differences between absolute frequencies of the individual items (the length of the horizontal lines should correspond to the word frequency). +
- +
-After clicking on the heading of the column, the table will automatically be rearranged according to the selected column. This way it is possible to create a list that is arranged alphabetically (in addition to the usual list arranged according to the frequency).+
  
 [{{ :en:manualy:kontext:fqdist-txtype-drevo_en.png?direct&300|Frequency list of the text types and their groups of lemma //dřevo// }}] [{{ :en:manualy:kontext:fqdist-txtype-drevo_en.png?direct&300|Frequency list of the text types and their groups of lemma //dřevo// }}]
  
-The summary of frequency list arranged according to the **structural attributes** has slightly different structure. Both the column with absolute frequency and column enabling quick filtering remain the same (in some cases only the option of negative filter is disabled). +The following example shows how to use frequency list when  working with the [[en:cnk:syn2020|SYN2020]] corpus to search for a query of [[en:pojmy:lemma|lemma]] //dřevo// (''[lemma=%%"%%dřevo%%"%%]''): Frequency distribution of the values of structural attributes ''txtype'' and ''txtype_group''  of lemma //dřevo// (excluding the values with zero frequency).
  
-In the latest version, every item (the value of the selected structural attribute) also contains [[en:pojmy:ipm|i.p.m.]] value. It conveys the relative frequency of phenomena displayed in the concordance in relation to the overall size of the corpus part with a given value of structural attribute. In our example, lemma //dřevo// appears in the corpus [[en:cnk:syn2010|SYN2010]]  with a frequency of 3509 in specialized literature. Considering the overall ratio of specialized literature in the corpus (27%), it accounts for 107,9  of instances per million of words (i.p.m.). Even though the absolute frequency of lemma //dřevo// is comparable in fiction and specialized literature (3276 versus 3509), considering the difference in sizes of these two sections the relative frequency in specialized literature is almost twice as big (65,9 versus 107,9).+The frequency list arranged according to the **structural attributes** has the same structure as the list arranged according to the positional attributes. Here, the **[[en:pojmy:ipm|i.p.m.]] value** is of special importance. It conveys the relative frequency of the phenomenon displayed in the concordance in relation to the size of the corpus part with the given value of the structural attribute. In our example, the lemma //dřevo// appears in the corpus [[en:cnk:syn2020|SYN2020]] with a frequency of 3,566 in specialized literature. Given the share of specialized literature in the corpus (33%), this amounts to 88.55 instances per million words (i.p.m.).
    
-The difference between absolute and relative frequency is also shown in the horizontal bar charts. The length of the line represents relative frequency, while the width represents absolute frequency. They are useful for quick examination of the results. 
- 
 Just like the items, the structural attributes can also be rearranged in the table according to any column. This is especially useful when we need to know the order according to the relative frequency which allows for comparison of the number of occurrences even in the corpora of different sizes. Just like the items, the structural attributes can also be rearranged in the table according to any column. This is especially useful when we need to know the order according to the relative frequency which allows for comparison of the number of occurrences even in the corpora of different sizes.
  
-[{{:en::manualy:kontext:fqdist-word-drevo_tab_en.png?direct&350|Frequency list of the words of lemma //dřevo//  }}]+==== Dispersion ====
  
 +The [[pojmy:frekvence#disperze_jevu|Dispersion]] function allows you to graphically represent the distribution of a given searched phenomenon across the text/corpus. In the initial form you need to set the number of sections (maximum 1000) into which the corpus will be divided for the purpose of displaying the dispersion. The resulting graph then shows the number of occurrences of the searched phenomenon within each section on the y-axis.
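
The counting behind such a graph can be sketched as follows. The function ''dispersion'' and the token offsets are hypothetical, and splitting the corpus into equally sized sections by token position is an assumption; only the upper limit of 1000 sections comes from the description above.

```python
def dispersion(hit_positions, corpus_size, sections=100):
    """Count hits in each of `sections` equally sized parts of the corpus."""
    if not 1 <= sections <= 1000:
        raise ValueError("sections must be between 1 and 1000")
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    counts = [0] * sections
    for position in hit_positions:
        if not 0 <= position < corpus_size:
            raise ValueError(f"position {position} lies outside the corpus")
        # Integer arithmetic maps a token offset to its section index.
        counts[position * sections // corpus_size] += 1
    return counts


# Hypothetical token offsets of concordance lines in a 1,000-token corpus.
per_section = dispersion([3, 15, 18, 402, 999], corpus_size=1000, sections=10)
print(per_section)
```

Each element of the returned list corresponds to one point on the x-axis, and its value is the height plotted on the y-axis.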
  
-[{{:en::manualy:kontext:fqdist-word-drevo_en.png?direct&350|Frequency list of the words of lemma //dřevo//  }}]+[{{en:manualy:kontext:disperze.png?direct&450|Dispersion of the lemma //dřevo// (division into 100 sections) in SYN2020}}]
  
  
-[{{:en::manualy:kontext:fqdist-word-cloud_en.png?direct&350|Frequency list of the words of lemma //dřevo//  }}]+==== Two-attribute interrelationship frequency distribution ==== 
 + 
 +The last type of frequency distribution reflects the interrelationship of two selected attributes (positional as well as structural). As an example, we can look at which nominal adjectives (the so-called short forms, such as //rád// or //schopen//: ''%%[tag="AC..-.*"]%%'', excluding postprepositional forms of the type //na živo//) are prominent in three basic text type groups. First, choose the **Two-attribute interrelationship** in the **Frequency** option in the menu (under **Custom**) and select two attributes: first, choose **lemma** (displayed as rows in the table of results), and second, choose **doc.txtype_group** (among Text types, displayed as columns in the table). You can also adjust the minimal value or percentile of [[en:pojmy:frekvence|absolute or relative frequency]].
 + 
 +After clicking on **Make frequency list**, a table of results is displayed summarizing the number of occurrences of the adjectives in three selected text type groups (fiction, non-fiction and journalistic texts), sorted by frequency. This default setting can be changed: you can re-sort the table by [[en:pojmy:ipm|ipm]], switch the orientation of rows and columns or opt for a list of attribute pairs. If you are an advanced user, you can also try to sort the rows based on three criteria (attribute value, the total of absolute/relative frequency in a row or in a column), set the confidence interval (CI) or tamper with the color mapping (for further information, see the help question mark next to the **Color mapping** choice). If you choose the relative frequency display, you can also look at a graph with confidence intervals by clicking on the chart icon next to each variable.
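
The cross-tabulation behind such a table can be sketched as follows. The function ''two_dimensional_distribution'', the lemma values and the text type group codes are hypothetical and serve only to show how pairs of attribute values turn into rows and columns.

```python
from collections import Counter


def two_dimensional_distribution(pairs):
    """Cross-tabulate (row_value, column_value) pairs into a nested dict."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in counts})
    columns = sorted({column for _, column in counts})
    # Missing combinations are reported as 0 so every row has every column.
    return {row: {column: counts[(row, column)] for column in columns} for row in rows}


# Hypothetical (lemma, txtype_group) pairs, one per concordance line.
hits = [("rád", "FIC"), ("rád", "FIC"), ("rád", "NMG"), ("schopný", "NFC")]
table = two_dimensional_distribution(hits)
for lemma, row in table.items():
    print(lemma, row)
```

Switching the orientation of rows and columns, as offered in the interface, corresponds to swapping the two members of each pair before counting.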
  
 +[{{:en::manualy:kontext:2d-fqdist_en.png?direct&400|Result of a 2D frequency distribution}}]
 +\\
  
-[{{:en::manualy:kontext:fqdist-reference_en.png?direct&350|Frequency list of the words of lemma //dřevo//  }}] 
 ---- ----