Differences

This shows you the differences between two versions of the page.

--- en:pojmy:frekvence [2016/12/12 17:44] – [Využití a význam frekvence] veronikapojarova
+++ en:pojmy:frekvence [2020/08/10 16:40] (current) – [The use and significance of frequency] vaclavcvrcek
@@ Line 21: / Line 21: @@
 where //n// is the number of [[en:pojmy:typ|types]] in the corpus.
-===== Naměřená a očekávaná frekvence =====
+===== Measured and expected frequency =====
-Vedle hodnot, které v korpusu u jednotlivých jevů zjistíme, se pracuje také s hodnotami, které na základě externích informací (např. předchozích výzkumů prováděných na jiných datech) můžeme v korpusu očekávat. V anglické terminologii jde o rozdíl mezi hodnotami **O** (observed) a **E** (expected). Poměřováním těchto ukazatelů můžeme dospět k zjištění, zda je nebo není zkoumaný jev v korpusu nápadně frekventovaný, což může sloužit k identifikaci některých specifických jevů (např. [[pojmy:kolokace|kolokací]], [[pojmy:keyword|klíčových slov]] apod.).
+Aside from the values observed for individual phenomena in the corpus, we also work with values that we can expect in the corpus based on external information (e.g. previous research conducted on different data). These two kinds of values are denoted **O** (observed) and **E** (expected). By comparing them we can determine whether or not the given phenomenon is unusually frequent in the corpus, which can help identify certain specific phenomena (e.g. [[en:pojmy:kolokace|collocations]], [[en:pojmy:keyword|keywords]] etc.).
-Známe-li pravděpodobnost výskytu slova, můžeme pomocí jednoduchého vzorce zjistit, jaká je očekávaná frekvence tohoto slova v korpusu o dané délce.
+If we know the probability of a word's occurrence, we can use a simple formula to find the expected frequency of the given word in a corpus of a specified size.
 $ E = p(A) \times N $
-kde:
+where:
-  * //p(A)// je pravděpodobnost slova //A//
+  * //p(A)// is the probability of word //A//
-  * //N// je velikost korpusu v počtu [[pojmy:token|tokenů]]
+  * //N// is the size of the corpus in number of [[en:pojmy:token|tokens]]
-Pravděpodobnost jevu v populaci všech projevů nikdy přesně nepoznáme, můžeme ji však aproximovat relativní frekvencí zjištěnou v předchozích pozorováních na jiných datech, tedy v jiných korpusech. V korpusu [[cnk:syn2005|SYN2005]] tak např. můžeme zjistit pravděpodobnost výskytu [[pojmy:lemma|lemmatu]] //škola// z jeho frekvence (f = 47872) a z celkové velikosti tohoto korpusu (N = 122419382):
+We will never know the exact probability of a phenomenon in the population of all utterances, but it can be approximated by the relative frequency found in previous observations of different data (i.e. in other corpora). In the [[en:cnk:syn2005|SYN2005]] corpus we can therefore estimate the probability of occurrence of the [[en:pojmy:lemma|lemma]] //škola// ('school') from its frequency (f = 47872) and from the total size of the corpus (N = 122419382):
 $ p(\text{škola}) = \frac{f(\text{škola})}{N} = \frac{47872}{122419382} \approx 0.0003910492 \approx 3.91 \cdot 10^{-4} $
-Na základě této pravděpodobnosti můžeme vypočíst očekávanou frekvenci lemmatu //škola// v korpusu [[cnk:syn2010|SYN2010]] (N = 121667413).
+Based on this probability we can calculate the expected frequency of the lemma //škola// in the corpus [[en:cnk:syn2010|SYN2010]] (N = 121667413).
 $ E(\text{škola}) = p(\text{škola}) \times N = 0.0003910492 \times 121667413 \approx 47577.9 $
-Hledáním v korpusu SYN2010 můžeme snadno zjistit, jaká je reálná naměřená frekvence tohoto lemmatu:
+By searching the SYN2010 corpus we can easily find the actual frequency of this lemma:
 $ O(\text{škola}) = 51104 $
-Naměřené a očekávané hodnoty pak můžeme porovnávat, např. pomocí [[pojmy:chi2|chi2 testu]].
+The measured and expected values can then be compared, e.g. with the aid of the [[en:pojmy:chi2|chi2 test]].
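The calculation above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `expected_frequency` and the O/E ratio printed at the end are illustrative assumptions, not part of the original page, and the ratio is not a substitute for the chi2 test mentioned above.

```python
# Figures quoted in the text above.
f_syn2005 = 47_872        # frequency of the lemma "škola" in SYN2005
n_syn2005 = 122_419_382   # size of SYN2005 in tokens
n_syn2010 = 121_667_413   # size of SYN2010 in tokens
observed = 51_104         # measured frequency of "škola" in SYN2010


def expected_frequency(probability, corpus_size):
    """E = p(A) * N (hypothetical helper name)."""
    return probability * corpus_size


# Probability approximated by the relative frequency in SYN2005.
p_skola = f_syn2005 / n_syn2005

# Expected frequency in SYN2010, using the unrounded probability.
expected = expected_frequency(p_skola, n_syn2010)

# A simple, illustrative way to relate O and E.
ratio = observed / expected

print(f"p(škola) = {p_skola:.10f}")
print(f"E(škola) = {expected:.1f}")
print(f"O / E    = {ratio:.3f}")
```

Note that the expected value is computed from the unrounded probability; multiplying the rounded value 3.91e-4 by the corpus size gives a slightly different result.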
-===== Využití a význam frekvence =====
+===== The use and significance of frequency =====
-Frekvence jako základní veličina libovolné jednotky ([[pojmy:typ|typu]]) a languová (systémová) charakteristika se používá nejen k poměřování mezi alternujícími jevy (např. frekvence morfologických variant //bychom// a //bysme//, viz [[http://syd.korpus.cz/05xNuUX8.syn|SyD]]), ale slouží také ke konstruování slovníků (vymezení nejčetnějších slov jako jádra slovní zásoby), extrakci [[pojmy:kolokace|kolokací]], zhodnocení gramatických kategorií, identifikaci [[pojmy:keyword|klíčových slov]] v textech apod.
+Frequency as a fundamental characteristic of any unit ([[en:pojmy:typ|type]]) is used not only for determining the relations between alternating phenomena (e.g. the frequency of morphological variants //bychom// and //bysme//, as in [[http://syd.korpus.cz/05xNuUX8.syn|SyD]]), but it is also used in the process of dictionary compilation (e.g. in defining the most frequent words as core vocabulary), the extraction of [[en:pojmy:kolokace|collocations]], the evaluation of grammatical categories, the identification of [[en:pojmy:keyword|keywords]] in texts etc.
-Pro korektní interpretaci frekvence je třeba si uvědomit, že se jedná o bodový odhad četnosti jevu v celém jazyce. Každý korpus je více či méně přesnou aproximací zkoumané populace (= texty určitého druhu), a tudíž v různých korpusech vytvořených podle téže metodologie (i kdybychom byli schopni zaručit jejich plnou srovnatelnost) se bude frekvence zkoumaného jevu drobně lišit. K podchycení této variability hodnot slouží **[[wp>Confidence_interval|konfidenční intervaly]]**, které udávají rozmezí, v němž se skutečná četnost zkoumaného jevu s určitou pravděpodobností v populaci nachází. Pro zjištění konfidenčního intervalu využíváme [[wp>Binomial_distribution|binomické rozdělení]], vstupními hodnotami jsou frekvence jevu, velikost korpusu a hladina významnosti vyjadřující přípustnou míru omylu.
+In order to interpret frequency correctly it is necessary to realize that it is a point estimate of the frequency of a phenomenon in the entire language. Every corpus is a more or less precise approximation of the population in question (= texts of a certain kind), and therefore in different corpora created using the same methodology (even if it were possible to guarantee their full comparability) the frequency of the phenomenon under study will differ slightly. This variability can be captured using the **[[wp>Confidence_interval|confidence interval]]**, which gives the range that contains, with a certain probability, the actual frequency of the phenomenon in the population.
-<html>
+To compute the confidence interval we can use the corpus calculator **Calc** ([[https://www.korpus.cz/calc/?module=1|www.korpus.cz/calc]]), which derives the interval from a [[wp>Binomial_distribution|binomial distribution]]; the input values are the frequency of the phenomenon, the size of the corpus and the significance level (expressing a tolerable error rate).
-<iframe id="embedded-app" src="https://trost.korpus.cz/shiny/cvrcek/confintwiki/" frameborder="0" width="100%"></iframe>
-<script>
-(function() {
-  ////////////////////////////////////////////
-  // CONFIGURE THESE TO MATCH YOUR USE CASE //
-  ////////////////////////////////////////////
-  // this should be the root URL of the child frame (Shiny app) which you want
+The confidence interval around the measured frequency at the 0.95 confidence level (i.e. a significance level of 0.05) says that in an experiment encompassing an infinite number of comparable corpora of the same size, the frequency of the given phenomenon would fall within this interval in 95% of measurements. When conducting an analysis we should therefore always keep in mind that the actual frequency of a phenomenon can take any value from the confidence interval.
-  // to allow to send messages to the parent
-  var allowedOrigin = "https://trost.korpus.cz"
-  ///////////////////////
-  // END CONFIGURATION //
-  ///////////////////////
-  var embeddedApp = document.getElementById("embedded-app");
-  function resizeIframe(pixels) {
-      embeddedApp.style.height = pixels + "px";
-  }
-  // cross-browser compatible infrastructure
-  var eventMethod = window.addEventListener ? "addEventListener" : "attachEvent";
-  var eventer = window[eventMethod];
-  var messageEvent = eventMethod == "attachEvent" ? "onmessage" : "message";
-  // listen to message from iframe
-  eventer(messageEvent, function(e) {
-    if (e.origin === allowedOrigin) {
-      var key = e.message ? "message" : "data";
-      var data = e[key];
-      resizeIframe(data);
-    } else {
-      console.log("Was expecting a message from " + allowedOrigin + ", got " + e.origin + " instead.");
-    }
-  }, false);
-  // send message to iframe on window resize
-  window.onresize = function() {
-    embeddedApp.contentWindow.postMessage("parentWindowResized", "*");
-  };
-})();
-</script>
-</html>
-Konfidenční interval okolo naměřené (zjištěné) frekvence na hladině významnosti 0,95 říká, že v pokusu, který by zahrnoval nekonečné množství srovnatelných a stejně rozsáhlých korpusů, by frekvence hledaného jevu byla v 95 % měření v rámci tohoto intervalu. Při analýze bychom tedy měli vždy počítat s tím, že reálná frekvence jevu může nabývat kterékoli hodnoty z konfidenčního intervalu.
 === Examples ===
@@ Line 105: / Line 61: @@
 If we find 50 occurrences of a given phenomenon in a corpus of 100 million words (e.g. [[en:cnk:syn2015|SYN2015]]), the result should be interpreted as follows: in the population of texts which the corpus strives to represent, this phenomenon occurs between 37 and 66 times per 100 million words (with a 5% error rate, i.e. with the risk that the actual value lies outside the given interval).
-If we discover that the given pheomenon occurs in a corpus (e.g. in [[en:cnk:oral2008|ORAL2008]]) exactly three times, it means that in another fully comparable corpus the same could have an occurrence rate of up to 9 hits, or it could be absent completely (again with a 5% error rate).((Such low values also depend on the selected rounding up strategy.))
+If we discover that the given phenomenon occurs in a corpus (e.g. in [[en:cnk:oral2008|ORAL2008]]) exactly three times, it means that in another fully comparable corpus it could occur up to 9 times, or it could be absent completely (again with a 5% error rate).((Such low values also depend on the selected rounding strategy.))
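The intervals in these examples can be approximated with a short standard-library Python sketch. It uses the Clopper-Pearson ("exact") binomial interval, found by bisection. Whether the Calc tool uses exactly this method is an assumption. The function names are hypothetical, and the corpus size of 1,000,000 tokens in the second call is an assumed value for illustration only.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), summed term by term.

    Adequate for small counts x; not meant for large x."""
    if x < 0:
        return 0.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if x >= n else 0.0
    term = math.exp(n * math.log1p(-p))  # P(X = 0)
    total = term
    odds = p / (1.0 - p)
    for k in range(min(x, n)):
        term *= (n - k) / (k + 1) * odds  # P(X = k + 1) from P(X = k)
        total += term
    return min(total, 1.0)


def clopper_pearson(x, n, alpha=0.05, iters=200):
    """Two-sided Clopper-Pearson interval for a proportion x / n."""

    def solve(func, target, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if (func(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= x | p) = alpha / 2 (increasing in p).
    lower = 0.0 if x == 0 else solve(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2, True)
    # Upper bound: P(X <= x | p) = alpha / 2 (decreasing in p).
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, False)
    return lower, upper


# 50 hits in a corpus of 100 million tokens (figures from the text).
low50, high50 = clopper_pearson(50, 100_000_000)
print(low50 * 100_000_000, high50 * 100_000_000)

# 3 hits; the corpus size of 1,000,000 tokens is an assumption.
low3, high3 = clopper_pearson(3, 1_000_000)
print(low3 * 1_000_000, high3 * 1_000_000)
```

Rounding the first interval to whole occurrences gives the range of 37 to 66 quoted above. For the count of three, the lower bound lies below one occurrence, so whether it is reported as 0 or 1 depends on the rounding strategy.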
-===== Disperze jevů =====
+===== Dispersion of phenomena =====
-V některých případech je třeba absolutní nebo relativní frekvenci doplnit ještě informací o disperzi (rozložení) daného jevu napříč textem/korpusem. I relativně velmi frekventované jevy se můžou totiž vyskytovat pouze v omezeném okruhu textů nebo v určité části dokumentu. V takových případech může být samotná frekvence jako ukazatel běžnosti prostředku údajem nespolehlivým. Za účelem kvantifikace nerovnoměrnosti rozložení slov v korpusech se užívají různé míry disperze, z nichž nejjednodušší jsou založeny na počítání počtu dokumentů, v nichž se jednotka vyskytuje, nebo autorů, kteří jí použili. Sofistikovanější způsoby zjišťování disperze prostředků využívají průměrných dílčích frekvencí v rámci jednotlivých úseků textu/korpusu, příp. počítání variačního koeficientu, tedy poměru směrodatné odchylky frekvencí v jednotlivých částech k průměru těchto dílčích frekvencí (např. Juillandův koeficient D, srov. též [[pojmy:arf|ARF]]).
+In some cases it is necessary to supplement absolute or relative frequency with information about the dispersion of the given phenomenon throughout the text/corpus. Even phenomena which are relatively very frequent may appear only in a limited range of texts or in a certain part of a document. In such cases, frequency alone can be an unreliable indicator of how common a phenomenon is. In order to quantify the uneven distribution of words in corpora, various measures of dispersion are used, the simplest of which are based on counting the number of documents in which the unit appears, or the number of authors who used it. More sophisticated ways of measuring dispersion use average partial frequencies within individual sections of the text/corpus, or the coefficient of variation, i.e. the ratio of the standard deviation of frequencies in the individual sections to the average of these partial frequencies (e.g. Juilland's D coefficient, see also [[en:pojmy:arf|ARF]]).
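The two kinds of measure described above can be sketched in Python as follows. The sketch assumes the corpus has been split into parts of equal size and uses the commonly cited form of Juilland's D, D = 1 - V / sqrt(n - 1), where V is the coefficient of variation based on the population standard deviation and n is the number of parts. The function names and sample frequencies are illustrative assumptions.

```python
import math
import statistics


def document_range(part_freqs):
    """Simplest dispersion measure: number of parts containing the unit."""
    return sum(1 for f in part_freqs if f > 0)


def juilland_d(part_freqs):
    """Juilland's D for frequencies in n equal-sized corpus parts.

    Assumes equal-sized parts; 1.0 means a perfectly even distribution,
    0.0 means all occurrences fall into a single part."""
    n = len(part_freqs)
    mean = statistics.fmean(part_freqs)
    if n < 2 or mean == 0:
        raise ValueError("need at least two parts and one occurrence")
    # Coefficient of variation: standard deviation / mean.
    v = statistics.pstdev(part_freqs) / mean
    return 1.0 - v / math.sqrt(n - 1)


even = [10, 10, 10, 10]    # hypothetical: evenly spread unit
clumped = [40, 0, 0, 0]    # hypothetical: same frequency, one part only

print(document_range(even), juilland_d(even))
print(document_range(clumped), juilland_d(clumped))
```

Both sample units have the same total frequency (40), yet the dispersion values differ sharply, which is the situation in which frequency alone would be misleading.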
-==== Související odkazy ====
+==== Related links ====
 <WRAP round box 49%>
-[[pojmy:arf|ARF]] • [[pojmy:asociacni_miry|Asociační míry]] • [[pojmy:ipm|ipm]] • [[pojmy:zipf|Zipfovy zákony]]
+[[en:pojmy:arf|ARF]] • [[en:pojmy:asociacni_miry|Association measures]] • [[en:pojmy:ipm|ipm]] • [[en:pojmy:zipf|Zipf's laws]]
 </WRAP>
