en:pojmy:frekvence [2016/12/12 18:28] – [Disperze jevů] veronikapojarova | en:pojmy:frekvence [2020/08/10 16:40] (current) – [The use and significance of frequency] vaclavcvrcek |
---|
* //N// is the size of the corpus in numbers of [[en:pojmy:token|tokens]] | * //N// is the size of the corpus in numbers of [[en:pojmy:token|tokens]] |
| |
We will never know the exact probability of the phenomenon in a population of all manifestations, but it can be approximated by the relative frequency discovered in previous comparisons using different data (other corpora). In the [[en:cnk:syn2005|SYN2005]] corpus we can therefore determine the probability of the occurrence of the [[en:pojmy:lemma|lemma]] //škola// from its frequency (f = 47872) and from the total size of the corpus (N = 122419382): | We will never know the exact probability of the phenomenon in a population of all manifestations, but it can be approximated by the relative frequency discovered in previous comparisons using different data (other corpora). In the [[en:cnk:syn2005|SYN2005]] corpus we can therefore determine the probability of the occurrence of the [[en:pojmy:lemma|lemma]] //škola// ('school') from its frequency (f = 47872) and from the total size of the corpus (N = 122419382): |
| |
$ p(\text{škola}) = \frac{f(\text{škola})}{N} = \frac{47872}{122419382} = 0.0003910492 = 3.91 \cdot 10^{-4} $ | $ p(\text{škola}) = \frac{f(\text{škola})}{N} = \frac{47872}{122419382} = 0.0003910492 = 3.91 \cdot 10^{-4} $ |
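The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names `relative_frequency` and `ipm` are our own and do not come from any corpus tool. The second function rescales the relative frequency to instances per million tokens (ipm), which is easier to read than a very small probability.

```python
def relative_frequency(f: int, n: int) -> float:
    """Relative frequency p = f / N of a phenomenon in a corpus of N tokens."""
    if n <= 0:
        raise ValueError("corpus size N must be positive")
    if not 0 <= f <= n:
        raise ValueError("frequency f must lie between 0 and N")
    return f / n


def ipm(f: int, n: int) -> float:
    """Relative frequency rescaled to instances per million tokens."""
    return relative_frequency(f, n) * 1_000_000


# Values for the lemma "škola" in SYN2005, taken from the text above.
F_SKOLA = 47872
N_SYN2005 = 122419382

p_skola = relative_frequency(F_SKOLA, N_SYN2005)
ipm_skola = ipm(F_SKOLA, N_SYN2005)

print(f"p(škola) = {p_skola:.10f}")
print(f"ipm(škola) = {ipm_skola:.2f}")
```

Because ipm does not depend on the size of the corpus, it is the more convenient value when frequencies from corpora of different sizes are compared.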
===== The use and significance of frequency ===== | ===== The use and significance of frequency ===== |
| |
Frequency as a fundamental value of an arbitrary ([[en:pojmy:typ|type]]) and langue (system) characteristic is used not only for determining the relations between alternating phenomena (e.g. the frequency of morphological variants //bychom// and //bysme//, as in [[http://syd.korpus.cz/05xNuUX8.syn|SyD]]), but it also serves the compilation of dictionaries (defining the most frequent words as core vocabulary), the extraction of [[en:pojmy:kolokace|collocations]], the evaluation of grammatical categories, the identification of [[en:pojmy:keyword|keywords]] in texts etc. | Frequency, as a fundamental characteristic of any unit ([[en:pojmy:typ|type]]), is used not only for determining the relations between alternating phenomena (e.g. the frequency of the morphological variants //bychom// and //bysme//, as in [[http://syd.korpus.cz/05xNuUX8.syn|SyD]]), but also in dictionary compilation (e.g. in defining the most frequent words as core vocabulary), the extraction of [[en:pojmy:kolokace|collocations]], the evaluation of grammatical categories, the identification of [[en:pojmy:keyword|keywords]] in texts, etc. |
| |
In order to interpret frequency correctly it is necessary to realize that it is a point estimate if the frequency of phenomena in the entire language. Every corpus is a more or less precise approximation of the population in question (=texts of a certain nature), and therefore in different corpora created using the same methodology (even if it were possible to guarantee their full comparability) the frequencies of the desired phenomenon will differ slightly. This variability can be captured using the **[[wp>Confidence_interval|confidence interval]]** which gives the span containing (with a certain probability) the frequency of a given phenomenon. For finding out the confidence interval we use a [[wp>Binomial_distribution|binomial distribution]], the input values being the frequency of the phenomenon, the size of the corpus and the significance level (expressing a tolerable error rate). | In order to interpret frequency correctly it is necessary to realize that it is a point estimate of the frequency of phenomena in the entire language. Every corpus is a more or less precise approximation of the population in question (=texts of a certain domain), and therefore in different corpora created using the same methodology (even if it were possible to guarantee their full comparability) the frequencies of the desired phenomenon will differ slightly. This variability can be captured using the **[[wp>Confidence_interval|confidence interval]]** which gives the span containing (with a certain probability) the frequency of a given phenomenon. |
| |
<html> | To find the confidence interval, we use the corpus calculator **Calc** ([[https://www.korpus.cz/calc/?module=1|www.korpus.cz/calc]]), which calculates the interval using a [[wp>Binomial_distribution|binomial distribution]]; the input values are the frequency of the phenomenon, the size of the corpus and the significance level (expressing a tolerable error rate). |
<iframe id="embedded-app" src="https://trost.korpus.cz/shiny/cvrcek/confintwiki/" frameborder="0" width="100%"></iframe> | |
<script> | |
(function() { | |
//////////////////////////////////////////// | |
// CONFIGURE THESE TO MATCH YOUR USE CASE // | |
//////////////////////////////////////////// | |
| |
// this should be the root URL of the child frame (Shiny app) which you want | The confidence interval around the measured frequency at the confidence level of 0.95 (i.e. a significance level of 0.05) says that in an experiment encompassing an infinite number of comparable corpora of the same size, the frequency of the given phenomenon would fall within this interval in 95% of measurements. When conducting our analysis, we should always be aware that the actual frequency of a phenomenon can take any value within the confidence interval. |
// to allow to send messages to the parent | |
var allowedOrigin = "https://trost.korpus.cz" | |
| |
/////////////////////// | |
// END CONFIGURATION // | |
/////////////////////// | |
| |
var embeddedApp = document.getElementById("embedded-app"); | |
| |
function resizeIframe(pixels) { | |
embeddedApp.style.height = pixels + "px"; | |
} | |
| |
// cross-browser compatible infrastructure | |
var eventMethod = window.addEventListener ? "addEventListener" : "attachEvent"; | |
var eventer = window[eventMethod]; | |
var messageEvent = eventMethod == "attachEvent" ? "onmessage" : "message"; | |
| |
// listen to message from iframe | |
eventer(messageEvent, function(e) { | |
if (e.origin === allowedOrigin) { | |
var key = e.message ? "message" : "data"; | |
var data = e[key]; | |
resizeIframe(data); | |
} else { | |
console.log("Was expecting a message from " + allowedOrigin + ", got " + e.origin + " instead."); | |
} | |
}, false); | |
| |
// send message to iframe on window resize | |
window.onresize = function() { | |
embeddedApp.contentWindow.postMessage("parentWindowResized", "*"); | |
}; | |
})(); | |
</script> | |
</html> | |
| |
The confidence interval around the measured frequency on the significance level of 0,95 says that in an experiment which would encompass an infinite number of comparable corpora of the same size, the frequency of the given phenomenon would within this interval in 95% of measurements. When conducting our analysis we should always be aware that the actual frequency of a phenomenon can acquire any value from the confidence interval. | |
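For readers who want to reproduce such an interval themselves, the following is a minimal sketch. It uses the Wilson score interval, a common normal approximation to the binomial distribution. This is our own choice for illustration; we do not know which exact method the online calculator uses, so its results may differ slightly. The function name `wilson_interval` and the parameter `alpha` (the significance level) are likewise our own.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(f: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval for the relative frequency f / n.

    alpha is the significance level (tolerable error rate);
    alpha = 0.05 corresponds to a 95% confidence interval.
    """
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("need N > 0 and 0 <= f <= N")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = f / n

    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Lemma "škola" in SYN2005 (values from the text above).
low, high = wilson_interval(47872, 122419382)
print(f"95% CI in ipm: {low * 1e6:.2f} to {high * 1e6:.2f}")
```

A lower `alpha` (e.g. 0.01) gives a wider interval, since we tolerate a smaller error rate.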
| |
=== Examples === | === Examples === |
===== Dispersion of phenomena ===== | ===== Dispersion of phenomena ===== |
| |
In some cases it is necessary to supplement absolute or relative frequency with information about the dispersion of the given phenomenon throughout the text/corpus. Even phenomena which are relatively very frequent can appear only in a limited range of texts or in certain parts of the document. In such cases, the frequency itself can be an unreliable indicator of conventionality. In order to quantify the uneven distribution of words in corpora, various measures of dispersion are used, the simplest of which are based on counting the number of documents in which the unit appears, or authors who used it. Sofistikovanější způsoby zjišťování disperze prostředků využívají průměrných dílčích frekvencí v rámci jednotlivých úseků textu/korpusu, příp. počítání variačního koeficientu, tedy poměru směrodatné odchylky frekvencí v jednotlivých částech k průměru těchto dílčích frekvencí (např. Juillandův koeficient D, srov. též [[pojmy:arf|ARF]]). | In some cases it is necessary to supplement absolute or relative frequency with information about the dispersion of the given phenomenon throughout the text/corpus. Even phenomena which are relatively very frequent can appear only in a limited range of texts or in certain parts of the document. In such cases, the frequency itself can be an unreliable indicator of conventionality. In order to quantify the uneven distribution of words in corpora, various measures of dispersion are used, the simplest of which are based on counting the number of documents in which the unit appears, or the number of authors who used it. More sophisticated ways of obtaining information about dispersion include using average partial frequencies within individual sections of the text/corpus, or calculating the coefficient of variation, i.e. the ratio of the standard deviation of frequencies in the individual sections to the average of these partial frequencies (e.g. Juilland's D coefficient, see also [[en:pojmy:arf|ARF]]). |
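The measures mentioned above can be illustrated with a short sketch. It assumes that the corpus has been divided into sections of equal size and that we already have the frequency of the unit in each section. The function names are our own, and Juilland's D is given in its usual textbook form, D = 1 − V / √(n − 1), where V is the coefficient of variation and n the number of sections.

```python
from math import sqrt
from statistics import mean, pstdev


def document_range(part_freqs: list[int]) -> int:
    """Simplest dispersion measure: number of sections containing the unit."""
    return sum(1 for f in part_freqs if f > 0)


def variation_coefficient(part_freqs: list[int]) -> float:
    """Standard deviation of the partial frequencies divided by their mean."""
    average = mean(part_freqs)
    if average == 0:
        raise ValueError("the unit does not occur in any section")
    # Population standard deviation: the sections are the whole corpus.
    return pstdev(part_freqs) / average


def juilland_d(part_freqs: list[int]) -> float:
    """Juilland's D for equally sized sections (1 = perfectly even)."""
    n = len(part_freqs)
    if n < 2:
        raise ValueError("at least two sections are needed")
    return 1 - variation_coefficient(part_freqs) / sqrt(n - 1)


even = [10, 10, 10, 10]    # same total frequency, spread evenly
clumped = [40, 0, 0, 0]    # same total frequency, one section only

print(document_range(even), juilland_d(even))
print(document_range(clumped), juilland_d(clumped))
```

Both hypothetical units have the same total frequency of 40, but the first is spread evenly (D equal to 1), while the second occurs in a single section only (D close to 0).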
| |
==== Související odkazy ==== | ==== Related links ==== |
| |
<WRAP round box 49%> | <WRAP round box 49%> |
[[pojmy:arf|ARF]] • [[pojmy:asociacni_miry|Asociační míry]] • [[pojmy:ipm|ipm]] • [[pojmy:zipf|Zipfovy zákony]] | [[en:pojmy:arf|ARF]] • [[en:pojmy:asociacni_miry|Association measures]] • [[en:pojmy:ipm|ipm]] • [[en:pojmy:zipf|Zipf's laws]] |
</WRAP> | </WRAP> |