Next revision | Previous revision |
en:pojmy:arf [2016/12/12 15:45] – created veronikapojarova | en:pojmy:arf [2016/12/12 16:22] (current) – [ARF values] veronikapojarova |
---|
Its definition is as follows: We use the letter //f// to label the frequency of a given word in the corpus. We divide the positions in the entire corpus into //f// sections of (as nearly as possible) equal size. If the total number of words in the corpus is divisible by //f//, the sections are all the same size; otherwise their sizes may differ by one position. The reduced frequency is then the number of sections in which the given word occurs at least once. | Its definition is as follows: We use the letter //f// to label the frequency of a given word in the corpus. We divide the positions in the entire corpus into //f// sections of (as nearly as possible) equal size. If the total number of words in the corpus is divisible by //f//, the sections are all the same size; otherwise their sizes may differ by one position. The reduced frequency is then the number of sections in which the given word occurs at least once. |
| |
První slovo z našeho příkladu bude mít redukovanou četnost buď 1, padnou-li všechny jeho výskyty do jednoho úseku, nebo 2, jestliže náhodou bude hranice mezi dvěma úseky uprostřed shluku výskytů. Druhé slovo bude mít redukovanou četnost mnohem vyšší. V krajním případě může být teoreticky redukovaná četnost stejná jako četnost, a to právě tehdy, když každý výskyt daného slova padne do jednoho úseku. Prakticky se toto většinou nestává, alespoň ne pro slova s vyšší četností. | The first word from our example will have a reduced frequency of either 1 (if all of its occurrences fall into one section) or 2 (if the boundary between two sections happens to fall in the middle of the cluster of occurrences). The second word will have a much higher reduced frequency. In the extreme case, the reduced frequency can theoretically equal the frequency, which happens exactly when every occurrence of the given word falls into a section of its own. In practice this rarely happens, at least not for words with higher frequencies. |
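The reduced frequency described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reduced_frequency`, the 0-based position list and the rule used to assign positions to sections are all hypothetical choices, not taken from any corpus manager.

```python
def reduced_frequency(positions, corpus_size):
    """Count the sections that contain at least one occurrence of the word.

    positions   -- 0-based corpus positions of the word (assumed input format)
    corpus_size -- total number of positions N in the corpus
    """
    f = len(positions)  # the frequency f is the number of occurrences
    if f == 0:
        return 0
    # Position p is assigned to section floor(p * f / N). This gives f sections
    # whose sizes differ by at most one position when N is not divisible by f.
    sections = {p * f // corpus_size for p in positions}
    return len(sections)


# A cluster inside one section, a cluster straddling a section boundary,
# and four evenly spread occurrences (corpus of 100 positions, f = 4).
print(reduced_frequency([10, 11, 12, 13], 100))
print(reduced_frequency([23, 24, 25, 26], 100))
print(reduced_frequency([0, 25, 50, 75], 100))
```

The three calls correspond to the cases in the text: a cluster that stays within one section, a cluster cut by a section boundary, and the extreme case in which every occurrence has a section of its own.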
| |
The average reduced frequency (ARF) is then derived from the reduced frequency in the sense that it takes into account all possible compilations of the corpus (the order of the texts in it). It is calculated as an average of the reduced frequency from all possible compilations of the corpus. | The average reduced frequency (ARF) is then derived from the reduced frequency in the sense that it takes into account all possible compilations of the corpus (the order of the texts in it). It is calculated as an average of the reduced frequency from all possible compilations of the corpus. |
===== ARF values ===== | ===== ARF values ===== |
| |
Protože //N// je dělitelné //f// pouze výjimečně, nabývá ARF typicky neceločíselných hodnot, což je pro upravené frekvence běžné. Hodnota ARF pro daný výraz je korekcí jeho frekvence založenou na rozložení jeho výskytů v korpusu: čím je rozložení rovnoměrnější, tím více se hodnota ARF blíží frekvenci a naopak; pro výrazy, jejichž výskyty jsou v korpusu soustředěny do jediného shluku, se hodnota ARF blíží jedné bez ohledu na frekvenci. | Because //N// is only rarely divisible by //f//, the ARF typically takes on non-integer values, which is common for adjusted frequencies. The ARF value for a given expression is a correction of its frequency based on the distribution of its occurrences in the corpus: the more even the distribution, the closer the ARF value is to the frequency, and vice versa; for expressions whose occurrences are concentrated in a single cluster in the corpus, the ARF is close to 1 regardless of frequency. |
| |
Maximální hodnota ARF je tedy rovna frekvenci (je-li $d_{i} = v$ pro všechna $i$, tj. jsou-li vzdálenosti mezi všemi výskyty daného výrazu shodné), její nejmenší hodnota je rovna jedné. | The maximum ARF value is equal to the frequency (if $d_{i} = v$ for all $i$, i.e. if the distance between all occurrences of the expression is the same), and its lowest possible value is equal to 1. |
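The two bounds can be checked numerically. The sketch below assumes the commonly cited distance-based formula $ARF = \frac{1}{v}\sum_{i=1}^{f}\min(d_{i}, v)$ with $v = N/f$, where the distances $d_{i}$ are taken cyclically (the first distance wraps around from the last occurrence to the first). The formula itself is not restated in this section, so treat it, together with the function name `arf` and the 0-based position list, as assumptions.

```python
def arf(positions, corpus_size):
    """Average reduced frequency from the distances between occurrences.

    positions   -- 0-based corpus positions of the expression (assumed format)
    corpus_size -- total number of positions N in the corpus
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average distance between occurrences
    pos = sorted(positions)
    # Cyclic distances: the first one wraps from the last occurrence to the first.
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [b - a for a, b in zip(pos, pos[1:])]
    # Each distance contributes at most v, so the sum is capped at f * v.
    return sum(min(d, v) for d in distances) / v


# Evenly spaced occurrences (every d_i equals v) versus a single tight cluster.
print(arf([0, 25, 50, 75], 100))
print(arf([10, 11, 12, 13], 100))
```

With evenly spaced occurrences every term equals $v$ and the result is the frequency itself; for a tight cluster only the wrap-around distance reaches $v$, so the result stays just above 1.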
| |
Hodnota ARF se pro frekventovaná slova s rovnoměrným rozložením výskytů pohybuje okolo třetiny jejich frekvence (specificky však jen pro frekvenci větší než 50 000), pro odborné termíny vyskytující se pouze v několika dokumentech ale může být i mnohonásobně (10-krát až 100-krát) menší než frekvence. ARF je ve srovnání s frekvencí mnohem méně náchylná na (ne)zařazení konkrétních textů do korpusu, a lépe tedy odpovídá intuitivně chápané běžnosti slov. | For high-frequency expressions with an even distribution of occurrences, the ARF value is approximately one third of their frequency (though specifically only for frequencies over 50 000); for technical terms occurring in only a few documents, however, it can be many times (10 to 100 times) lower than the frequency. Compared with the frequency, ARF is much less sensitive to the (non-)inclusion of specific texts in the corpus, and therefore corresponds better to the intuitive notion of how common a word is. |
| |
ARF je v českém prostředí známá díky její implementaci v někdejším korpusovém manažeru [[pojmy:korpusovy_manazer|Manatee/Bonito]] (dnes v rozhraní [[manualy:kontext:index|KonText]]), obstála také ve srovnání s ostatními běžně používanými upravenými frekvencemi a disperzními mírami.((Gries, S. T.: //Dispersions and adjusted frequencies in corpora//. In International Journal of Corpus Linguistics 13, 2008, 403–437.)) Mimoto se ARF prakticky osvědčila jako hlavní kritérium pro stanovení běžnosti slov při sestavování obou nejnovějších frekvenčních slovníků češtiny. | ARF is known in the Czech environment thanks to its implementation in the former corpus manager [[en:pojmy:korpusovy_manazer|Manatee/Bonito]] (today in the [[en:manualy:kontext:index|KonText]] interface); it has also performed well in comparison with other commonly used adjusted frequencies and dispersion measures.((Gries, S. T.: //Dispersions and adjusted frequencies in corpora//. In International Journal of Corpus Linguistics 13, 2008, 403–437.)) In addition, ARF has proven itself in practice as the main criterion for determining word commonness in the compilation of both of the most recent frequency dictionaries of Czech. |
| |
--- //M. Křen, V. Cvrček// | --- //M. Křen, V. Cvrček// |