Příručka ČNK - en:pojmy

Báze znalostí z korpusové lingvistiky (Knowledge base of corpus linguistics)
http://wiki.korpus.cz/
2026-07-30T05:07:34+00:00

anotace_mwe
2026-01-25T22:48:11+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:anotace_mwe?rev=1769381291&do=diff
Annotation of Multiword Expressions
Specialized tools are being developed for the automatic identification of multiword expressions (phrasemes and collocations) in corpora.
MWE lemmatization and tagging
Starting with the SYNv14 corpus, multiword expressions are annotated in corpora using new lemmas and tags linked to the

arf
2016-12-12T15:22:22+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:arf?rev=1481556142&do=diff
ARF (average reduced frequency)
ARF is one of the many adjusted frequencies of a word form in a corpus. Adjusted frequencies correct the simple frequency (number of occurrences) of a given word or phenomenon according to how evenly its occurrences are distributed across the corpus, i.e. they take dispersion into account.
$$ARF = \frac{1}{v} \sum_{i=1}^{f} \min (d_{i}, v)$$
where $f$ is the frequency of the word, $N$ is the size of the corpus, $d_{i}$ is the distance between consecutive occurrences of the word, and $v = \frac{N}{f}$ is the average distance between occurrences. If $d_{i} = v$ for every $i$, the ARF equals $f$.

din
2019-10-15T18:59:09+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:din?rev=1571165949&do=diff
DIN
The DIN (Difference index) is a so-called effect-size metric, i.e. a measure designed to quantify the relevance of differences between values.
The DIN is implemented in the KWords tool for extracting prominent units (keywords) from a text.
$$DIN = 100 \times \frac{RelFq(Ttxt) - RelFq(RefC)}{RelFq(Ttxt) + RelFq(RefC)}$$

dotazovaci_jazyk
2020-12-21T18:05:37+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:dotazovaci_jazyk?rev=1608573937&do=diff
Query language
Query languages are used to query database systems in information technology; every system uses a query language with a precisely defined syntax. When working with language corpora, the query language is used for inputting queries into

frekvence
2020-08-10T14:40:21+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:frekvence?rev=1597070421&do=diff
Frequency
In corpus linguistics, frequency is the number of times a given form or phenomenon occurs in the corpus. It is given either as an absolute value, e.g. the lemma pes occurs 17,701 times in the 100-million-word corpus SYN2010, or as a relative value, e.g. the lemma
$REL = \frac{ABS}{N} \times 1000000$
$rr = \frac{r}{n}$
$$ E = p(A) \times N $$
$$ p(\text{škola}) = \frac{f(\text{škola})}{N} = \frac{47872}{122419382} = 0.0003910492 = 3.91 \cdot 10^{-4} $$
$$ E(\text{škola}) = p(\text{škola}) \time…

ipm
2016-06-13T07:36:18+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:ipm?rev=1465803378&do=diff
ipm
The abbreviations ipm (instances per million) and ppm (parts per million) are measures of relative frequency. They express the average number of occurrences of a unit or word in a hypothetical text or corpus of 1 million words. E.g.
The node form běžeckých

konkordance
2016-06-13T08:24:28+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:konkordance?rev=1465806268&do=diff
Concordance
A concordance lists all occurrences of the searched phenomenon in the corpus along with their surrounding context. In practice, within the concordance we single out the KWIC (i.e. key word in context), which is the searched word or phenomenon, and its left and right context. One line of the concordance list is called a concordance line.

kwic
2024-02-08T09:15:12+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:kwic?rev=1707383712&do=diff
KWIC
KWIC is the English abbreviation of key word in context, which is used to label the search term (or a sequence of terms) in contexts of various sizes. The Czech equivalent keyword is homonymous with the term denoting items which are prominent thanks to their frequency in the text, serving as a basis for text analysis. (

lemma
2022-04-20T12:07:42+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:lemma?rev=1650456462&do=diff
Lemma
A lemma is a representative dictionary form of a word. In the process of lemmatization during automatic language processing, it is the form assigned to every form of the given word in the corpus. Approaches to lemmatization can differ in specific details, but it is generally the case that:

lexikalni_bohatost
2024-10-18T19:07:46+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:lexikalni_bohatost?rev=1729278466&do=diff
Lexical Diversity
InterCorp release 16ud is annotated with the following two measures of lexical diversity.
They are specified as metadata for each text of sufficient length, for each linguistically annotated language:
* lexDivWord: average number of different word forms per 1000 tokens

prehled_pojmu
2024-06-21T20:41:09+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:prehled_pojmu?rev=1719002469&do=diff
Corpus linguistics – key terminology
See, e.g., Corpus Linguistics Glossary (Kent State University, Ohio) or A Glossary of Corpus Linguistics (by Paul Baker, Andrew Hardie and Tony McEnery, Edinburgh University Press, 2006).
C: Complexity
D: Diversity
L: Lexical Diversity
S: Syntactic Complexity
U: UD, Universal Dependencies

regularni_vyrazy
2016-05-03T09:41:25+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:regularni_vyrazy?rev=1462268485&do=diff
Regular expressions
Regular expressions (the term comes from the theory of formal languages, but its meaning as used in IT is slightly different) allow us to describe accurately the set of text strings matching the search term or phenomenon. For these purposes,

syntakticka_analyza
2026-01-19T09:45:47+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_analyza?rev=1768815947&do=diff
Syntactic analysis and syntactic tagging
Some CNC corpora (the first of which is SYN2015) are syntactically annotated, marking dependency relations between two words in a sentence and the analytical functions of individual words. This syntactic annotation is based on the principles of the analytical-layer annotation used in the

syntakticka_komplexita
2024-10-18T18:39:52+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita?rev=1729276792&do=diff
Syntactic Complexity
InterCorp release 16ud is annotated with several measures of syntactic complexity.
They are specified as metadata for each sentence and each text, for each linguistically annotated language. In KonText, they can be displayed and queried like any other metadata items, such as text author or sentence ID.

tag
2026-01-16T11:04:36+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:tag?rev=1768561476&do=diff
Morphological tags
A morphological tag (commonly called a tag) is a summary of the grammatical information about a specific word (position) in the given context. A tag is usually generated automatically, based on a morphological analysis and a subsequent disambiguation. Tags are positional attributes. A morphological tag in the Czech CNC corpora consists of a sequence of symbols (letters and numbers) whose meaning depends on the position they occupy in the code. In the Czech…

ud
2024-10-08T19:50:31+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:ud?rev=1728417031&do=diff
Universal Dependencies – UD
Universal Dependencies is an open international project aiming at linguistic annotation that is consistent across different languages. Some recent versions of the InterCorp parallel corpus (13ud and 16ud) have been annotated for morphological categories, syntactic functions and syntactic structure, following the UD guidelines and using the tools developed within the UD project.

word
2016-12-08T13:04:23+00:00
http://wiki.korpus.cz/doku.php/en:pojmy:word?rev=1481202263&do=diff
Word form (word)
A word form (known as a word in corpus terminology) is a unit which remains morphologically (and possibly also orthographically) specific. In its generality it stands between a token and a lemma. While a token is one specific realization of a given unit, a word form is a standardized unit; a
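
The formulas quoted in the arf, din, frekvence and ipm entries above can be sketched in Python as follows. This is only an illustrative sketch, not the code used by any CNC tool. The function names arf, din and ipm, the 0-based token positions, and the cyclic treatment of the distances d_i (the first distance wraps around the end of the corpus) are assumptions that the entries themselves do not spell out.

```python
def arf(positions, corpus_size):
    """Average reduced frequency: (1/v) * sum(min(d_i, v)), with v = N / f.

    positions   -- token offsets of the occurrences (assumed 0-based)
    corpus_size -- N, the number of tokens in the corpus
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    pos = sorted(positions)
    # Assumption: distances are cyclic, so the first distance spans from
    # the last occurrence, around the end of the corpus, to the first one.
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [b - a for a, b in zip(pos, pos[1:])]
    return sum(min(d, v) for d in distances) / v


def din(rel_fq_text, rel_fq_ref):
    """DIN = 100 * (RelFq(Ttxt) - RelFq(RefC)) / (RelFq(Ttxt) + RelFq(RefC))."""
    total = rel_fq_text + rel_fq_ref
    if total == 0:
        raise ValueError("DIN is undefined when both relative frequencies are 0")
    return 100 * (rel_fq_text - rel_fq_ref) / total


def ipm(abs_freq, corpus_size):
    """Relative frequency per million tokens: REL = ABS / N * 1000000."""
    return abs_freq / corpus_size * 1_000_000


if __name__ == "__main__":
    # Four occurrences spread evenly over 100 tokens vs. four clustered ones.
    print(arf([0, 25, 50, 75], 100))
    print(arf([0, 1, 2, 3], 100))
    # A unit found only in the target text, then only in the reference corpus.
    print(din(10.0, 0.0), din(0.0, 10.0))
    # The škola figures quoted in the frekvence entry.
    print(ipm(47872, 122419382))
```

With evenly spaced occurrences every d_i equals v, so arf returns the plain frequency (4.0 in the first call); the clustered case drops to about 1.12. din ranges from -100 (unit present only in the reference corpus) to 100 (unit present only in the target text).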