In this lesson, we will focus on collocations, i.e. meaningful, fixed, syntagmatic sequences of two (or more) words in immediate proximity. The KonText interface allows us to create collocation lists for a given word, which helps us determine in what contexts the word or phenomenon typically occurs.
To identify collocations, association measures are used. The interface offers the following association measures: t-score, MI, MI3, log likelihood, min. sensitivity, logDice, MI.log_f and relative frequency. It is recommended to combine the measures: each is calculated differently and favours different kinds of associations, so no single measure is suitable in every case. Additionally, the scores produced by different association measures cannot be directly compared.
Searching the corpus
To search for collocations, we first need to create a concordance of the word whose collocations we want to find. In the case of the OBC, you may be interested in, for example, the words or phrases used to describe men or women in the proceedings, or whether particular attributes were more commonly ascribed to a specific gender (to read more about gender in the proceedings, visit here).
Let’s create a concordance of the word boy. Select the OBC from the corpora list and set the query type to Basic. Search for the form boy and, when the concordance appears, click on Collocations in the top menu, then click on Custom. A form for creating a collocation list appears, in which you can specify the values used when searching for collocations.
Once you are satisfied with your selection, click on the Make candidate list button. It should be noted that the interface does not provide a list of collocations per se, but rather a list of collocation candidates; as mentioned above, each measure is calculated differently, and it is up to the researcher to decide whether a potential collocate is relevant.
Try rearranging the list by sorting according to different association measures. We have selected logDice for our first sorting value. This measure is based only on the frequency of the node (key word), the frequency of the collocate and the frequency of the whole collocation; it is unaffected by the size of the corpus and is thus suitable for comparing results from corpora of different sizes. The MI measure is prone to overestimating low-frequency items; hence words like leetel and Scald-head appear in the first rows. The t-score measure, on the other hand, prefers words with high frequency; therefore the first rows are occupied mostly by function words, such as the, a and and, and by punctuation. Sorting by absolute frequency gives results that mostly coincide with those of the t-score.
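To see why the measures rank candidates so differently, it can help to compute a few of them by hand. The sketch below uses the commonly cited textbook definitions of MI, MI3, t-score and logDice; whether KonText computes them in exactly this way is an assumption, and all counts are hypothetical rather than taken from the OBC. Here f_ab is the frequency of the co-occurrence, f_a the frequency of the node, f_b the frequency of the collocate and n the corpus size.

```python
import math


def mi(f_ab, f_a, f_b, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_ab * n / (f_a * f_b))


def mi3(f_ab, f_a, f_b, n):
    """MI with the co-occurrence frequency cubed, to damp the bias towards rare items."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


def t_score(f_ab, f_a, f_b, n):
    """Observed minus expected co-occurrence, scaled by sqrt of the observed value."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)


def log_dice(f_ab, f_a, f_b):
    """logDice: depends only on the three frequencies, not on corpus size."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))


# Hypothetical counts (assumptions for illustration, not real OBC figures).
N = 20_000_000        # corpus size in tokens
F_NODE = 10_000       # frequency of the node word
CANDIDATES = {
    "rare collocate": (3, 3),                    # (collocate freq, co-occurrence freq)
    "frequent function word": (1_000_000, 6_000),
}

for label, (f_b, f_ab) in CANDIDATES.items():
    print(
        f"{label}: "
        f"MI={mi(f_ab, F_NODE, f_b, N):.2f}, "
        f"MI3={mi3(f_ab, F_NODE, f_b, N):.2f}, "
        f"t-score={t_score(f_ab, F_NODE, f_b, N):.2f}, "
        f"logDice={log_dice(f_ab, F_NODE, f_b):.2f}"
    )
```

With these made-up counts, the rare collocate receives the higher MI while the frequent function word receives the higher t-score, which mirrors the behaviour described above. Note also that log_dice takes no corpus size argument at all, whereas doubling n raises MI by exactly 1.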
Task:
Find out which words frequently follow the adjectives modest and powerful.
You can find the solution here.
And that's it. Back to the OBC main page