This is an old revision of the document!
Lesson 8: Collocations
In this lesson, we will focus on collocations, i.e. meaningful, fixed, syntagmatic sequence of two (or more) words in the immediate proximity. The KonText interface allows us to create collocation lists for the given word, which enables us to determine in what contexts the word or phenomenon typically occurs.
To identify a collocation, association measures are used. The interface employs the following association measures: t-score, MI, MI3, log likelihood, min. sensitivity, logDice, MI.log_f, relative frequency. It is recommended to combine the measures, as each functions differently and favours different kinds of associations, not every single one may be suitable every time. Additionally, the scores produced by different association measures cannot be directly compared.
Searching the corpus
To search for collocations, it is necessary to create a concordance of the word the collocations of which we want to find. In the case of the OBC, you may be interested in, for example, the words or phrases used to describe men or women in the proceedings, or whether particular attributes were more commonly ascribed to a specific gender (to read more about gender in the proceedings, visit here).
Let’s create a concordance of the word boy. Select the OBC from the corpora list and set the query type on Basic. Search for the form boy and when the concordance appears, click on Collocations in the top menu, then click on Custom. A form for creating a collocations list appears, in which you can specify the values used for searching for the collocations.
- Attribute: You can select either word or tag; the collocation list will then consist of either specific words or part-of-speech tags.
- Collocation window span: Specifies the proximity to the key word, the default value is -3 to 3, which means all the words which occur in the first, second and third positions to the left and to the right of the key word will be considered.
- Minimum collocate frequency in the corpus: Determines the least number of occurrences in the concordance for the word/tag to be included on the collocations list. The default minimum frequency is 3, which means that forms with fewer occurrences in the concordance will not be included in the list of collocates.
- Minimum collocate frequency in the span: Determines how frequently an item should co-occur with the key word for it to be included on the list.
- Collocation measures: Here, you can select which association measures will be calculated and employed in the search for collocations and according to which the list should be sorted.
Once you are satisfied with your selection, you can click on the Make candidate list button. It should be noted here, that the interface does not provide you with a list of collocations, per se, but rather with candidates for collocations; as it was mentioned above, each measure is calculated differently and it is then up to the researcher to decide on the ir/relevancy of the potential collocate.
Try rearranging the list by sorting according to different association measures. We have selected logDice for our first sorting value. This measure is based only on the frequency of the node (key word) and the collocate and the frequency of the whole collocation; it is unaffected by the size of the corpus, it is thus suitable for comparing results from corpora of different sizes. The MI measure is prone to overestimating low frequency items; hence words like leetel and Scald-head appear in the first rows. The T-score measure, on the other hand, prefers words with high frequency, therefore the first rows are occupied mostly by function words, such as the, a and and, and punctuation. When sorting according to absolute frequency, the results will mostly coincide with the T-score measure.
Task:
Find out, which words frequently follow the adjectives modest and powerful.
- Make sure you have selected the OBC as your corpus.
- You can use the basic, word form or CQL query types.
- Set the range to 0 to 1 – this way you are looking only for the words which directly follow the node (key word).
- Sort by logDice.
You can find the solution here.
Modest:
Powerful: