Lesson 5: Collocations

In this section we will look at collocations, i.e. meaningful, fixed, syntagmatic sequences of two (or more) words in immediate proximity. In any language, every word tends to co-occur with certain words more often than with others. John R. Firth, an English linguist, famously said that “You shall know a word by the company it keeps” (Firth, J. R. 1957:11). What he meant was that the contexts in which a word occurs tell us a great deal about the word itself. If we think of the word tea, the following phrases come to mind: black tea, strong tea, cup of tea or tea leaves. These are all collocations of the word tea. Using a corpus, we can show statistically that some words are more likely than others to collocate with tea, and that combinations such as powerful tea do not sound native-like, even though powerful is a synonym of strong.

Finding collocations of a word

Thanks to corpus linguistics, anyone can easily identify which strings of words are established collocations and which are not. A collocation consists of a key word (a node, which is usually also the KWIC) and a contextual word (a collocate).

Let's try finding collocations of a word such as bread. We select EEBO as the corpus we wish to work with and then use the basic query type. After we click on the search button, concordance lines will appear. The key word, bread, is always located in the middle of the line (highlighted in pink). We then click on the collocations button located in the upper menu and select custom from the dropdown menu.

Form for collocation candidates

First of all, we need to decide in which range we will search for collocations. The default range from -3 to 3 around the KWIC is suitable for the majority of searches. If we wish to find out what kind of adjectives collocate with the word bread, we need to set the range from -1 to -1. This setting restricts the span to the first position to the left of the key word. We should therefore be able to determine which adjectives frequently modified the word bread in the period that EEBO covers.
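The effect of the range setting can be illustrated outside the corpus interface. The following is only a minimal sketch: the function name collocates_in_range and the toy sentence are invented for illustration and are neither EEBO data nor the search tool's own implementation.

```python
from collections import Counter

def collocates_in_range(tokens, node, left=-3, right=3):
    """Count the words found at offsets left..right around each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(left, right + 1):
            j = i + offset
            # Skip the node position itself and positions outside the text.
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            counts[tokens[j]] += 1
    return counts

# Toy token list, invented for illustration (not EEBO data).
tokens = "give us this day our daily bread and white bread and daily bread".split()

# Range -1 to -1: only the word immediately to the left of the key word.
narrow = collocates_in_range(tokens, "bread", -1, -1)
# Default range -3 to 3: three positions on either side of the key word.
wide = collocates_in_range(tokens, "bread", -3, 3)

print(narrow.most_common())
print(wide.most_common())
```

With the narrow range only the modifiers directly preceding bread are counted, while the wider range also picks up function words such as and.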

Under the heading show functions: we can choose which measures of association we wish to be calculated for bread. Association measures are used to identify which collocations are statistically significant and which are merely coincidental. Each of the measures is sensitive to different kinds of phrases and each might not work in some cases. It is therefore recommended to combine the measures and compare their output.
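To give an idea of what such measures calculate, here is a minimal sketch of three of them (T-score, mutual information and log likelihood) in their commonly cited forms. It assumes that we know the corpus size n, the frequency of the key word f_x, the frequency of the collocate f_y and their co-occurrence frequency f_xy. The formulas actually used by the corpus interface may differ in detail (for example in how the size of the range is handled), and all the counts below are invented.

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """T-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def mutual_information(f_xy, f_x, f_y, n):
    """MI: log2 of the ratio between observed and expected co-occurrences."""
    return math.log2(f_xy * n / (f_x * f_y))

def log_likelihood(f_xy, f_x, f_y, n):
    """Log likelihood (G2) over the 2x2 contingency table of key word and collocate."""
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    expected = [
        f_x * f_y / n,
        f_x * (n - f_y) / n,
        (n - f_x) * f_y / n,
        (n - f_x) * (n - f_y) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Invented counts: a corpus of 1,000,000 tokens, a key word occurring 1,000 times,
# a collocate occurring 50 times, and 40 co-occurrences of the two.
n, f_node, f_coll, f_pair = 1_000_000, 1_000, 50, 40

print("T-score:       ", round(t_score(f_pair, f_node, f_coll, n), 2))
print("MI:            ", round(mutual_information(f_pair, f_node, f_coll, n), 2))
print("Log likelihood:", round(log_likelihood(f_pair, f_node, f_coll, n), 2))
```

All three measures compare the observed co-occurrence frequency with the frequency we would expect if the two words were distributed independently; they differ in how they weigh that difference.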

For example, we can select the following measures: T-score, mutual information (MI) and log likelihood, and then choose the association measure by which we wish to sort the results. The order of collocation candidates might vary depending on which association measure is selected. Afterwards we click on the make candidate list button.

If you sort the list by log likelihood, the following words should appear in the first five positions of the list:

Collocation candidates for bread

Don't worry about the grammatical words and punctuation marks in the first positions. Function words such as prepositions and articles are the most common words in any language and therefore they frequently co-occur with any word, even bread. As the EEBO corpus is not lemmatized, it is not possible to restrict the search to adjectives and nouns only.

In the list of the first 50 collocation candidates, there are other words that frequently modify the word bread, such as this, childrens, grated, common, Sacramental, leavened, consecrated, white or browne. Based on this selection of collocates, it is clear that the word is used in two different senses: as a type of food and as altar bread used during religious ceremonies.

Let's try searching for collocates of the following words in the EEBO corpus:

  • tea
  • war

We can always modify the range within which we wish to search for the collocates.

Association measures

Association measures are used to identify a collocation. Each of the measures is sensitive to different kinds of phrases and each might not work in some cases. Consider the following table and see how much the order of the collocate candidates can vary depending on which association measure is selected.

Collocate       Frequency   MI           Log likelihood   T-score
1st collocate   the         Vnleauened   of               of
2nd collocate   of          vnleavened   the              the
3rd collocate   this        Unleavened   unleavened       this
4th collocate   ,           unleavened   daily            -
5th collocate   that        Vnleavened   ,                daily

How can we interpret these results?

  • MI prefers words with lower frequency and therefore the results include the word unleavened, which is used almost exclusively with bread or other baked goods such as cake, loaf or biscuit. In our example, we can see that the first five positions are filled with spelling variants of the same word. Although the results are not entirely satisfactory, they provide evidence of an established collocation, unleavened bread.
  • T-score is based on the co-occurrence frequency and therefore the results of T-score and frequency almost coincide. This association measure prefers words with a high frequency and therefore there are mostly grammatical words and punctuation marks in the first positions. Established collocations may be found in the lower positions of the list.
  • Don't forget to always adjust the range in which you wish to search for collocates.
  • The negative numbers indicate the positions preceding the key word, while the positive ones indicate the positions following it.
  • Minimum frequency in corpus: sets the minimum overall frequency that a unit must have in order to be included in the collocate list.
  • Minimum frequency in given range: if we specify the context span for the collocate search as -3 to 3, this option determines how often an item must co-occur with the KWIC within that span to be included in the collocate list.
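The points above can be tried out on a handful of invented numbers. The sketch below assumes the commonly cited formulas for MI and T-score; the corpus size, all frequencies and the placeholder word rare_typo are hypothetical, and the two keyword arguments only imitate the minimum frequency options of the form.

```python
import math

N = 1_000_000     # hypothetical corpus size in tokens
F_NODE = 2_000    # hypothetical frequency of the key word bread

# collocate -> (frequency in corpus, co-occurrence frequency in the given range)
# All numbers are invented for illustration.
candidates = {
    "the":        (60_000, 600),
    "of":         (40_000, 450),
    "unleavened": (60, 50),
    "vnleavened": (11, 10),
    "rare_typo":  (2, 2),   # placeholder for a very rare word form
}

def mi(f_xy, f_y):
    """Mutual information of a collocate with the key word."""
    return math.log2(f_xy * N / (F_NODE * f_y))

def t_score(f_xy, f_y):
    """T-score of a collocate with the key word."""
    return (f_xy - F_NODE * f_y / N) / math.sqrt(f_xy)

def ranked(measure, min_corpus=1, min_range=1):
    """Apply both minimum frequency thresholds, then sort by the chosen measure."""
    kept = {
        word: measure(f_xy, f_y)
        for word, (f_y, f_xy) in candidates.items()
        if f_y >= min_corpus and f_xy >= min_range
    }
    return sorted(kept, key=kept.get, reverse=True)

print("MI:               ", ranked(mi))
print("T-score:          ", ranked(t_score))
print("MI, min. range 5: ", ranked(mi, min_range=5))
```

In this toy example MI puts the rarest forms first and T-score puts the function words first, while raising the minimum frequency in the given range removes the very rare form from the MI list.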

Look at the lists of words below. Using the EEBO corpus, find out which list of collocates belongs to which of the following three near synonyms: godly, divine and sacred.

Each of the synonyms is used in slightly different contexts as can be inferred from the three lists of collocates.

  • Set the range from -3 to 3
  • Sort by logDice

Near synonyms
1st collocate   sorrow      Majesty      Nature
2nd collocate   learned     Majeſty      Providence
3rd collocate   man         Scriptures   Service
4th collocate   Miniſters   Writ         Revelation
5th collocate   men         Person       humane
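The exercise above sorts the candidates by logDice. As a rough guide to what this score expresses, here is a minimal sketch based on the commonly given definition, 14 + log2(2 * f_xy / (f_x + f_y)). That the corpus interface uses exactly this form is an assumption, and the counts are invented.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice: 14 plus log2 of the Dice coefficient of key word and collocate."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# The theoretical maximum of 14 is reached when the two words always occur together.
print(log_dice(10, 10, 10))

# Invented counts: a key word with 5,000 hits, a collocate with 3,000 hits,
# and 400 co-occurrences within the given range.
print(round(log_dice(400, 5_000, 3_000), 2))
```

Unlike MI and T-score, this score does not depend on the total size of the corpus, only on the frequencies of the two words and of their co-occurrence.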

If you are ready, you can continue to Lesson 6.