Differences

This shows you the differences between two versions of the page.

en:eebo:collocations [2016/09/27 05:06] – [Association measures] kristinavalentinyova
en:eebo:collocations [2018/07/30 14:49] (current) vaclavcvrcek
Line 8: Line 8:
 Thanks to corpus linguistics, anyone can easily identify which strings of words are established collocations and which are not. A collocation consists of a key word (a node, which is usually also the [[en:pojmy:kwic|KWIC]]) and a contextual word (a collocate).

-Let's try finding collocations of a word such as //bread//. We select EEBO as the corpus we wish to work with and then use a basic query type. After clicking on the search button, concordance lines will appear. //Bread//, a key word, is always located in the middle of the line (higlighted in pink). We then click on the **collocations** button located in the upper menu and select **custom** from the dropdown menu.
+Let's try finding collocations of a word such as //bread//. We select [[en:cnk:eebo|EEBO]] as the corpus we wish to work with and then use a basic query type. After clicking on the search button, concordance lines will appear. //Bread//, the key word, is always located in the middle of the line (highlighted in pink). We then click on the **collocations** button located in the upper menu and select **custom** from the dropdown menu.

 [{{eebo-9.png?500|Form for collocation candidates}}]
Line 27: Line 27:
  
 <WRAP round tip 40%>
-Don't worry about the grammatical words and punctuation marks in the first positions. Function words such as prepositions and articles are the most common words in any language and therefore they frequently co-occur with //bread//. As the EEBO corpus is not lemmatized, it is not possible to restrict the search to adjectives and nouns only.
+Don't worry about the grammatical words and punctuation marks in the first positions. Function words such as prepositions and articles are the most common words in any language and therefore they frequently co-occur with any word, even //bread//. As the EEBO corpus is not lemmatized, it is not possible to restrict the search to adjectives and nouns only.
 </WRAP>
  
Line 35: Line 35:
   * tea
   * war
-You can modify the range within which you wish to search for the collocates.
+We can always modify the range within which we wish to search for the collocates.
 </WRAP>

 ======= Association measures =======

-Association measures are used to identify a collocation.  Each of the measures is sensitive to different kinds of phrases and each might not work in some cases. Consider the following table and  see how  much can the order of collocate candidates vary depending on which  association measure is selected.
+Association measures are used to identify collocations. Each measure is sensitive to different kinds of phrases, and each may fail in some cases. Consider the following table and see how much the order of the collocate candidates can vary depending on which association measure is selected.

 ^ Collocate ^  Frequency  ^  MI  ^  Log likelihood  ^ T-score ^
Line 50: Line 50:
  
 How can we interpret these results?
-  * **MI** prefers words with lower frequency and therefore the results may be biased. In our example we can see that the first five positions are filled with the spelling variants of the same word. Although the results are not absolutely satisfactory, they provide a proof of an established collocation such as //unleavened bread//.
+  * **MI** prefers words with lower frequency, and therefore the results include the word //unleavened//, which is used almost exclusively with //bread// or other pastries such as //cake//, //loaf// or //biscuit//. In our example, we can see that the first five positions are filled with spelling variants of the same word. Although the results are not entirely satisfactory, they provide evidence of an established collocation such as //unleavened bread//.
   * **T-score** is based on the co-occurrence frequency and therefore the results of T-score and frequency almost coincide. This association measure prefers words with a high frequency and therefore there are mostly grammatical words and punctuation marks in the first positions. Established collocations may be found in the lower positions of the list.
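
The preferences described above follow from how the measures are computed. Below is a minimal Python sketch, assuming the commonly cited formulations of MI, T-score and logDice; the exact variants used by the corpus interface may differ in detail. The function names and all counts are invented for illustration and are not taken from EEBO.

```python
import math


def mi_score(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed vs. expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """T-score: (observed - expected) scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2 of the Dice coefficient (independent of corpus size)."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: corpus size, node frequency, and two collocates.
N = 1_000_000          # tokens in the corpus (assumed)
F_NODE = 1_000         # frequency of the node word (assumed)
RARE = {"f_y": 50, "f_xy": 40}          # rare word, almost always next to the node
FREQUENT = {"f_y": 50_000, "f_xy": 200}  # function word, co-occurs often by chance

if __name__ == "__main__":
    for label, c in (("rare", RARE), ("frequent", FREQUENT)):
        print(
            label,
            round(mi_score(c["f_xy"], F_NODE, c["f_y"], N), 2),
            round(t_score(c["f_xy"], F_NODE, c["f_y"], N), 2),
            round(log_dice(c["f_xy"], F_NODE, c["f_y"]), 2),
        )
```

With these invented counts, MI ranks the rare collocate higher, while T-score ranks the frequent one higher, which mirrors the behaviour described in the list above.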
  
Line 58: Line 58:
   * The negative numbers indicate the positions preceding the key word, while the positive ones refer to the positions following it.
   * Minimum frequency in corpus: sets the minimum overall frequency a unit must have in the corpus to be included in the collocate list
-  * Minimum frequency in given range: provided that we specified the context span for collocate search from -3 to 3, then the minimum frequency in given range optiom determines how frequently should an item co-occur with KWIC to be included in the collocate list
+  * Minimum frequency in given range: if the context span for the collocate search is set from -3 to 3, this option determines how often an item must co-occur with the KWIC within that span to be included in the collocate list
 </WRAP>
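
To make the range and the two frequency thresholds concrete, here is a small Python sketch. It is only an illustration of the idea, not the way the corpus interface works internally; the function `collocate_candidates`, its parameter names and the sample sentence are all hypothetical.

```python
from collections import Counter


def collocate_candidates(tokens, node, left=-3, right=3,
                         min_corpus_freq=1, min_range_freq=1):
    """Count words found within [left, right] positions of each node occurrence.

    A word is kept only if its overall frequency reaches min_corpus_freq
    and its frequency inside the range reaches min_range_freq.
    """
    corpus_freq = Counter(tokens)   # overall frequency of every token
    range_freq = Counter()          # frequency inside the context span
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(left, right + 1):
            j = i + offset
            # Skip the node itself and positions outside the text.
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            range_freq[tokens[j]] += 1
    return {
        word: count
        for word, count in range_freq.items()
        if count >= min_range_freq and corpus_freq[word] >= min_corpus_freq
    }


if __name__ == "__main__":
    sample = "give us this day our daily bread and forgive us".split()
    print(collocate_candidates(sample, "bread"))
    print(collocate_candidates(sample, "bread", min_corpus_freq=2))
```

Negative offsets correspond to positions before the key word and positive offsets to positions after it, so `left=-3, right=3` matches the range from -3 to 3 used in this lesson.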
  
 <WRAP round help 40%>
-Look at the lists of words below. Using the EEBO corpus, find out which collocate with the following three near synonyms: //godly, divine or sacred//? Each of the synonyms is used in slightly different contexts as can be inferred  from the three lists of collocates.
+Look at the lists of words below. Using the EEBO corpus, find out which words collocate with each of the following three near synonyms: //godly//, //divine// and //sacred//.
+</WRAP>
+
+Each of the synonyms is used in slightly different contexts, as can be inferred from the three lists of collocates.
   * Set the range **from -3 to 3**
   * Sort by **logDice**
-</WRAP>
+
 ^Near synonyms ^  ^    ^
 ^1st collocate |sorrow|Majesty|Nature|
Line 73: Line 76:
 ^5th collocate |men|Person|humane|
  
+----
+
+**If you are ready, you can continue to [[en:eebo:morphology1|Lesson 6]].**

+----