In this lesson, we will focus on [[https://wiki.korpus.cz/doku.php/en:manualy:kontext:kolokace|collocations]], i.e. meaningful, fixed, syntagmatic sequences of two (or more) words in immediate proximity. The KonText interface allows us to create collocation lists for a given word, which helps us determine in what contexts the word or phenomenon typically occurs.

To identify a collocation, [[http://www.collocations.de/AM/index.html|association measures]] are used. The interface offers the following association measures: //t-score//, //MI//, //MI3//, //log likelihood//, //min. sensitivity//, //logDice//, //MI.log_f//, //relative frequency//. It is recommended to combine the measures: each works differently and favours different kinds of associations, so no single measure is suitable every time. Additionally, the scores produced by different association measures cannot be directly compared.

**Searching the corpus**
To search for collocations, we first need to create a concordance of the word whose collocations we want to find. In the case of the OBC, you may be interested in, for example, the words or phrases used to describe men or women in the proceedings, or in whether particular attributes were more commonly ascribed to a specific gender (to read more about gender in the proceedings, see [[https://www.oldbaileyonline.org/static/Gender.jsp?fbclid=IwAR3qoxZI0K2DLIednXk1F8ps4escY1a-CFKQPeslZWnActoXt6No2X2hjcA#researchinggender|the Old Bailey Online page on gender]]).

Let’s create a concordance of the word //boy//. Select the OBC from the corpora list and set the query type to //Basic//. Search for the form //boy// and, when the concordance appears, click on //Collocations// in the top menu, then on //Custom//. A form for creating a collocation list appears, in which you can specify the values used in the search for collocations.

  - **Attribute:** You can select either //word// or //tag//; the collocation list will then consist of either specific words or part-of-speech [[http://ucrel.lancs.ac.uk/claws7tags.html|tags]].
  - **Collocation window span:** Specifies the proximity to the key word. The default value is -3 to 3, which means that all words occurring in the first, second and third positions to the left and to the right of the key word are considered.
  - **Minimum collocate frequency in the corpus:** Determines the least number of occurrences in the concordance for the word/tag to be included in the collocation list. The default minimum frequency is 3, which means that forms with fewer occurrences in the concordance will not be included in the list of collocates.
  - **Minimum collocate frequency in the span:** Determines how frequently an item must co-occur with the key word to be included in the list.
  - **Collocation measures:** Here, you can select which association measures will be calculated and used in the search for collocations, and according to which measure the list should be sorted.

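The effect of the window span and of the minimum frequency in the span can be illustrated outside the interface. The following is a minimal Python sketch, not KonText's own implementation: the function name ''collocate_candidates'', its parameters and the sample sentence are all hypothetical, and the threshold on the overall collocate frequency is left out for brevity.

```python
from collections import Counter

def collocate_candidates(tokens, node, left=-3, right=3, min_span_freq=3):
    """Count the words that occur within the window around each hit of `node`.

    `left` and `right` are positions relative to the node (default -3 to 3);
    position 0, the node itself, is skipped. Hypothetical helper, for
    illustration only.
    """
    span_freq = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(left, right + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue  # skip the node itself and positions outside the text
            span_freq[tokens[j]] += 1
    # keep only items that co-occur with the node often enough
    return {w: c for w, c in span_freq.items() if c >= min_span_freq}

# Invented sample text, not taken from the OBC.
SAMPLE = "the little boy saw a boy and the boy ran".split()

print(collocate_candidates(SAMPLE, "boy"))
print(collocate_candidates(SAMPLE, "boy", left=1, right=1, min_span_freq=1))
```

The first call uses the default window of -3 to 3; the second narrows the window to the single position directly to the right of the node, so only the immediate right-hand neighbours are counted.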
{{:en:obc:l8_1.png?direct&400|}}

Once you are satisfied with your selection, click on the //Make candidate list// button. Note that the interface does not provide a list of collocations per se, but rather //candidates// for collocations; as mentioned above, each measure is calculated differently, and it is up to the researcher to decide on the relevance of each potential collocate.

{{:en:obc:l8_2.png?direct&400|}}

Try rearranging the list by sorting according to different association measures. We have selected //logDice// as our first sorting value. This measure is based only on the frequency of the node (key word), the frequency of the collocate and the frequency of the whole collocation; it is unaffected by the size of the corpus and is therefore suitable for comparing results from corpora of different sizes. The MI measure is prone to overestimating low-frequency items; hence words like //leetel// and //Scald-head// appear in the first rows. The T-score measure, on the other hand, prefers high-frequency words, so the first rows are occupied mostly by function words, such as //the//, //a// and //and//, and by punctuation. When sorting by absolute frequency, the results will mostly coincide with those of the T-score measure.

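To see why the measures rank candidates so differently, the sketch below computes MI, T-score and logDice for two invented candidates. The formulas are the ones commonly given for these measures in the Sketch Engine/Manatee statistics documentation; that KonText computes them in exactly this form is an assumption, and all frequencies are made up for illustration.

```python
import math

def mi(f_ab, f_a, f_b, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_ab * n / (f_a * f_b))

def t_score(f_ab, f_a, f_b, n):
    """T-score: observed minus expected co-occurrence, scaled by sqrt(observed)."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)

def log_dice(f_ab, f_a, f_b):
    """logDice: uses only the three frequencies, not the corpus size."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# Invented figures: corpus size and node frequency.
N = 1_000_000
F_NODE = 1_000

# A rare word that occurs only next to the node, and a frequent function word.
RARE = {"f_ab": 3, "f_b": 3}
FUNCTION_WORD = {"f_ab": 600, "f_b": 50_000}

for name, c in (("rare word", RARE), ("function word", FUNCTION_WORD)):
    print(
        name,
        round(mi(c["f_ab"], F_NODE, c["f_b"], N), 2),
        round(t_score(c["f_ab"], F_NODE, c["f_b"], N), 2),
        round(log_dice(c["f_ab"], F_NODE, c["f_b"]), 2),
    )
```

With these figures, MI ranks the rare word higher and T-score ranks the function word higher, which mirrors the behaviour described above. The logDice function takes no corpus size at all, which is why its scores can be compared across corpora of different sizes.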
<WRAP round help 40%>
**Task:**

Find out which words frequently follow the adjectives //modest// and //powerful//.

  * Make sure you have selected the OBC as your corpus.
  * You can use the basic, word form or CQL query types.
  * Set the range to 0 to 1; this way you are looking only for the words which directly follow the node (key word).
  * Sort by logDice.
</WRAP>

You can find the solution [[en:obc:solution#lesson_8|here]].

And that's it. [[en:obc:start|Back to the OBC main page]]