


Lesson 4: Spelling III (Searching with tags)

The OBC was part-of-speech tagged using the CLAWS 7 tagset. Each word, defined here as an uninterrupted string of characters (excluding apostrophes and hyphens) delimited by punctuation or white space, is assigned a tag which specifies the part of speech identified in the given context. This is an automatic process; you may encounter some inaccuracies, but they should be rare. For more information, see the OBC Manual.

It is important to keep in mind that synthetic genitives such as mother’s or contracted forms like don’t are counted as two words, since the CLAWS system splits them into separate items. (For mother’s, the query must then be written as follows: [word="mother"] [word="'s"])

The genitive marker and the contracted negative then have their own special tags, GE and XX respectively. However, past tense and past participle forms involving apostrophes, such as cry’d, are counted as one word.
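The splitting described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the actual CLAWS tokenizer: the helper names split_clitics and to_cql are hypothetical, and the assumption that don’t is split into do and n't follows general CLAWS practice rather than anything stated in this lesson.

```python
def split_clitics(word):
    """Hypothetical helper: split off a genitive 's or a negative n't.

    Forms such as cry'd are left untouched, since the lesson says they
    count as one word.
    """
    for clitic in ("n't", "'s"):
        if word.endswith(clitic) and len(word) > len(clitic):
            return [word[:-len(clitic)], clitic]
    return [word]


def to_cql(word):
    # One bracketed position per token, separated by a space.
    return " ".join(f'[word="{part}"]' for part in split_clitics(word))


print(to_cql("mother's"))  # [word="mother"] [word="'s"]
print(to_cql("don't"))     # [word="do"] [word="n't"]
print(to_cql("cry'd"))     # [word="cry'd"]
```

The point to take away is that a form split by the tagger needs one pair of square brackets per resulting item.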

Searching with tags

In the previous lessons, you have worked with individual specific words. Searching using tags allows you to look at a given phenomenon as it occurs across whole classes of words.

Let’s take a look at the contracted past tense and past participle ending ’d. By examining the tagset, we can see that it distinguishes different types of verbs; modal auxiliaries are tagged as VM, infinitive forms as VVI and so on. In the standard tagset, there is no specific tag for verbs in the forms we are looking for. The easiest way to find the contracted past tense and past participle forms is therefore to search for all verbs that end in ’d. To do so, use regular expressions, in particular the sequence .* (full stop followed by asterisk). The full stop . represents any one character, while the asterisk * matches zero, one or more repetitions of the previous character. Together, .* therefore matches any string of characters, including an empty one, so it can stand for any part of a word or a tag. We can see in the tagset that all verb tags begin with V, so we can replace the rest of the tag with the regular expression .* to match any verbal tag.
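To see what the pattern V.* selects, here is a minimal Python sketch. The tag list is just a small sample chosen for illustration, and re.fullmatch is used on the assumption that CQL compares the pattern with the whole attribute value rather than with a part of it.

```python
import re

# A small sample of CLAWS 7 tags, chosen only for illustration.
tags = ["VM", "VVI", "VVD", "NN1", "PN1", "GE", "XX"]

# The pattern must cover the whole tag, hence fullmatch.
verb_tags = [t for t in tags if re.fullmatch(r"V.*", t)]

print(verb_tags)  # ['VM', 'VVI', 'VVD']
```

Only the tags beginning with V are kept; whatever follows the V is covered by .* regardless of its length.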

Make sure the query type is set to CQL. You may also set the default attribute below the search window to tag, although this is not necessary. If you do so, the square brackets and the attribute name can be left out of the query (i.e. you can type only "V.*" into the search box). You may start the query like this:

[tag="V.*"]

This query alone would find all verbs in the corpus, but we need to limit the search to verbs which end with ’d. For this, you can use the ampersand symbol (&), which works as a logical AND. When you connect two or more attributes with &, the resulting concordance will include only those occurrences which fulfil all the conditions specified in the query. The second part of the query is the word attribute; to look for any word which ends with ’d, we can use another regular expression. This time, we want to use the symbol + instead of *, since + represents one or more repetitions of the previous character; this way, a bare ’d on its own cannot appear in the concordance.

[tag="V.*" & word=".+'d"]

With this query, we are searching for all words which are tagged as verbs and which at the same time end with ’d. The number of hits is 51,705 and the relative frequency is 1,459.09.
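The effect of combining two conditions with &, and the difference between + and *, can be tried out on a handful of tokens in Python. The (word, tag) pairs below are invented for illustration and are not taken from the OBC; the helper name matches is hypothetical, and re.fullmatch again assumes whole-value matching.

```python
import re

# Invented (word, tag) pairs for illustration; not actual OBC data.
tokens = [
    ("cry'd", "VVX"),
    ("'d", "VM"),       # a bare 'd, e.g. a contracted modal
    ("asked", "VVD"),
    ("belov'd", "JJ"),  # ends in 'd but is not tagged as a verb
    ("call'd", "VVX"),
]


def matches(word, tag, word_pattern):
    # Both conditions must hold, as with & in the query.
    return bool(re.fullmatch(r"V.*", tag)) and bool(re.fullmatch(word_pattern, word))


with_plus = [w for w, t in tokens if matches(w, t, r".+'d")]
with_star = [w for w, t in tokens if matches(w, t, r".*'d")]

print(with_plus)  # ["cry'd", "call'd"]
print(with_star)  # ["cry'd", "'d", "call'd"]
```

With .+ at least one character must precede 'd, so the bare 'd is excluded; with .* it would be included.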

To view the tags of any of the words included in the concordance, hover over the individual words or elements.

Here, the word something is tagged as PN1, which corresponds to indefinite pronoun, singular in the tagset.

You may change this setting by clicking on View → Corpus-specific settings and selecting a different option listed under How to display additional positional attributes?.

You may have noticed that the forms you searched for (KWIC) are tagged as VVX. This tag is not part of the standard CLAWS 7 tagset; it was added specifically for the OBC during the tagging process. To read more about the corrections, see the OBC Manual, page 12. It is therefore advisable not to rely on the tagset alone, but to check the actual tagging of the string in question and build your query accordingly.

Let’s check the frequency list (Frequency → Node forms [A=a]) to see which verbs are most commonly contracted in this way.

To compare the frequency of the contracted forms with the full forms, let’s do a quick search for the full forms of the top four most frequent contracted verbs:

[word="deposed" | word="asked" | word="called" | word="robbed"]
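If you want to repeat this comparison for a different set of verbs, a query of the same shape can be assembled from a list of word forms. The following Python sketch shows one way to do it; the function name any_of is hypothetical.

```python
def any_of(words):
    """Build a single-position query matching any of the given word forms."""
    alternatives = " | ".join(f'word="{w}"' for w in words)
    return f"[{alternatives}]"


query = any_of(["deposed", "asked", "called", "robbed"])
print(query)
# [word="deposed" | word="asked" | word="called" | word="robbed"]
```

The vertical bar | works as a logical OR, so a token is matched if it satisfies any one of the listed conditions.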

Go to the frequency list (Frequency → Node forms [A=a]) and compare:

Task:

  • Try to find all plural nouns in the genitive case which are formed with the ’s suffix
  • Keep in mind the different tags for different classes of nouns
  • Make sure the query type is set to CQL
  • Notice the spelling conventions: can you find an example in which the genitive ’s follows the plural -s? How frequent is it?

Solution in KonText here:

Query: [tag="N.*2"][tag="GE"]

prisoners’s 14x, prosecutors’s 10x
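The solution query consists of two bracketed positions, so it matches two neighbouring tokens: a noun whose tag ends in 2, immediately followed by an item tagged GE. A rough Python sketch of this logic is given below; the token list is invented for illustration and the function name find_pairs is hypothetical.

```python
import re

# Invented tagged tokens for illustration; not actual OBC data.
tokens = [
    ("the", "AT"), ("prisoners", "NN2"), ("'s", "GE"),
    ("mother", "NN1"), ("'s", "GE"), ("house", "NN1"),
]


def find_pairs(tokens, first_tag, second_tag):
    """Return joined forms of adjacent tokens whose tags match the two patterns."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if re.fullmatch(first_tag, t1) and re.fullmatch(second_tag, t2):
            hits.append(w1 + w2)
    return hits


plural_genitives = find_pairs(tokens, r"N.*2", r"GE")
print(plural_genitives)  # ["prisoners's"]
```

The singular mother + 's is not returned, because NN1 does not end in 2.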

Proceed to Lesson 5.