Regular expressions

Regular expressions (the term comes from the theory of formal languages, but its meaning in IT is slightly different) allow us to describe precisely the set of text strings that match a searched word or phenomenon. For this purpose, special characters with predefined meanings (wildcards) are used. Regular expressions are an essential component of the query language: in essence, we add characters with a special meaning to the words we want to search for.

A similar method can be used when searching in the corpus manager, which, however, offers a much wider range of options. If we want to find all forms of the word learn without writing them out individually and without using lemmatization, we can enter the query learn.*, in which the period stands for any character and the asterisk for any number of repetitions of the preceding (i.e. arbitrary) character. Keep in mind, however, that the manager then finds all words beginning with learn, including, for example, the word learnability. The sequence .* stands for any part of a word (or even of a tag) and is probably the most frequently used component of regular expressions. Special characters can be used anywhere in the query: at the beginning, in the middle or at the end. The following special characters are used in the KonText interface:

  • period (.) – matches one single unspecified character,
  • interval ({n,k}) – represents n to k repetitions of the preceding character or larger unit; if k is omitted ({n,}), the interval matches at least n repetitions, and if the interval is {n}, it matches exactly n repetitions,
  • asterisk (*) – indicates any number of repetitions, i.e. zero or more occurrences, of the preceding element (character or unit), making it equivalent to {0,},
  • plus (+) – indicates one or more occurrences of the preceding element, the same as {1,},
  • question mark (?) – indicates zero or one occurrence of the preceding element, identical to {0,1},
  • list ([]) – represents an alternative. It gives the option of choosing one arbitrary character from the set contained by the square brackets; if the first item on the list of characters is a caret (^), the list is negated and therefore matches one arbitrary character which is not one of those in the square brackets; it is also possible to use a hyphen in the list (-) as a range operator (e.g. [a-z], [1-9]),
  • vertical bar (|) – also represents an alternative, although not between individual characters, but entire strings forming a whole,
  • round brackets (()) – any part of the expression can be enclosed in round brackets, which turns it into a unit and thereby influences the order of its evaluation. The quantifiers mentioned above, which would otherwise apply only to the single preceding character, can then be applied to the whole unit,
  • backslash (\) – if one of the special characters is preceded by a backslash, the given character loses its special functions (which enables us to e.g. find specific punctuation marks).
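
The behaviour of these special characters can also be tried out outside the corpus manager. The following is a minimal sketch in Python, assuming that the standard re module with re.fullmatch is a reasonable stand-in for matching a whole word form. The helper matching_words and the word list are made up for illustration, and the regular-expression engine behind KonText may differ in details.

```python
import re

def matching_words(pattern, words):
    """Return the words matched in full by the pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

# Illustrative word list, not taken from any corpus
words = ["learn", "learns", "learnt", "learned", "learnability", "unlearn"]

# .* accepts any continuation of "learn", including none at all
print(matching_words(r"learn.*", words))
# ['learn', 'learns', 'learnt', 'learned', 'learnability']

# *, + and ? are shorthand for the intervals {0,}, {1,} and {0,1}
assert matching_words(r"learn.*", words) == matching_words(r"learn.{0,}", words)
assert matching_words(r"learn.+", words) == matching_words(r"learn.{1,}", words)
assert matching_words(r"learns?", words) == matching_words(r"learns{0,1}", words)

# Negated list: "learn" followed by exactly one character other than s
print(matching_words(r"learn[^s]", words))   # ['learnt']

# Round brackets turn "un" into a unit, so ? applies to both letters
print(matching_words(r"(un)?learn", words))  # ['learn', 'unlearn']

# An escaped period matches only the punctuation mark itself
print(matching_words(r"\.", [".", "a"]))     # ['.']
print(matching_words(r".", [".", "a"]))      # ['.', 'a']
```

re.fullmatch is used here because the pattern is meant to describe a whole word; with re.search, learn.* would also be found inside unlearn.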

Examples of how regular expressions may be used can be found in the following table:

Example                                                         Regular expression
words beginning with sing (sing, sings, singing)                sing.*
the word god with either a lowercase or uppercase first letter  [gG]od
period as a punctuation mark                                    \.
all prefixed derivations of the word activate                   .+activate
different lengths of the interjection haha                      ha(ha)+
two spelling variants of related forms: practise and practice   practise|practice or practi[sc]e
any number consisting of three or four digits                   [0-9]{3,4}
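
The expressions in the table above can be checked against a few sample words. This is only a sketch in Python under the assumption that re.fullmatch approximates how a query is matched against a whole word; the sample words and the helper check are invented for illustration and do not come from KonText.

```python
import re

# (pattern, words expected to match, words expected not to match)
examples = [
    (r"sing.*", ["sing", "sings", "singing"], ["sang", "cursing"]),
    (r"[gG]od", ["god", "God"], ["gods", "GOD"]),
    (r"\.", ["."], ["a", ".."]),
    (r".+activate", ["deactivate", "reactivate"], ["activate", "activated"]),
    (r"ha(ha)+", ["haha", "hahaha"], ["ha", "hah"]),
    (r"practise|practice", ["practise", "practice"], ["practiced"]),
    (r"practi[sc]e", ["practise", "practice"], ["practike"]),
    (r"[0-9]{3,4}", ["123", "2024"], ["12", "12345"]),
]

def check(pattern, should_match, should_not_match):
    """Return True if the pattern accepts and rejects exactly what is expected."""
    compiled = re.compile(pattern)
    accepted = all(compiled.fullmatch(w) for w in should_match)
    rejected = not any(compiled.fullmatch(w) for w in should_not_match)
    return accepted and rejected

for pattern, yes, no in examples:
    print(pattern, check(pattern, yes, no))
```

Note that sing.* does not cover irregular forms such as sang, and that .+activate requires at least one character before activate, so the unprefixed word itself is not matched.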

Keyboard shortcuts

On the Czech keyboard layout in MS Windows, some special characters can be typed using keyboard shortcuts (the most widely used ones are listed in the table below). In most other operating systems, the special characters are usually accessible by combining AltGr (Linux) or Alt (Mac OS X) with the key on which the given character is located on the English keyboard.

Character  Name             Keyboard shortcut
|          vertical bar     AltGr + Shift + the key under “Backspace”, or Alt + W
{}         curly brackets   AltGr + 9, AltGr + 0, or Alt + B, Alt + N
[]         square brackets  Alt + F, Alt + G
^          caret            Alt + š (or 3)
\          backslash        Ctrl + Alt + Q