Regular expressions

Regular expressions (the term comes from formal language theory, although its meaning in IT is slightly different) let us describe precisely the set of text strings that match a search term or phenomenon. They do this by adding special characters (wildcards), each with its own specific meaning, to the words we want to search for. Regular expressions are an essential component of the query language.

The same approach can be used when searching with the corpus manager, which offers a much wider range of options. If we want to find all forms of the word learn without writing them out individually and without using lemmatization, we can enter the query learn.*: the period represents any character, and the asterisk represents any number of repetitions (including none) of the preceding, i.e. arbitrary, character. Keep in mind, however, that the manager finds all words beginning with learn, including, for example, learnability. The sequence .* stands for any part of a word (or even of a tag) and is probably the most frequently used component of regular expressions. Regular expressions can be used anywhere in a query: at the beginning, in the middle or at the end. The following special characters are used in the KonText interface:

Examples of how regular expressions may be used can be found in the following table:

Example                                                         Regular expression
forms of sing that begin with sing (sing, sings, singing)       sing.*
the word god with either a lowercase or uppercase first letter  [gG]od
period as a punctuation mark                                    \.
all prefixed derivations of the word activate                   .+activate
different lengths of the interjection haha                      ha(ha)+
two spelling variants of related forms: practise and practice   practise|practice or practi[sc]e
any number consisting of three or four digits                   [0-9]{3,4}
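
The patterns in the table can be tried out outside the corpus manager. The following is a minimal sketch in Python, not part of KonText itself. It assumes that a pattern has to cover the whole word, which is why it uses re.fullmatch, and it relies on Python's re module, whose syntax may differ in details from the one used by the corpus manager. The helper names matches and filter_tokens and the sample word list are made up for illustration.

```python
import re


def matches(pattern: str, token: str) -> bool:
    """Return True if the pattern covers the entire token.

    Assumption: the pattern must match the whole word,
    so re.fullmatch is used instead of re.search.
    """
    return re.fullmatch(pattern, token) is not None


def filter_tokens(pattern: str, tokens: list[str]) -> list[str]:
    """Keep only the tokens that the pattern matches in full."""
    return [t for t in tokens if matches(pattern, t)]


# Hypothetical word list; a real corpus query runs over corpus positions.
words = ["learn", "learns", "learned", "learnability", "unlearn"]

# learn.* keeps every word that begins with learn,
# including derived words such as learnability.
print(filter_tokens(r"learn.*", words))

# Patterns taken from the table above.
print(matches(r"[gG]od", "God"))             # uppercase variant
print(matches(r"\.", "."))                   # escaped period is literal
print(matches(r".+activate", "deactivate"))  # at least one prefix character
print(matches(r"ha(ha)+", "hahaha"))         # ha followed by one or more ha
print(matches(r"practi[sc]e", "practise"))   # s or c in the fifth-to-last slot
print(matches(r"[0-9]{3,4}", "2024"))        # three or four digits
```

Under the whole-word assumption, .+activate does not match the bare word activate, because .+ requires at least one character before it, and [0-9]{3,4} rejects both 12 and 12345.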

Keyboard shortcuts

On the Czech keyboard layout in MS Windows, some special characters can be typed using keyboard shortcuts (the most widely used ones are listed in the table below). In most other operating systems, the special characters are usually accessible by combining AltGr (Linux) or Alt (Mac OS X) with the key on which the given character is located on the English keyboard.

Character  Name             Keyboard shortcut
|          vertical bar     AltGr + Shift + the key under “Backspace”, or Alt + W
{}         curly brackets   AltGr + 9, AltGr + 0, or Alt + B, Alt + N
[]         square brackets  Alt + F, Alt + G
^          caret            Alt + š (or 3)
\          backslash        Ctrl + Alt + Q