Query languages are used to query database systems in information technologies; every system uses a query language with precisely defined syntax.
When working with language corpora, the query language is used to enter queries into corpus managers, concordance programs, etc. Even here the individual languages usually differ, although they are all based on regular expressions, which they extend and adapt to their own needs.
The query language used in the ČNK corpora operating on the corpus manager Manatee is called CQL (corpus query language) and is in fact a modified version of the original CQL created for the corpus manager CWB. Its cornerstone is a query for a single position (word) in the corpus:
[attribute="value"]
where the attribute is positional (word, lemma, tag etc.) and the value is the search term itself or a pattern specified with the help of regular expressions. The query can also include restrictions on structural attributes (sentence, doc, opus), for which further values can be specified (e.g. for opuses the publication year, genre, author etc.). Unlike restrictions on positional attributes, restrictions on structural attributes are written in angle brackets (e.g. <s id="10"/>); see the more detailed and complete description of CQL. CQL is a formal language with a precise (and finite) definition. It supports some elements of traditional regular languages 1), but it also supports extended, specifically corpus-related commands such as within, meet, union or containing, which work with the structure of the corpus.
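The two basic building blocks described above, a query for one position and a restriction on a structure, can be sketched as simple string builders. This is only an illustration in Python: the function names position and structure are hypothetical and not part of Manatee or any corpus tool, and the sketch rejects values containing a double quote rather than guessing at escaping rules.

```python
def position(attribute: str, value: str) -> str:
    """Build a CQL query for a single corpus position, e.g. [lemma="have"].

    `value` is passed through unchanged, so it may be a plain search term
    or a regular-expression pattern such as "N.*".
    """
    if '"' in value:
        # Hypothetical safeguard: escaping rules are not covered in this sketch.
        raise ValueError("double quotes in values are not handled here")
    return f'[{attribute}="{value}"]'


def structure(name: str, **attributes: str) -> str:
    """Build a structure restriction in angle brackets, e.g. <s id="10"/>."""
    parts = [name] + [f'{key}="{val}"' for key, val in attributes.items()]
    return "<" + " ".join(parts) + "/>"


print(position("lemma", "have"))   # [lemma="have"]
print(position("tag", "N.*"))      # [tag="N.*"]
print(structure("s", id="10"))     # <s id="10"/>
print(structure("s"))              # <s/>
```

The resulting strings are what would be typed into the corpus manager's query field; the sketch does not send them anywhere.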
A simultaneous query for more than one position (i.e. a word sequence or wider context) is formed simply by concatenating the individual queries for each successive position. E.g. the query [lemma="have"][][lemma="heart"] searches for all occurrences of the lemmas have and heart with exactly one position (i.e. a word or punctuation mark) between them.
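To make the idea of concatenated positions concrete, the following Python sketch evaluates the query [lemma="have"][][lemma="heart"] against a small hand-made list of tokens. It is a toy model, not how Manatee works internally: the token data and the names Token, Position and find_sequence are invented for illustration, and the sketch assumes that a value has to match the whole attribute value (hence re.fullmatch).

```python
import re
from typing import Dict, List, Optional, Tuple

Token = Dict[str, str]                  # e.g. {"word": "has", "lemma": "have"}
Position = Optional[Tuple[str, str]]    # (attribute, pattern), or None for []


def find_sequence(tokens: List[Token], pattern: List[Position]) -> List[int]:
    """Return the start index of every place where `pattern` matches.

    Each pattern element is checked against one consecutive token.
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            pos is None or re.fullmatch(pos[1], tok.get(pos[0], "")) is not None
            for pos, tok in zip(pattern, window)
        ):
            hits.append(start)
    return hits


# Hypothetical, hand-lemmatized sample: "She has a heart of gold"
tokens = [
    {"word": "She", "lemma": "she"},
    {"word": "has", "lemma": "have"},
    {"word": "a", "lemma": "a"},
    {"word": "heart", "lemma": "heart"},
    {"word": "of", "lemma": "of"},
    {"word": "gold", "lemma": "gold"},
]

# [lemma="have"][][lemma="heart"]
query = [("lemma", "have"), None, ("lemma", "heart")]
print(find_sequence(tokens, query))  # [1]
```

The empty position [] is modelled as None, which accepts any token, so the match begins at "has" and spans "has a heart".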
The following example of the Manatee corpus manager's query language finds all instances of constructions such as “neither woman nor man”, “neither man nor beast” etc. that occur in the corpus within one sentence (the structure <s/>, see structural attributes):
[lemma="neither"] [tag="N.*"] []{0,1} [lemma="nor"] [tag="N.*"] within <s/>
Each position in the sequence is represented by one pair of square brackets, possibly followed by a quantifier in curly brackets. The first position represents all words lemmatized as “neither”; the second represents all nouns (word forms whose morphological tag begins with the letter “N”, followed by an arbitrary sequence of characters); the third is occupied by any one word (or none); the fourth is limited to the lemma “nor”; and the fifth again requires a noun tag. The directive “within” restricts the entire query to the scope of one structure “<s/>” (i.e. one sentence). It is also possible to use the directive containing for this particular purpose.
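The behaviour of the quantifier {0,1} and of within <s/> can be illustrated with a small Python sketch that runs the neither/nor query over a toy corpus. Everything here is an assumption made for illustration: the sentences, the tag values (NN, CC, PR, VB, Z), the names Step, QUERY, match_from and find_within_sentences, and the choice to try the shortest repetition first. It models the description above and says nothing about how Manatee is implemented.

```python
import re
from typing import Dict, List, Optional, Tuple

Token = Dict[str, str]
Condition = Optional[Tuple[str, str]]      # (attribute, pattern), or None for []
Step = Tuple[Condition, int, int]          # condition, min and max repetitions

# [lemma="neither"] [tag="N.*"] []{0,1} [lemma="nor"] [tag="N.*"]
QUERY: List[Step] = [
    (("lemma", "neither"), 1, 1),
    (("tag", "N.*"), 1, 1),
    (None, 0, 1),
    (("lemma", "nor"), 1, 1),
    (("tag", "N.*"), 1, 1),
]


def token_ok(token: Token, cond: Condition) -> bool:
    """[] accepts any token; otherwise the pattern must match the whole value."""
    return cond is None or re.fullmatch(cond[1], token.get(cond[0], "")) is not None


def match_from(sentence: List[Token], pos: int, steps: List[Step]) -> Optional[int]:
    """Return the end index of a match starting at `pos`, or None."""
    if not steps:
        return pos
    cond, lo, hi = steps[0]
    for n in range(lo, hi + 1):            # try each allowed repetition count
        end = pos + n
        if end > len(sentence):
            break
        if not all(token_ok(tok, cond) for tok in sentence[pos:end]):
            break                          # longer repetitions would fail too
        result = match_from(sentence, end, steps[1:])
        if result is not None:
            return result
    return None


def find_within_sentences(sentences: List[List[Token]], steps: List[Step]):
    """Search each sentence separately, so no match can cross a boundary."""
    hits = []
    for s_idx, sentence in enumerate(sentences):
        for start in range(len(sentence)):
            end = match_from(sentence, start, steps)
            if end is not None and end > start:
                hits.append((s_idx, start, end))
    return hits


def tok(lemma: str, tag: str) -> Token:
    return {"lemma": lemma, "tag": tag}


sentences = [
    [tok("neither", "CC"), tok("man", "NN"), tok("nor", "CC"), tok("beast", "NN")],
    [tok("neither", "CC"), tok("fish", "NN"), tok(",", "Z"),
     tok("nor", "CC"), tok("fowl", "NN")],
    [tok("he", "PR"), tok("see", "VB"), tok("neither", "CC"), tok("man", "NN")],
    [tok("nor", "CC"), tok("beast", "NN"), tok("come", "VB")],
]

print(find_within_sentences(sentences, QUERY))  # [(0, 0, 4), (1, 0, 5)]
```

The first sentence matches with the optional position left empty, the second with the comma filling it. The third and fourth sentences together contain all the required words, but because each sentence is searched on its own, the sequence split across the sentence boundary is not reported, which is the effect that within <s/> is described as having.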
For work with a corpus manager it is advisable to know the query language used and the possibilities it offers. Although some user interfaces make it possible to input queries without knowledge of the specific query language, the possibilities of working with such an interface tend to be somewhat limited. This is a result of the effort to make the interface user-friendly and as comprehensible as possible, which is always achieved at the expense of the possibilities and combinations available to the user.