Form for creating a query

A new corpus search can be started at any time via Query → New query. Choosing this option discards the previous query together with any results it produced. The following text deals primarily with creating queries in single-language corpora; the specifics of searching the parallel corpus InterCorp are described in detail in the bonus tutorial of the basic course on working with the CNC (Czech National Corpus).

After clicking on New query, the basic search form appears. In this form you can select the corpus in which the search will be conducted and enter a query in the input line below it. A switch activates the Advanced query mode, which works with the Corpus Query Language (CQL). The form also includes an interactive international keyboard for entering special characters (useful especially when searching in non-Czech texts and when inserting the special characters of CQL). Previously entered queries can be found either directly in the menu or via the Recent queries link above the query line. The last item in the toolbar above the line is Query interpretation, which shows how the query will be evaluated (i.e. translated into CQL), so that users can check whether this interpretation matches their intention. This function is not available in advanced mode; instead, advanced mode makes it possible to insert interactively generated morphological tags (for corpora that are tagged in this way) or within conditions directly into the query (see the items Insert tag and Insert within).

## Corpus selection

Selecting a corpus suited to the research question is an important decision that must be made before any other step in the research. The range of corpora made available by the Czech National Corpus project is constantly growing, as the list of corpora shows, and corpus selection in the KonText interface had to be adjusted to accommodate this. Until the autumn of 2015, corpus selection took the form of a hierarchically organized tree. This system had several disadvantages, ranging from the sometimes ambiguous placement of a corpus within the hierarchy to the rapid growth in the number of corpora and their versions. The hierarchy thus became unclear and would not have been sustainable in the long run, which is why we switched to a new, label-based system. Its primary aim is to make it easier to find one's way among the large number of existing corpora, while also simplifying work for users who use only a small number of favorite corpora.

After clicking on the name of the corpus (in the default setting it is always the most recent representative corpus of synchronic written Czech, currently SYN2010) a frame for the selection of a corpus appears, containing two main sections:

Corpus selection: featured and favorite corpora
1. My list with a quick, single click selection of corpora. This quick selection contains favourite corpora, which can be selected by the user, and also the so-called featured corpora: a default list of several corpora, which the CNC considers to be particularly important in the individual areas of production. Having them all in one place simplifies the selection of a corpus especially for beginning users of the CNC. Favourite corpora can be selected either on the page with all the available corpora, or when working with them at the time of query input (such corpora are labelled with a yellow star).
2. All corpora, where all available corpora can be searched with the aid of so-called labels, which characterize the corpora (a typical corpus has several labels, e.g. SYN2010: written, synchronic, Czech, SYN family, representative). For example, if you are looking for a web corpus of Czech, it is enough to select the labels “Czech” + “web”, and all the relevant corpora made available by the CNC will appear. The search can be refined further by typing part of the corpus name or its description into the search bar; the resulting list of corpora is filtered interactively based on the keywords. Note, however, that for reasons of space the list shows only the first 25 items; if the list is too long, the selection must be narrowed further by adding another label or by searching for part of the corpus name.

Example: A user searches in the All corpora tab for the current version of the English section of the parallel corpus InterCorp. They first select the labels “InterCorp” and “current version”. The list then shows the first 25 corpora that meet these conditions, although InterCorp contains many more languages. The corpora not displayed can be reached with further filtering, for example by typing part of the corpus name or language (please note that the names of the individual InterCorp versions are in English). After the desired corpus is found and clicked on, it becomes the current corpus for searching, and it can at the same time be marked as a favourite with a star. The corpus is then added to the list of favourite corpora and can be accessed quickly and easily with a single click.
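The label-based selection described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not KonText's implementation: the `Corpus` class, the `filter_corpora` function and the "InterCorp demo" entries are hypothetical. Only the 25-item display limit and the SYN2010 labels come from the text.

```python
from dataclasses import dataclass

MAX_SHOWN = 25  # the list shows only the first 25 matching corpora


@dataclass(frozen=True)
class Corpus:
    """Hypothetical catalogue entry: a name, a description and a set of labels."""
    name: str
    description: str
    labels: frozenset


def filter_corpora(corpora, required_labels=(), keyword=""):
    """Return (shown, total): the first MAX_SHOWN matches and the total match count.

    A corpus matches when it carries every required label and the keyword
    occurs (case-insensitively) in its name or description.
    """
    required = set(required_labels)
    needle = keyword.lower()
    matches = [
        c for c in corpora
        if required <= c.labels
        and (needle in c.name.lower() or needle in c.description.lower())
    ]
    return matches[:MAX_SHOWN], len(matches)


# Made-up catalogue: SYN2010 with the labels quoted in the text,
# plus 40 invented entries standing in for InterCorp languages.
catalogue = [
    Corpus("SYN2010", "representative corpus of written Czech",
           frozenset({"written", "synchronic", "Czech", "SYN family", "representative"})),
]
catalogue += [
    Corpus(f"InterCorp demo - lang{i:02d}", f"parallel corpus, language {i:02d}",
           frozenset({"InterCorp", "current version"}))
    for i in range(40)
]

# Labels alone match 40 corpora, but only 25 are displayed.
shown, total = filter_corpora(catalogue, {"InterCorp", "current version"})

# Typing part of the name narrows the list to the wanted corpus.
narrowed, _ = filter_corpora(catalogue, {"InterCorp", "current version"}, keyword="lang07")
```

With labels alone, 15 of the matching entries stay hidden, which is why the example above adds a keyword to reach the corpus that was not displayed.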

## Query types

There are two query types in the current version of KonText: simple and advanced.

Previous versions of KonText worked with six query types: basic, lemma, phrase, word form, character, and CQL. The current simple query covers the first five query types, as their functionality can be achieved by altering the simple query settings, e.g. the default attribute and/or regular expressions (see below). The current advanced query fully corresponds to the CQL query type.

The default setting is the simple query with case-insensitive matching (the Match case switch is off), without regular expressions (the Allow regular expressions switch is off) and with the default attribute set to lemma|word (lemma|sublemma|word in SYN2020). The latter setting denotes searching not only for the input word form (given by the word attribute), but also other word forms subsumed under lemma or sublemma, provided the input word can also be interpreted as a lemma or sublemma (remark: this is exactly the behaviour of the basic query of the previous versions of KonText, that has only been extended to sublemma). Apart from the individual words, it is also possible to input multi-word phrases. The search can be further specified either by using the add-on for suggesting variants (SYN2020 only, see next section), or by changing the default attribute, and/or toggling the case-sensitivity switch. Furthermore, regular expressions can be used in the simple query mode after toggling the Allow regular expressions switch.
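To make the effect of these settings more concrete, the following Python sketch shows how a simple query might be expanded into a CQL expression. It is an assumption-laden illustration, not the translation KonText actually performs (the real one is displayed under Query interpretation). The function names, the set of escaped characters and the use of the `(?i)` prefix for case-insensitive matching are assumptions of this sketch.

```python
# Characters escaped when regular expressions are switched off
# (an assumption about what needs escaping, made for this sketch only).
META = set('\\.^$*+?()[]{}|"')


def escape_literal(text):
    """Backslash-escape every character that would otherwise act as a metacharacter."""
    return "".join("\\" + ch if ch in META else ch for ch in text)


def simple_to_cql(query, attrs=("lemma", "word"), match_case=False, use_regex=False):
    """Expand a simple query into one CQL position per whitespace-separated word.

    Each position offers the same pattern under every default attribute,
    joined with "|", mirroring the lemma|word setting described above.
    """
    flag = "" if match_case else "(?i)"
    positions = []
    for token in query.split():
        pattern = token if use_regex else escape_literal(token)
        alternatives = " | ".join(f'{attr}="{flag}{pattern}"' for attr in attrs)
        positions.append(f"[{alternatives}]")
    return " ".join(positions)


# Default settings: case-insensitive, no regular expressions, lemma|word.
default_query = simple_to_cql("dům")

# SYN2020-style default attribute with the Match case switch on.
syn2020_query = simple_to_cql("filozof", attrs=("lemma", "sublemma", "word"), match_case=True)
```

Under these assumptions, `default_query` is `[lemma="(?i)dům" | word="(?i)dům"]`, and a multi-word phrase simply yields one such position per word.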

Advanced query mode is activated by a switch above the input line. This mode fully corresponds to the CQL query mode of the previous versions of KonText. When a CQL query is entered, KonText automatically checks and highlights the query syntax. If the syntax is not valid, it notifies the user and lets them edit the query. Since CQL is quite complex, the syntax check may occasionally issue a warning even for a valid query.

Once the query has been entered, the search can be started either by clicking on the Search button, or by pressing the Enter key (provided the focus is on the input line).

## Suggesting other variants of the input word

For corpora with two-level lemmatization (currently only the SYN2020 corpus), a special add-on suggests other possible variants of the input word. Its availability is indicated by a blue background of the word and a small question mark next to it. Clicking on it while holding the Ctrl/Command key displays all possible lemma and sublemma variants. Selecting one of the variants changes the (sub)lemma interpretation of the word in the query, which is indicated by a red background of the Query interpretation option above the input line.

For instance, when a user types in the word filozof, the add-on notifies them that this lemma includes two spelling variants, filozof and filosof, as sublemmas. It is then up to the user which variants to include in their query. Similarly, the add-on notifies the user in the case of lemmas that differ only in case, e.g. Procházka (common surname) vs. procházka (a walk).

## Specify parameters

As mentioned above, one can also specify additional parameters that influence the interpretation of a query. Apart from the default positional attribute, there are two more switches available in the simple query mode: case-sensitivity and allowing the use of regular expressions in a query.

## Specify context

Form for searching in context

Every query can be further specified by the context (the surrounding text) in which the searched word or phrase should appear. The context menu used for this specification is located in the bottom section of the query form (it is hidden by default and must be activated by clicking on Specify context).

Searching in context essentially means applying an additional filter to the basic concordance defined by the query in the query form. The user can set the span of the context to which the additional filter condition applies, and specify the lemma(s) or word class(es) that the condition concerns.

In general, any search in context can be rewritten as an ordinary query that is subsequently filtered (with the aid of positive and negative filters). Any such filtering can also be expressed in the query language, which performs the identical operation in a single step. There is usually more than one way to achieve the same result, and it is entirely up to users to decide which option they are most comfortable with.
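The idea that a context search is an ordinary query followed by a positive or negative filter can be sketched in Python on a toy token list. This is a conceptual illustration only: the data, the `(word, lemma)` representation and the function `filter_by_context` are hypothetical and do not reflect how KonText stores or filters concordances.

```python
def filter_by_context(tokens, hits, lemma, left=0, right=0, positive=True):
    """Keep the hits whose context window does (positive) or does not
    (negative) contain the given lemma.

    tokens: list of (word, lemma) pairs
    hits:   indices into tokens where the basic query matched
    left/right: number of positions inspected on each side of the hit
    """
    kept = []
    for i in hits:
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        found = any(tok_lemma == lemma for _, tok_lemma in window)
        if found == positive:
            kept.append(i)
    return kept


# Toy "corpus" of (word, lemma) pairs.
tokens = [
    ("the", "the"), ("old", "old"), ("dog", "dog"), ("chased", "chase"),
    ("the", "the"), ("cat", "cat"), ("and", "and"), ("the", "the"),
    ("young", "young"), ("dog", "dog"), ("slept", "sleep"),
]

# Basic query: every position whose lemma is "dog".
hits = [i for i, (_, tok_lemma) in enumerate(tokens) if tok_lemma == "dog"]

# Positive filter: "old" must occur within two positions to the left.
with_old = filter_by_context(tokens, hits, "old", left=2, positive=True)

# Negative filter: "old" must not occur in the same span.
without_old = filter_by_context(tokens, hits, "old", left=2, positive=False)
```

The positive and negative results partition the original hits, which is why the same outcome can be reached either through the context form or by filtering an existing concordance.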

If we need to search only in a narrowly defined group of texts in the entire corpus, we have two options. Either we create our own virtual subcorpus, which we will then be able to select within the offered corpora, or we can restrict the query with a number of conditions (typically with the command within). As a rule, we choose the first option in situations where we know that we will be needing the subcorpus for a longer duration of time, or when its specification is complex. We use the second option when conducting ad hoc searches within some clearly defined text categories, which are specified with the help of the basic structural attributes.

The New query form makes such ad hoc restrictions easier by means of an additional Restrict search tab, which is located underneath the context search and, like the context specification mentioned above, is activated with a click.

Form for searching in a subcorpus created ad hoc

In this form we can tick the values of the selected structural attributes that interest us. The form does not contain all structural attributes, only those most often used in the given corpus (e.g. for SYN2020 these are txtype_group, txtype, genre and srclang). The abbreviations used are explained in the lists section.

The final column lists the specific works or documents (depending on the selected corpus) that match the specified conditions. If the list would be too long, the column shows only the number of items. After selecting some categories, we can use the refine selection button (bottom left) to view an inventory of the texts that meet the given conditions; the column containing the list of texts is recalculated according to the currently ticked criteria. We can continue in this way until we are satisfied with the delimitation of the data we want to search.

For a more detailed specification it is necessary to either use the condition within inside a CQL query, or to create a new virtual subcorpus.
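As a rough illustration of the first of these options, the Python helper below appends a within condition to an existing CQL query. It is a sketch under assumptions: the function name `with_restriction` is hypothetical, the structure name `doc` and the attribute values are placeholders, and the `within <structure attribute="value" />` form should be checked against the CQL documentation for the corpus at hand.

```python
def with_restriction(cql, structure, **attrs):
    """Append a within condition restricting the query to structures
    whose attributes have the given values.

    Several attribute conditions are joined with "&".
    """
    if not attrs:
        raise ValueError("at least one structural attribute is required")
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"{cql} within <{structure} {conditions} />"


# Placeholder values; real attribute values are listed in the corpus documentation.
restricted = with_restriction('[lemma="dům"]', "doc", txtype="X1", srclang="cs")
```

Here `restricted` is the string `[lemma="dům"] within <doc txtype="X1" & srclang="cs" />`, which could then be pasted into the advanced query line.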

# Recent queries

This item displays an overview of the most recently used queries (a simplified list of previous queries is also accessible directly from the query form, via a link above the input line). The queries can be filtered by query type or by the currently used corpus, and it is also possible to view only archived queries. Clicking on the Edit and search link pastes the previously specified query into the query form, where we may either use it without any changes or modify it further (e.g. change the corpus in which the query will be used, change the query type, or specify the context). Clicking on the Archive option lets us name the query and save it permanently to the query history archive for later reuse.

# Word list

The basic output of any query is a concordance, i.e. a list of all the occurrences (tokens) matching the query, along with their surrounding text. The Word list function evaluates the query so that the result is a list of the distinct words (types) matching the query, together with their absolute frequency, ARF, or the number of documents in which they occur. In this respect, the Word list function is analogous to a frequency distribution; its advantage, however, is speed and low computational cost, because the Word list does not require the extra step of building a concordance.

Form for creating word lists

Various search parameters can be set in the form:

• corpus (or its subcorpus), in which the word list will be created
• attribute (positional or structural), which is to be included in the list
• RE pattern (regular expression), to which the resulting words must correspond (if it is not submitted, the list will contain all items in the corpus if they fulfill the other specifications in the form)
• minimum frequency
• whitelist – a list of pre-selected words (in a separate file) which we want to see in the resulting list
• blacklist – a list of pre-selected words (in a separate file) which we want to exclude from the resulting list
• option “Include non-words”, which widens the search to words which are not composed only of alphabetic characters

Among the output option settings we can find a selection of either the absolute frequency, ARF or a document count. Furthermore there is also the possibility to choose a specific output attribute (or attributes). These attributes need not be identical to the positional attribute selected in the top section of the form, on which all the above mentioned filters are applied. This enables us to create e.g. a frequency list of all verbs by selecting the attribute tag in the top section, applying the condition for a verb as in V.* and finally by “switching” the output type to lemma – an example of such a query is shown in the picture.
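The separation between the attribute that is filtered and the attribute that is output can be sketched in Python on a handful of toy tokens. This is an illustration only: the function `word_list`, the token dictionaries and the shortened tag values are hypothetical, and the sketch counts absolute frequency only (not ARF or document counts).

```python
import re
from collections import Counter


def word_list(tokens, filter_attr, pattern, output_attr, min_freq=1):
    """Count values of output_attr over the tokens whose filter_attr
    fully matches the regular expression, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter(
        tok[output_attr] for tok in tokens if regex.fullmatch(tok[filter_attr])
    )
    return [(value, n) for value, n in counts.most_common() if n >= min_freq]


# Toy tokens; the tags are shortened, made-up stand-ins in which
# verbs start with "V" and nouns with "N".
tokens = [
    {"word": "psal",  "lemma": "psát",  "tag": "VpYS"},
    {"word": "dopis", "lemma": "dopis", "tag": "NNIS"},
    {"word": "píše",  "lemma": "psát",  "tag": "VB-S"},
    {"word": "čte",   "lemma": "číst",  "tag": "VB-S"},
    {"word": "knihu", "lemma": "kniha", "tag": "NNFS"},
]

# Filter on tag with V.*, but output lemmas: a frequency list of verb lemmas.
verb_lemmas = word_list(tokens, "tag", "V.*", "lemma")

# The minimum frequency setting removes rare items from the list.
frequent_verbs = word_list(tokens, "tag", "V.*", "lemma", min_freq=2)
```

The filter never looks at the lemma and the output never shows the tag, which is the same division of roles as between the top section of the form and the output options.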