Form for creating a query

Selecting Query → New query starts a new corpus search at any time. Clicking this option discards the previous query and any results produced with it. The following text deals primarily with creating queries in single-language corpora; the specifics of searching the parallel corpus InterCorp are described in detail in the bonus tutorial of the basic course on working with the CNC (Czech National Corpus).

After clicking on the item New Query, the basic search form appears. In this form it is possible to select the corpus in which the search will be conducted and the type of query to be used. The query itself is typed into the input line. The form also includes an interactive international keyboard for writing special characters (useful especially when searching in non-Czech texts and when inserting special characters of the Corpus Query Language, CQL). Previously used queries can be found under the link Previous Queries, located above the query line.

## Corpus selection

Selecting a corpus suited to the research question is an important decision that must be made before taking any other step in the research. The range of corpora made available through the Czech National Corpus project is constantly widening, as can be seen on the list of corpora, so the corpus selection in the KonText interface had to be adjusted to accommodate their growing number. Until the autumn of 2015, the corpus selection took the form of a hierarchically organized tree. This system had several disadvantages: the placement of a given corpus within the hierarchy was not always unambiguous, and the number of new corpora and their versions kept growing rapidly. The hierarchical organization therefore stopped being clear and would not have been sustainable, which is why we have switched to the new, label-based system. Its aim is primarily to facilitate orientation in the large number of existing corpora, while also simplifying work for users who use only a small number of favorite corpora.

After clicking on the name of the corpus (in the default setting it is always the most recent representative corpus of synchronic written Czech, currently SYN2010) a frame for the selection of a corpus appears, containing two main sections:

Corpus selection: featured and favorite corpora
1. My list with a quick, single click selection of corpora. This quick selection contains favourite corpora, which can be selected by the user, and also the so-called featured corpora: a default list of several corpora, which the CNC considers to be particularly important in the individual areas of production. Having them all in one place simplifies the selection of a corpus especially for beginning users of the CNC. Favourite corpora can be selected either on the page with all the available corpora, or when working with them at the time of query input (such corpora are labelled with a yellow star).
2. All corpora with the possibility of searching all available corpora with the aid of so-called labels, which are used to characterize the corpora (a typical corpus has several labels, e.g. SYN2010: written, synchronic, Czech, SYN family, representative). For example, if you are looking for a web corpus of Czech, all you have to do is select the labels “Czech” + “web”, and all the relevant corpora made available by the CNC will appear. The search may be further refined by typing part of the corpus name or its description into the search bar; the resulting list of corpora is filtered interactively based on the keywords. Note, however, that for reasons of space the list shows only the first 25 items; if the list is too long, the search must be narrowed down by adding another label or by typing part of the corpus name.

Example: The user searches in the tab All corpora for the current version of the English section of the parallel corpus InterCorp. The user first selects the labels “InterCorp” and “current version”. The first 25 corpora which meet the specified conditions appear on the list, although InterCorp contains many more languages. The corpora not displayed can be reached with further filtering, for example by typing part of the corpus name or language (please note that the names of the individual InterCorp versions are in English). After finding the desired corpus and clicking on it, the corpus becomes the current corpus for searching, and it can at the same time be marked as favourite with a star. The corpus is then added to the list of favourite corpora and can be accessed quickly and easily with a single click.
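The label-based selection described above can be pictured as a simple filter. The following is only a minimal sketch, not the actual KonText implementation: the `Corpus` class, the `filter_corpora` function and the catalogue entries are all made up for illustration, and only the limit of 25 displayed items comes from the text.

```python
from dataclasses import dataclass, field

MAX_SHOWN = 25  # the list displays only the first 25 matching corpora


@dataclass
class Corpus:
    # Hypothetical record; real corpus metadata is richer than this.
    name: str
    description: str = ""
    labels: frozenset = field(default_factory=frozenset)


def filter_corpora(corpora, required_labels=(), keyword=""):
    """Return (shown, total_matches).

    A corpus matches when it carries every required label and the keyword
    occurs in its name or description (case-insensitively).
    """
    required = set(required_labels)
    needle = keyword.lower()
    matches = [
        c for c in corpora
        if required <= c.labels
        and (needle in c.name.lower() or needle in c.description.lower())
    ]
    return matches[:MAX_SHOWN], len(matches)


# Made-up catalogue: 30 placeholder InterCorp sections, an English one, SYN2010.
intercorp_labels = frozenset({"InterCorp", "current version"})
catalogue = [
    Corpus(f"InterCorp lang{i:02d}", "parallel corpus", intercorp_labels)
    for i in range(30)
]
catalogue.append(
    Corpus("InterCorp English", "parallel corpus, English section", intercorp_labels)
)
catalogue.append(
    Corpus("SYN2010", "representative corpus of written Czech",
           frozenset({"written", "synchronic", "Czech", "SYN family", "representative"}))
)

shown, total = filter_corpora(catalogue, {"InterCorp", "current version"})
print(len(shown), "of", total, "matches shown")

shown, total = filter_corpora(catalogue, {"InterCorp", "current version"}, "english")
print([c.name for c in shown])
```

With labels alone, more corpora match than can be displayed, so the English section may be cut off; adding the keyword narrows the list to the one corpus the user is looking for.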

## Type of Query

Query types and their uses

| Query type | What it’s for | How it works | What it does | Examples |
|---|---|---|---|---|
| Basic Query | for familiarization with the corpus | Searches for the expression as a node form regardless of case; in the case of a dictionary form (lemma), all of its possible forms are also searched for. | without regular expressions (RE), case-insensitive | old house > old house, older houses, oldest house…; this time > this time |
| Lemma | for the analysis of an entire paradigm/lexeme | Finds all forms associated with the given lemma. | RE (it is possible to use regular expressions), case-sensitive, possibility of specifying word class | see > see, saw, seen, seeing…; new > new, newer, newest… |
| Phrase | for a multiword combination in the given form | Finds the exact wording of a phrase. | RE, case-sensitive | black dog > black dog; new car > new car; newer car > newer car |
| Node form | for the analysis of one specific form | Finds the exact form. | RE, case-in/sensitive (possible to select Match case) | cat > cat; cats > cats; cat.* > cat, cats, Cats, CATS… |
| Character | searching for a string of characters anywhere within a word | Finds consecutive characters within the scope of one word. | RE, case-sensitive | pre > prepare, supreme, appreciate…; str > industrial, strong, orchestra… |
| CQL | searching for anything that can be found with the corpus manager | CQL is Corpus Query Language (into which the KonText interface internally converts all of the previous types of queries). | RE, case-sensitive, CQL syntax | [lemma="see"] > see, saw, seen, seeing…; [word="black"] > black; [lemma="read"][tag="N.*"] > reading books, read something… |
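Since the simpler query types are internally converted into CQL, it can help to see roughly what such a conversion might produce. The sketch below is an assumption-laden illustration, not the conversion KonText actually performs: the function `to_cql` and its type names are invented here, the `(?i)` prefix for case-insensitive matching is assumed to be accepted by the corpus manager, and the Basic query type is left out because it combines word-form and lemma matching.

```python
def to_cql(query_type, query, match_case=True):
    """Build a CQL string for a simple query (illustrative mapping only)."""
    # Assumed convention: a leading (?i) makes the regular expression
    # case-insensitive.
    flag = "" if match_case else "(?i)"
    if query_type == "lemma":
        return f'[lemma="{query}"]'
    if query_type == "word":
        return f'[word="{flag}{query}"]'
    if query_type == "phrase":
        # One token condition per word of the phrase, in the given order.
        return " ".join(f'[word="{w}"]' for w in query.split())
    if query_type == "char":
        # The string may occur anywhere inside a single word.
        return f'[word=".*{query}.*"]'
    if query_type == "cql":
        return query  # already CQL, passed through unchanged
    raise ValueError(f"unknown query type: {query_type}")


print(to_cql("lemma", "see"))                     # [lemma="see"]
print(to_cql("phrase", "black dog"))              # [word="black"] [word="dog"]
print(to_cql("word", "cat.*", match_case=False))  # [word="(?i)cat.*"]
print(to_cql("char", "pre"))                      # [word=".*pre.*"]
```

Reading the generated strings next to the table shows why every simple query type can also be typed directly as CQL.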

Corpus selection and query type can influence what the form looks like:

1. Corpora which are not lemmatized do not offer lemma as a query type.
2. Some query types (only those where it makes sense) allow for the user to specify whether the query should be assessed with respect to capitalization (case-sensitive), or without considering upper/lower case (case-insensitive).
3. In the case of the query types lemma and node form (attribute word) it is also possible to specify word class (positional attribute pos).
4. The CQL query type also allows for the insertion of interactively generated morphological tags (with corpora which are tagged in this way) or conditions specifying texts in which the search is to be carried out (the condition within).
5. Parallel corpora have their own, very specific way of inputting queries.

Once the query has been typed, we may begin the search either by clicking on the Search button, or by pressing the Enter key, if the cursor is in the input line.

Every query can be specified further with respect to the context in which the search term is found and the documents in which we want to search.

## Specify context

Form for searching in context

Every query can be further specified based on the context (the surrounding text) in which the search word or phrase can appear. The context menu, which is used for the specification of a query, can be found in the bottom section of the query form (it is hidden in the basic settings and it is necessary to activate it by clicking on Specify context).

Searching the context essentially means additionally filtering the basic concordance which is specified by the query in the main part of the form. The user can set the span of the context to which the additional filter condition will be applied, the query type, or word class.

In general, any search in context can be rewritten as an ordinary query which is then filtered (with the aid of positive and negative filters). Any such filtering can also be expressed directly in the query language, which performs the identical operation in a single step. There is always more than one way to achieve the same result, and it is entirely up to users to decide which option they are most comfortable with.
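The equivalence between a context search and a filtered concordance can be illustrated on a toy example. This is only a sketch over a plain list of tokens; the function names are invented, and a real corpus manager works with indexed positional attributes rather than Python lists.

```python
def concordance(tokens, node):
    """Positions of all tokens equal to the node word (case-insensitive)."""
    return [i for i, t in enumerate(tokens) if t.lower() == node.lower()]


def context_window(tokens, pos, left, right):
    """Tokens up to `left` positions before and `right` after the node."""
    start = max(0, pos - left)
    return tokens[start:pos] + tokens[pos + 1:pos + 1 + right]


def filter_hits(tokens, hits, word, left=5, right=5, positive=True):
    """Positive filter keeps hits with `word` in the window; negative drops them."""
    kept = []
    for pos in hits:
        window = context_window(tokens, pos, left, right)
        present = word.lower() in (t.lower() for t in window)
        if present == positive:
            kept.append(pos)
    return kept


tokens = "the black dog chased the cat while the old dog slept".split()
hits = concordance(tokens, "dog")

# "dog" with "black" within two positions on either side
print(filter_hits(tokens, hits, "black", left=2, right=2))
# "dog" without "black" in the same span
print(filter_hits(tokens, hits, "black", left=2, right=2, positive=False))
```

The first call corresponds to a query for dog with the context condition "black within two positions"; the second corresponds to the negative filter over the same span.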

## Specify query according to the meta-information

If we need to search only in a narrowly defined group of texts within the corpus, we have two options. Either we create our own virtual subcorpus, which we can then select among the offered corpora, or we restrict the query with a number of conditions (typically with the command within). As a rule, we choose the first option when we know that we will need the subcorpus for a longer time, or when its specification is complex. We use the second option when conducting ad hoc searches within clearly defined text categories, which are specified with the help of the basic structural attributes.

The New Query form simplifies this with an additional form, Specify query according to the meta-information, which is located underneath the context search and is activated with a click, like the context specification mentioned above.

Form for searching in a subcorpus created ad hoc

Within this form it is possible to mark off the values of selected structural attributes that interest us. The form does not contain all structural attributes, but only those most often used in the given corpus (e.g. when searching in the SYN2010 it is txtype_group, txtype, genre, med, srclang). The abbreviations used can be found in the lists section.

In the final column we can find a list of the specific works or documents (depending on the selected corpus) which correspond to the specified condition. If the list would be too long, the column contains only the number of items. Once we have selected some categories from the menu, the button refine selection (bottom left) displays an inventory of the texts which meet the given conditions. The column containing the list of texts is recalculated according to the currently marked criteria. We can continue in this way until we are satisfied with the delimitation of the data that we want to use for our search.

For a more detailed specification it is necessary to either use the condition within inside a CQL query, or to create a new virtual subcorpus.
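A restriction with within can be assembled from the selected attribute values roughly as follows. This is a hedged sketch: the helper `restrict` is hypothetical, the structure name `doc` and the way conditions are combined with `&` are assumptions about the CQL dialect, and the attribute values `NOV`, `COL` and `cs` are placeholders that should be checked against the lists of values for the corpus in question.

```python
def restrict(query, structure, conditions):
    """Append a `within` restriction to a CQL query.

    `conditions` maps a structural attribute to the list of accepted values;
    several values of one attribute are joined into a regular-expression
    alternation, and different attributes are combined with `&`.
    Values containing regex metacharacters would need escaping first.
    """
    parts = []
    for attr, values in conditions.items():
        alternation = "|".join(values)
        parts.append(f'{attr}="{alternation}"')
    joined = " & ".join(parts)
    return f"{query} within <{structure} {joined} />"


# Placeholder values, for illustration only.
print(restrict('[lemma="see"]', "doc", {"txtype": ["NOV"], "srclang": ["cs"]}))
print(restrict('[lemma="see"]', "doc", {"txtype": ["NOV", "COL"]}))
```

Under these assumptions the first call yields `[lemma="see"] within <doc txtype="NOV" & srclang="cs" />`, the kind of condition that would otherwise be typed by hand in a CQL query.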

# Recent queries

This item displays an overview of the most recently used queries (a simplified list of previous queries is also accessible directly from the query form, via a link above the input line). The queries can be filtered according to the query type or the currently used corpus, and it is also possible to view archived queries only. Clicking on the link Edit and search pastes the previously specified query into the query form, where we may either use it without any changes or modify it further (e.g. change the corpus in which the query will be used, change the query type, or specify the context). By clicking on the Archive option, we can name the query and permanently save it to the query history archive for later reuse.

# Word list

The basic output of any query is a concordance, i.e. a list of all the occurrences (tokens) matching the query, along with their surrounding text. The Word list function evaluates the query so that the result is a list of the distinct words (types) matching the query, together with their absolute frequency, ARF (average reduced frequency), or the number of documents in which the phenomenon occurs. In this respect the Word list function is analogous to a frequency distribution, but it is faster and computationally cheaper, because it skips the intermediate step of building a concordance.

Form for creating word lists

Various search parameters can be set in the form:

• corpus (or its subcorpus), in which the word list will be created
• attribute (positional or structural), which is to be included in the list
• RE pattern (regular expression), to which the resulting words must correspond (if it is not submitted, the list will contain all items in the corpus if they fulfill the other specifications in the form)
• minimum frequency
• whitelist – a list of pre-selected words (in a separate file) which we want to see in the resulting list
• blacklist – a list of pre-selected words (in a separate file) which we want to exclude from the resulting list
• option “Include non-words”, which widens the search to words which are not composed only of alphabetic characters
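The way these parameters interact can be sketched in a few lines. The function below is an illustration only, not the corpus manager's implementation: its name and arguments are invented, it counts absolute frequency over a plain list of attribute values, and it treats a "non-word" as any value not made up solely of alphabetic characters.

```python
import re
from collections import Counter


def word_list(values, pattern=None, min_freq=1,
              whitelist=None, blacklist=None, include_nonwords=False):
    """Frequency list of attribute values that pass all the filters."""
    regex = re.compile(pattern) if pattern else None
    counts = Counter()
    for value in values:
        if not include_nonwords and not value.isalpha():
            continue  # skip numbers, punctuation and the like
        if regex and not regex.fullmatch(value):
            continue  # the whole value must match the RE pattern
        if whitelist is not None and value not in whitelist:
            continue
        if blacklist is not None and value in blacklist:
            continue
        counts[value] += 1
    rows = [(v, f) for v, f in counts.items() if f >= min_freq]
    # Highest frequency first; ties in alphabetical order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


tokens = "cat cats cat dog cat dog 42 cats".split()
print(word_list(tokens, pattern="cat.*", min_freq=2))
print(word_list(tokens, blacklist={"cat"}, include_nonwords=True))
```

Leaving the pattern out returns every value that passes the remaining filters, which mirrors the behaviour described for the RE pattern field.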

Among the output option settings we can find a selection of either the absolute frequency, ARF or a document count. Furthermore there is also the possibility to choose a specific output attribute (or attributes). These attributes need not be identical to the positional attribute selected in the top section of the form, on which all the above mentioned filters are applied. This enables us to create e.g. a frequency list of all verbs by selecting the attribute tag in the top section, applying the condition for a verb as in V.* and finally by “switching” the output type to lemma – an example of such a query is shown in the picture.
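The separation between the attribute that is filtered and the attribute that is output can be shown on a small tagged sample. This is a minimal sketch with invented names and a toy tagset in which verb tags merely start with V; the tags in a real corpus follow that corpus's own tagset.

```python
import re
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def word_list_by(tokens, filter_attr, pattern, output_attr):
    """Filter tokens on one attribute, count the values of another."""
    regex = re.compile(pattern)
    counts = Counter(
        getattr(tok, output_attr)
        for tok in tokens
        if regex.fullmatch(getattr(tok, filter_attr))
    )
    # Highest frequency first; ties in alphabetical order.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


# Toy sample; the tags are placeholders, not a real tagset.
sample = [
    Token("saw", "see", "VBD"),
    Token("sees", "see", "VBZ"),
    Token("dog", "dog", "NN"),
    Token("ran", "run", "VBD"),
    Token("seen", "see", "VBN"),
]

# Frequency list of verb lemmas: filter on tag, output lemma.
print(word_list_by(sample, "tag", "V.*", "lemma"))
```

Switching `output_attr` to `"word"` would list the individual verb forms instead of their lemmas, while the filter on the tag stays the same.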