
InterCorp Release 13ud – Universal Dependencies

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	141,032,521	116,673,043	394,042,551	1,550,071,364
Positions	Number of word forms	113,838,505	89,819,773	327,968,369	1,223,270,610
Structural attributes	Number of documents	1,657	30	3,994	282
	Number of texts	1,657	111,951	3,994	1,843,528
	Number of sentences	9,782,002	13,606,198	24,318,736	143,196,252
Further information	reference	YES
	representative	NO
	publication date	2021
	foreign languages	40
	tagged languages	35
	lemmatized languages	35
	syntactically annotated languages	35

InterCorp release 13ud contains the same texts as InterCorp release 13; the two versions differ only in their linguistic annotation. However, because the texts are tokenized differently, token counts in release 13ud can differ slightly.

Main differences between releases 13 and 13ud

In release 13ud, out of the total number of 41 languages (including Czech), 36 are linguistically annotated; in addition, all such languages are syntactically annotated.
Texts are annotated in the same way in all languages, according to the UD (Universal Dependencies) standard.
General guidelines for annotation are provided on the UD project website (UD Guidelines), including a detailed description of:
- word types (Universal POS tags)
- morphological categories (Universal features)
- syntactic functions (Universal Dependency Relations)
Annotation was performed for all languages by UDPipe, based on the data created in the UD project.¹⁾
For use in KonText, fused forms or aggregates, i.e. word forms composed of two or even three syntactic words, were modified as divided tokens. In Czech this concerns, for example, the forms ses (se+jsi) or oč (o+co), in English isn't or cannot, in German zur (zu+der) or am (an+dem), in Polish miałam (miała+m), żebyś (że+by+ś) or chciałbym (chciał+by+m), in French des (de+les), aux (à+les) or auquel (à+lequel).²⁾
Some attributes were added to facilitate orientation in the syntactic structure. These data include references to important properties of the syntactic governor (lemma, part of speech and morphological categories). If a content word occurs with a function word (e.g. preposition, auxiliary verb, subordinating conjunction, determiner), the content word includes some properties of the function word.
Frequently used morphological categories from the features list (feats) have been promoted to the status of regular attributes. This applies, for example, to morphological case, number and gender (case, number, gender), or person.
Annotations between languages differ only in the number of attributes, see List of attributes by language, described below in Description of the list of Attributes.
KonText supports queries by word class and other morphological categories using the Insert tag function, which inserts a UD POS (upos) and any category from the feats list into the query. The Insert tag feature is available for all linguistically annotated languages.

UD and KonText

Corpus Search

Basic query

A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.³⁾

Query for a lemma and a morphological tag

As in previous releases of InterCorp, a lemma and a morphological tag can be entered in an advanced query. For most linguistically annotated languages (except be, da, en, fr, hu, no and ru) it is possible to enter a tag from a language-specific set (national tagset), usually identical to the set used in the previous releases of InterCorp for that language. Just use the xpos attribute instead of the tag attribute. E.g. the query on feminine nouns in the vocative singular in Czech can be entered as follows: [xpos = "NNFS5.*"].
According to UD, part of speech and morphological categories are listed separately as values of the attributes upos and feats, respectively. Their values can be entered using the Insert tag function. A query for proper names without using the Insert tag function can be specified as follows: [upos = "PROPN"].

upos	gloss
ADJ	adjective
ADP	adposition (incl. preposition)
ADV	adverb
AUX	auxiliary verb
CCONJ	coordinating conjunction
DET	determiner
INTJ	interjection
NOUN	noun
NUM	numeral
PART	particle
PRON	pronoun
PROPN	proper noun
PUNCT	punctuation
SCONJ	subordinating conjunction
SYM	symbol
VERB	verb
X	other

Morphological categories are given as a pair <category name> = <category value>. These pairs are listed as list items under the feats attribute and are separated by the | character. E.g. for the Czech noun moře 'sea' in the nominative, the morphological categories as the value of the feats attribute are listed as follows: Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos. The form moře annotated in this way can be found, for example, by the query [upos="NOUN" & feats="Number=Sing"].
Some categories are also available outside the feats list, so the same query can be entered more easily: [upos="NOUN" & number="Sing"]. For technical reasons, category names outside the feats list are given in lowercase, including, for example, verb_form instead of VerbForm.
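The structure of a feats value and the lowercase naming of the separate attributes can be illustrated with a minimal Python sketch. The helper names parse_feats and attribute_name are hypothetical, and the naming rule (an underscore before each inner capital, then lowercase) is an assumption generalized from the single documented case verb_form; it is not a description of how the corpus was actually built.

```python
import re


def parse_feats(feats):
    """Split a UD feats value such as 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":  # "_" marks an empty value in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def attribute_name(category):
    """Assumed naming rule: VerbForm -> verb_form, Number -> number."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", category).lower()


feats = parse_feats("Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos")
print(feats["Number"])             # Sing
print(attribute_name("VerbForm"))  # verb_form
print(attribute_name("Number"))    # number
```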
The same or comparable morphological categories and their values have the same name in all languages:

category	gloss	example values
Abbr	abbreviation	Yes
Animacy	animacy	Anim, Inan, Hum, Nhum
Aspect	aspect	Imp, Perf, Hab, Iter, Prog, Prosp
Case	case	Nom, Gen, Dat, Acc, Voc, Loc, Ins, …
Definite	definiteness	Ind, Def, …
Degree	degree	Pos, Cmp, Sup, Equ, Abs
Foreign	foreign word	Yes
Gender	gender	Fem, Masc, Neut, Com
Mood	mood	Ind, Imp, Cnd, …
NumType	numeral type	Card, Ord, Mult, Frac, Sets, …
Number	number	Sing, Plur, Dual, Ptan, Coll, …
Person	person	1, 2, 3, …
Polarity	polarity	Neg, Pos
Polite	politeness	Infm, Form, Elev, Humb
Poss	possessiveness	Yes
PronType	type of pronoun etc.	Prs, Rcp, Art, Int, Rel, Exc, Dem, Emp, Tot, Ind
Reflex	reflexiveness	Yes
Tense	tense	Pres, Past, Fut, Pqp, Imp
Typo	typo	Yes
VerbForm	verb form	Fin, Inf, Part, Conv, Ger, Vnoun, Sup
Voice	voice	Act, Pass, Mid, Cau, …

Query for a part of speech and morphological categories using the menu

When entering an advanced query, you can use the Insert tag function, which lets you select the POS and/or the values of the relevant categories (properties) from the feats list in all linguistically annotated languages. The properties offered for a given POS are determined by their actual occurrence in the corpus, so the list may include incorrect combinations that result from annotation errors.

Query for a syntactic function

Syntactic function is specified for each token as the value of the deprel attribute.
E.g. a query to show the occurrences of the verb běhat 'run' as the head of an adnominal clause is entered as [lemma="běhat" & deprel="acl"].
The table below distinguishes four types of syntactic functions by different typeface:
- Common deprels are listed in bold.
- Deprels of function words are listed in bold italics.
- Deprels for representing coordination and similar phenomena in the dependency structure or for a technical purpose are set in italics.
- Deprels not used in Czech are listed in gray.

deprel	gloss	example⁴⁾
acl	adnominal clause	muž, o kterém jsme mluvil
advcl	adverbial clause	Spěchal, aby přišel včas.
advmod	adverbial modifier	geneticky upravené potraviny
amod	adjectival modifier	Václav si vzal třímilionovou půjčku.
appos	apposition	Přijel Michal, můj bratr a Davidův bratranec.
aux	auxiliary verb	Mohli byste přijet už příští týden?
case	case marking (incl. preposition)	Bydlím na samotě.
cc	coordinating conjunction	Je to mladý a nadějný chlapík.
ccomp	clausal complement	Ještě včera hlásili, že pršet nebude.
clf	classifier	三个学生 sān gè xuéshēng
compound	compound	Bude to stát padesát pět tisíc korun.
conj	non-initial conjunct	Teta včera večer přijela, přespala a ráno zase odjela.
cop	copula	Lenka je v kondici.
csubj	clausal subject	Obžalovanému přitížilo, že neměl alibi.
dep	unspecified dependency	My dad doesn't really not that good.
det	determiner	Která kniha se vám líbí nejvíc?
discourse	discourse element	čemu že se to zpronevěřily
dislocated	dislocated elements	Dumplings I like.
expl	expletive	There is a ghost in the room.
fixed	non-initial parts of fixed multiword unit	ve srovnání například s úvěry
flat	non-initial parts of flat multiword unit	Nejlépe to vyjádřil papež Jan Pavel II.
goeswith	non-initial parts of incorrectly split form	Zastavil se a z těžka oddychoval.
iobj	indirect object	Vysvětlila studentům svůj plán.
list	non-initial parts of list	Steve Jones tel.: 555-9814 e-mail: jones@abc.edf
mark	marker (subordinating conjunction)	Nevěděli jsme, že babička není doma.
nmod	nominal modifier	kancelář ředitele
nsubj	nominal subject	Auto je červené.
nsubj:pass		Vypnutí vysílačky se trestá.
nummod	numeric modifier	Jedno kotě spalo.
nummod:gov		Pět mužů hrálo karty.
obj	object	Cením si vaší pomoci.
obl	oblique nominal	Potkal jsem ho minulý čtvrtek.
orphan	orphan after elided head	Pavel si objednal špenát a Markéta brokolici.
parataxis	parataxis (incl. parentheticals)	„Ten člověk,“ řekl Honza, „odjel brzy ráno.“
punct	punctuation	Máte všecko?
reparandum	overridden disfluency	Jděte dopra- doleva.
root	root	Miluju anglickou kuchyni.
vocative	vocative	Honzo, pojď mi pomoct!
xcomp	open clausal complement	Doktorka mi doporučila denně cvičit.

Query results

Formatted text

After clicking on the keyword and Formatted text in the context box header, a concordance will appear along with the nearest context in a form that is close to the typography of the original text. For example, there are no spaces between the end of a word and punctuation, and paragraphs are separated by a blank line.

Syntactic structure display

After clicking on the syntax tree icon at the beginning of each concordance line, the syntactic structure of the sentence is displayed. For each node, the word form, POS and syntactic function of the word relative to the given token are given. After clicking on the node, other annotation will appear, especially the lemma of the form.
Multi-part tokens (aggregates) are divided into multiple nodes and the word form then corresponds to the relevant part of the token (the iword attribute). After clicking on such a node, in addition to the lemma of the given part of the multi-word token, its full form (as a separate word, the sword attribute) and the word form of the entire token (word) also appear.
Hovering the cursor over a word in the text line above the structure, or over a node in the structure, highlights the corresponding string and node in parallel.

Examples of queries

The queries assume the Czech subcorpus, except when stated otherwise.

[case_lemma="o" & case="Acc"]

- Finds accusative nominals with the preposition o. The governing verbs can be listed using frequency distribution according to the attribute p_lemma.

[deprel="obj" & case="Dat" | deprel="conj" & p_deprel="obj" & case="Dat"]

- Finds dative objects, even non-initial conjuncts.

[deprel="nsubj" & upos="PROPN" | deprel="conj" & p_deprel="nsubj" & upos="PROPN"]

- Finds proper nouns as subjects, even non-initial conjuncts.

[upos="NOUN" & case="Ins" & deprel="obj" & p_feats="VerbForm=Inf"]

- Finds nouns in the instrumental case as objects of an infinitive. The infinitives can be listed using frequency distribution according to the attribute p_lemma.

[feats="Gender=Neut" & feats="Number=Sing" & feats="Tense=Past" & feats="VerbForm=Part" & upos="VERB" & aux_feats="Person=1"]

- Finds l-participles in neuter singular used with an auxiliary verb in the first person. The query for the participle was entered using the function Insert tag. The same result is obtained by the following query, which uses categorial attributes outside the feats list:

[gender="Neut" & number="Sing" & tense="Past" & verb_form="Part" & upos="VERB" & aux_feats="Person=1"]

1:[lemma="vidět|slyšet"] []* 2:[case="Acc" & deprel="obj"] []* 3:[verb_form="Inf" & deprel="xcomp"] & 2.head=1.id & 3.head=1.id within <s/>

- Finds sentences with verbs vidět 'see' or slyšet 'hear' governing an accusative object and an infinitive xcomp. There can be any number of other words between these tokens, but only within the sentence.
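The dependency condition 2.head=1.id & 3.head=1.id can be illustrated with a minimal Python sketch that applies the same tests to one sentence. The sentence and its annotation below are a hand-made toy example (an assumption, not corpus output), and find_matches is a hypothetical helper, not part of KonText.

```python
import re

# Hand-made toy annotation (an assumption, not corpus output) of
# "Viděl jsem ho přicházet." 'I saw him coming.'
sentence = [
    {"id": 1, "word": "Viděl", "lemma": "vidět", "head": 0, "deprel": "root"},
    {"id": 2, "word": "jsem", "lemma": "být", "head": 1, "deprel": "aux"},
    {"id": 3, "word": "ho", "lemma": "on", "head": 1, "deprel": "obj", "case": "Acc"},
    {"id": 4, "word": "přicházet", "lemma": "přicházet", "head": 1,
     "deprel": "xcomp", "verb_form": "Inf"},
]


def find_matches(tokens):
    """Return (verb, object, infinitive) word triples that satisfy the query."""
    hits = []
    for i, t1 in enumerate(tokens):
        if not re.fullmatch("vidět|slyšet", t1["lemma"]):
            continue
        for j in range(i + 1, len(tokens)):
            t2 = tokens[j]
            # token 2: accusative object whose head is token 1
            if not (t2.get("case") == "Acc" and t2["deprel"] == "obj"
                    and t2["head"] == t1["id"]):
                continue
            for t3 in tokens[j + 1:]:
                # token 3: infinitive xcomp whose head is token 1
                if (t3.get("verb_form") == "Inf" and t3["deprel"] == "xcomp"
                        and t3["head"] == t1["id"]):
                    hits.append((t1["word"], t2["word"], t3["word"]))
    return hits


print(find_matches(sentence))  # [('Viděl', 'ho', 'přicházet')]
```

As in the query, the three tokens must occur in this order, with any number of other tokens between them.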

[voice="Act" & aux_feats="Mood=Cnd" & aux_feats="Tense=Past"]

- Finds sentences including a verb in the active voice and past conditional mood, e.g. Kdybych si nebyl oholil knír … 'If I hadn't shaved my moustache…'

[voice="Pass" & aux_feats="Mood=Cnd" & aux_feats=".*Tense=Past.*Tense=Past.*"]

- Finds sentences including a verb in the passive voice and past conditional mood, e.g. … aféra by byla bývala ututlána. '… the scandal would have been hushed up.'⁵⁾ ⁶⁾
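Why the regular expression needs Tense=Past twice can be shown with a minimal Python sketch. The two aux_feats values below are simplified, assumed illustrations of auxiliaries' feats joined by "|"; the real values in the corpus contain more categories. The sketch uses re.fullmatch on the assumption that the query pattern must cover the whole attribute value.

```python
import re

# Assumed, simplified aux_feats values: the feats of all auxiliaries
# of one content word joined by "|".
past_conditional = "Mood=Cnd|VerbForm=Fin|Tense=Past|VerbForm=Part|Tense=Past|VerbForm=Part"  # by byla bývala
present_conditional = "Mood=Cnd|VerbForm=Fin|Tense=Past|VerbForm=Part"  # by byla

pattern = re.compile(r".*Tense=Past.*Tense=Past.*")

print(bool(pattern.fullmatch(past_conditional)))     # True
print(bool(pattern.fullmatch(present_conditional)))  # False
```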

[feats="VerbForm=Ger" & aux_feats="VerbForm=Fin" & aux_feats="VerbForm=Part"]

- In English: finds sentences including continuous perfect forms (both present and past), e.g. … has been constantly increasing in velocity.

Morphological annotation

Parts of speech

In UD, part of speech is listed separately from other categories.
Parts of speech are the same for all languages.
Part of speech is given as the value of the attribute upos.
For most languages, the xpos attribute includes a language-specific morphological tag.

Other categories

Other categories are determined by part of speech and language.
Each category is listed as a “<category name>=<category value>” pair, e.g. Number=Sing.
A list of such pairs is the value of the feats attribute.
Categories in the feats attribute are separated by “|”, e.g. the form школы in genitive singular is marked as Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing.
A CQL query can specify each part of the tag separately, e.g. [upos="NOUN" & feats="Gender=Fem" & feats="Case=Gen"] (the order of categories is irrelevant).
The query can also be formulated on a string of characters, e.g. [upos="NOUN" & feats=".*Case=Gen.*Gender=Fem.*"]. The result is the same in both cases.
Some of the categories in feats are also listed as additional attributes. These attributes can be used in searches or to generate frequency lists.
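The two ways of querying feats described above can be sketched in Python as two small query builders. The function names cql_token and cql_token_regex are hypothetical. The regex variant sorts the categories alphabetically by name, because that is the order in which UD lists them inside feats; a single pattern written in a different order would not match.

```python
def cql_token(upos, **feats):
    """One feats condition per category; the order of conditions is irrelevant."""
    conditions = [f'upos="{upos}"']
    for name, value in feats.items():
        conditions.append(f'feats="{name}={value}"')
    return "[" + " & ".join(conditions) + "]"


def cql_token_regex(upos, **feats):
    """Single regex over the feats string; categories sorted by name as in UD."""
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    pattern = ".*" + ".*".join(f"{name}={value}" for name, value in ordered) + ".*"
    return f'[upos="{upos}" & feats="{pattern}"]'


print(cql_token("NOUN", Gender="Fem", Case="Gen"))
# [upos="NOUN" & feats="Gender=Fem" & feats="Case=Gen"]
print(cql_token_regex("NOUN", Gender="Fem", Case="Gen"))
# [upos="NOUN" & feats=".*Case=Gen.*Gender=Fem.*"]
```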

Multi-part tokens

Some tokens, called aggregates, consist of multiple parts.
These parts correspond to different nodes in the syntactic structure.
Such tokens in Czech include forms such as abychom, ses, bylas or oč.
Parts of such tokens are separated by the “|” character.
The parts are given both in the form corresponding to the original form (e.g. se|s) and in the form that would correspond to its unabbreviated version (e.g. se|jsi).
Multi-part tokens are searched for as full forms in all languages. This is the case, for example, with English contractions (can't) or Polish agglutinated forms (byłbym).
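The relation between the two representations of an aggregate can be sketched in Python. The helper split_aggregate is hypothetical; the attribute names iword (parts as they appear in the token) and sword (parts as separate words) are those used in the syntactic structure display, and the values are the ones given above for ses.

```python
def split_aggregate(iword, sword):
    """Pair each surface part of an aggregate (iword) with its full form (sword)."""
    surface_parts = iword.split("|")
    full_parts = sword.split("|")
    if len(surface_parts) != len(full_parts):
        raise ValueError("iword and sword must have the same number of parts")
    return list(zip(surface_parts, full_parts))


print(split_aggregate("se|s", "se|jsi"))
# [('se', 'se'), ('s', 'jsi')]
```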

Syntactic annotation

Each token specifies its syntactic function, i.e. dependency relation (deprel) and a reference to its syntactic governor (head).

Representation of syntactic structure by references

In addition to the head reference, some other attributes of the head are listed for each token: lemma, POS, morphological category, syntactic function.
A token may also have attributes that specify the properties of a function word that depends on the token.
Similar means of representing syntactic structure are used by other syntactically annotated corpora available in the KonText browser (e.g. syn2020).

Function words

According to UD, function words include auxiliary verbs, adpositions, subordinating conjunctions, conjunctions, determiners, and quantifiers.
Function words depend on the corresponding content words.
Types of function words are specified by their syntactic function.
For each function word the content word governor may include the function word's lemma, upos, feats and a more detailed specification of type.
The names of the corresponding content word attributes consist of the function word's deprel and attribute.
For example, case_lemma specifies the lemma of the noun or pronoun's preposition.
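The naming rule can be written out as a minimal Python sketch. Both helper names are hypothetical, and the underscore as the joining character is inferred from attribute names such as case_lemma and aux_feats; not every combination produced this way exists in every language.

```python
def function_word_attribute(deprel, attribute):
    """Compose a content-word attribute name from the function word's deprel and attribute."""
    return f"{deprel}_{attribute}"


def query_by_function_word(deprel, attribute, value):
    """Build a one-token CQL query on a function word property of the content word."""
    return f'[{function_word_attribute(deprel, attribute)}="{value}"]'


print(function_word_attribute("case", "lemma"))      # case_lemma
print(function_word_attribute("aux", "feats"))       # aux_feats
print(query_by_function_word("case", "lemma", "o"))  # [case_lemma="o"]
```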

Coordination

The first conjunct depends on the governor of the entire coordination. Its syntactic function determines the syntactic function of the whole coordination.
The second and subsequent conjuncts always depend on the first conjunct. Their syntactic function is specified as conj.
Conjunctions depend on the following conjunct. Their syntactic function is cc.
A reference to the so-called effective head is used to identify the head regardless of whether the token is a conjunct or not, or whether it is in the initial or non-initial conjunct.
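The idea of the effective head can be sketched in Python: for a non-initial conjunct, follow the conj link to the first conjunct and take that conjunct's head. This is only an illustration of the principle under the rules above; how the corpus actually computes the reference is not described here. The annotation of the sentence is hand-made (an assumption), and effective_head is a hypothetical helper.

```python
# Hand-made toy annotation (an assumption) of "Je to mladý a nadějný chlapík."
# token id -> (word, head id, deprel)
tokens = {
    1: ("Je", 6, "cop"),
    2: ("to", 6, "nsubj"),
    3: ("mladý", 6, "amod"),
    4: ("a", 5, "cc"),          # the conjunction depends on the following conjunct
    5: ("nadějný", 3, "conj"),  # the non-initial conjunct depends on the first one
    6: ("chlapík", 0, "root"),
    7: (".", 6, "punct"),
}


def effective_head(token_id, tokens):
    """Follow conj links up to the first conjunct and return that conjunct's head id."""
    _, head, deprel = tokens[token_id]
    while deprel == "conj":
        _, head, deprel = tokens[head]
    return head


print(effective_head(3, tokens))  # 6
print(effective_head(5, tokens))  # 6
```

Both adjectives end up with the same effective head, chlapík, although only the first one depends on it directly.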

Description of the list of attributes

In Attribute list by language, all attributes used in the corpus are listed.
Columns indicate whether the attribute is used for the language specified by the abbreviation in the header.
Attributes are divided into four categories, distinguished by background color.

Basic attributes

These 12 attributes are on the light purple background.
They consist of the following items: word form, lemma, part of speech, morphological categories, token order in a sentence, head reference and syntactic function.
They are usually taken directly from the output of the tool UDPipe. The format of the output is CoNLL-U.
There are two added attributes: lc and lc_lemma, which repeat word form and lemma without any capital letters.
For languages with multipart tokens (aggregates), there are also two additional sword and iword attributes.
The sword attribute includes the word form of the aggregate split by the “|” character into parts corresponding to syntactic words as they occur outside an aggregate, e.g. for nač and abychom the values of sword equal na|co and aby|bychom.
The iword attribute splits the aggregate into parts without any modification; for the tokens nač and abychom the values of iword equal na|č and a|bychom.
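How the basic attributes relate to the CoNLL-U output can be sketched in Python. The ten column names follow the CoNLL-U format; the sample line is hand-made (an assumption, with xpos left empty as "_"), and parse_conllu_line is a hypothetical helper that also derives the two added lowercase attributes.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_line(line):
    """Turn one tab-separated CoNLL-U token line into a dict and add lc, lc_lemma."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    token = dict(zip(CONLLU_FIELDS, values))
    token["lc"] = token["form"].lower()         # word form without capital letters
    token["lc_lemma"] = token["lemma"].lower()  # lemma without capital letters
    return token


# Hand-made sample line (an assumption, not UDPipe output)
line = "1\tPraha\tPraha\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t0\troot\t_\t_"
token = parse_conllu_line(line)
print(token["upos"], token["deprel"], token["lc"], token["lc_lemma"])
# PROPN root praha praha
```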

Structural attributes

These 7 attributes are on the light blue background.
They extend the reference to the token's syntactic governor (head) by additional attributes, making it easier to identify the head and its properties.
All attributes of this type are available for all languages.

Function word attributes

These attributes are on the light green background.
They are given within the content word in order to specify the essential properties of the dependent function word.
The total number of function word attributes is 20, but no language uses them all.
Attributes refer to 6 types of function words, determined by their syntactic function in relation to the content word.
For each function word, the lemma, part of speech, morphological categories and subtype of the function word can be specified.
An attribute name consists of the name of the function word's syntactic function and the name of its property (attribute).
Unused or uninformative attributes are absent for the given language. There are four possible combinations which do not occur in any language.
The most widely used attribute is case_lemma (lemma of adposition, most often a preposition), present in 35 languages, followed by mark_lemma (lemma of subordinating conjunction), present in 33 languages.
The clf_lemma (lemma of classifier) attribute only appears in Chinese.
If there are several function words of the same type for a content word, their values are separated by the “|” character.

Attributes representing selected categories

On the light brown background, there is a selection of 18 attributes from the feats list.
Only Latvian uses them all, while Maltese uses none. In addition to the language type, their presence or absence also depends on the availability of the category in the UD data.

Errors and shortcomings of linguistic annotation according to UD

- POS and morphological categories do not match
- Inconsistencies in the application of the principles of uniform classification of phenomena in all languages
- Errors and inconsistencies in the given language (e.g. udělals as a unitary token)

The quality of annotations in different languages differs mainly in the volume and quality of training data. It is also affected by the method and tool used for annotation.

We will be grateful for every reported error, discrepancy, deficiency, comment and suggestion at the address CNC user support. Please include the abbreviation “UD” at the beginning of the message subject.

References

Selection of literature about UD

Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): Universal Dependencies. In: Computational Linguistics, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308.

Daniel Zeman (2018): The World of Tokens, Tags and Trees. ISBN 978-80-88132-09-7.

For a complete list, see here.

Tutorials and lectures about UD

Daniel Zeman: Universal Dependencies and the Slavic Languages. Warsaw, 19.11.2018.

Joakim Nivre, Daniel Zeman, Filip Ginter, Francis M. Tyers: Tutorial on Universal Dependencies: Adding a new language to UD

Anna Nedoluzhko, Michal Novak, Martin Popel, Zdenek Zabokrtsky and Daniel Zeman: Coreference meets Universal Dependencies. Prague, 19/04/2021.

Daniel Zeman: Reflexives in Universal Dependencies. Prague, 04/03/2019.

¹⁾

The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/IUDPipe. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.

²⁾

Aggregates are present in the following languages: ar, ca, cs, de, el, en, es, fi, fr, he, it, pl, pt, tr and uk. A list of all aggregates for a given language is displayed as the frequency distribution of word forms following the query [sword = ".+\|.+"], in which the “|” separator is escaped so that it is not read as the regular-expression alternation operator.

³⁾

In a basic query, it is no longer necessary in some languages to separate parts of the aggregate with a space, e.g. był, by, and m of the Polish agglutinated form byłbym or is and n't of the English contraction isn't, even in a longer expression (aren't I). However, a basic query for is or n't will not show concordances including the form isn't.

⁴⁾

The constituent performing the given function is highlighted. If the constituent consists of more than one word, the constituent's governor (head word) is underlined. It is this token which is annotated by the given function.

⁵⁾

The form of the content verb used in the periphrastic passive has an adjectival lemma, e.g. ututlaný 'hushed', the adjectival POS upos=ADJ, and its morphological categories include the features feats="...Variant=Short|VerbForm=Part|Voice=Pass". On the other hand, the reflexive passive, e.g. oholil se '[he] shaved himself', is annotated as feats="...Voice=Act".

⁶⁾

According to the UD guidelines, function words are immediate dependents of the relevant content word. In InterCorp 13ud, values of the feats attribute specified in multiple function words dependent on a single content word governor are concatenated into a single value. As a result, categories such as Tense can occur more than once in the value of such a feats attribute, because it originates in two or more auxiliaries, as in our example from byla '[she] was' and bývala '[she] used to be'. This double occurrence is what the query uses to target the presence of two auxiliaries. If a query looking for passive voice verbs mentioned only [aux_feats="Tense=Past"], the result would also include present conditional forms, where the l-participle (the "Tense=Past" form) occurs just once as the passive auxiliary (… aféra by byla ututlána. '… the scandal would be hushed up.').
