InterCorp Release 13ud – Universal Dependencies

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	141,032,521	116,673,043	394,042,551	1,550,071,364
Positions	Number of word forms	113,838,505	89,819,773	327,968,369	1,223,270,610
Structural attributes	Number of documents	1,657	30	3,994	282
	Number of texts	1,657	111,951	3,994	1,843,528
	Number of sentences	9,782,002	13,606,198	24,318,736	143,196,252
Further information	reference	YES
	representative	NO
	publication date	2021
	foreign languages	40
	tagged languages	35
	lemmatized languages	35
	syntactically annotated languages	35

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Martin Vavřín if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the Universal Dependencies scheme.

Main differences between releases 13 and 13ud

In release 13ud, out of the total number of 41 languages (including Czech), 36 are linguistically annotated; in addition, all such languages are syntactically annotated.
Texts are annotated in the same way in all languages, according to the UD (Universal Dependencies) standard.
General guidelines for annotation are provided on the UD project website (UD Guidelines), including a detailed description of:
- word types (Universal POS tags)
- morphological categories (Universal features)
- syntactic functions (Universal Dependency Relations)
Annotation was performed for all languages by UDPipe, based on the data created in the UD project.¹⁾
In other releases of InterCorp, word class and morphological categories of a word are specified as the value of the tag attribute. For most languages, InterCorp release 13ud retains these language-specific tags in the xpos attribute. However, the UD word class and morphological categories, denoted uniformly for all languages, are listed separately as values of the upos and feats attributes (see below Parts of speech, and Other categories, respectively). Frequently used morphological categories from the feats list have been promoted to the status of regular attributes at the same level as upos. This applies, for example, to morphological case, number, gender or person (case, number, gender, person).
For use in KonText, fused forms or aggregates, i.e. word forms composed of two or even three syntactic words, are represented as divided tokens. In English this concerns, for example, the forms isn't or cannot. For more details see Multi-part tokens below.
Each word is assigned its syntactic function (deprel – see Syntactic functions) and its syntactic governor in the dependency tree (head). To facilitate orientation in the syntactic structure, each word is also annotated with references to important properties of its head (lemma, part of speech and morphological categories), see References to syntactic head. If a content word occurs with a function word (e.g. a preposition, auxiliary verb, subordinating conjunction or determiner), the content word includes some properties of the function word (see References to function words).
Annotations between languages differ in the number of categorial attributes and in links to function words, see List of attributes by language, described below in Description of the list of attributes.
KonText supports queries by word class and other morphological categories using the Insert tag function, which inserts a UD POS (upos) and any category from the feats list into the query. The Insert tag feature is available for all linguistically annotated languages.

Texts in the corpus

InterCorp release 13ud contains the same texts as InterCorp release 13. They differ only in linguistic annotation. However, the token and word count data in release 13ud may differ slightly due to a different tokenization method.

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The choice in the present release includes:

Political commentaries published by Project Syndicate and VoxEurop (formerly PressEurop)
A package of legal texts of the European Union from the Acquis Communautaire corpus
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
Film subtitles from the Open Subtitles database
Translations of the Bible

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; among those left out are texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections

Corpus size in thousands of words

Language		Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Bible	Total
ar	Arabic	34	0	0	0	0	0	0	34
be	Belarusian	5,718	0	0	0	0	0	0	5,718
bg	Bulgarian	7,068	0	0	13,577	9,083	0	0	29,728
ca	Catalan	7,938	0	0	0	0	0	736	8,674
da	Danish	7,136	0	0	20,313	13,916	14,429	657	56,451
de	German	37,633	4,704	2,483	20,610	13,088	8,392	724	87,634
el	Greek	0	0	0	23,853	15,404	23,709	0	62,966
en	English	33,569	4,856	2,670	22,902	15,576	52,106	730	132,409
es	Spanish	26,554	5,614	2,859	26,262	16,249	36,650	0	114,187
et	Estonian	0	0	0	14,896	10,899	10,298	0	36,093
fi	Finnish	5,656	0	0	15,269	10,108	15,047	543	46,622
fr	French	19,773	5,600	3,046	26,200	17,179	25,986	764	98,547
he	Hebrew	0	0	0	0	0	16,221	0	16,221
hi	Hindi	409	0	0	0	0	0	0	409
hr	Croatian	21,923	0	0	0	0	19,048	571	41,543
hu	Hungarian	6,444	0	0	17,852	12,198	21,115	0	57,609
is	Icelandic	0	0	0	0	0	1,581	0	1,581
it	Italian	14,525	1,252	2,747	23,771	15,494	14,700	684	73,174
ja	Japanese	2,189	0	0	0	0	477	0	2,666
lt	Lithuanian	421	0	0	17,316	11,213	558	471	29,979
lv	Latvian	2,646	0	0	17,522	11,682	280	537	32,667
mk	Macedonian	8,881	0	0	0	0	1,877	0	10,758
ms	Malay	0	0	0	0	0	3,521	0	3,521
mt	Maltese	0	0	0	13,935	0	0	0	13,935
nl	Dutch	16,216	813	2,953	23,416	15,558	29,373	717	89,045
no	Norwegian	7,727	0	0	0	0	0	722	8,449
pl	Polish	26,200	0	2,380	19,604	12,817	26,576	583	88,161
pt	Portuguese	4,981	554	2,782	24,598	15,193	41,468	706	90,282
rn	Romani	14	0	0	0	0	0	0	14
ro	Romanian	4,219	0	2,738	8,092	9,446	34,128	0	58,622
ru	Russian	8,642	3,984	0	0	0	6,887	565	20,078
sk	Slovak	8,543	0	0	18,399	12,727	5,133	561	45,363
sl	Slovene	3,871	0	0	18,528	12,251	17,061	0	51,711
sq	Albanian	0	0	0	0	0	2,003	0	2,003
sr	Serbian	11,582	0	0	0	0	20,727	0	32,308
sv	Swedish	15,790	0	0	19,542	13,784	14,666	638	64,419
tr	Turkish	0	0	0	0	0	21,190	0	21,190
uk	Ukrainian	11,459	0	0	0	0	244	596	12,299
vi	Vietnamese	0	0	0	0	0	1,474	0	1,474
zh	Chinese	127	240	0	0	0	2,247	0	2,614
Subtotal		327,887	27,616	24,658	406,459	263,864	489,169	11,504	1,551,157
cs	Czech	113,839	4,351	2,310	19,085	12,908	50,604	562	203,658
TOTAL		441,725	31,967	26,968	425,543	276,772	539,774	12,066	1,754,815

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
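The bottom rows of the table can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the table above, and the explanation that any gap of one unit comes from rounding to thousands is an assumption, not something the table states.

```python
# Figures copied from the table above (thousands of words).
# Column order: Core, Syndicate, Presseurop, Acquis, Europarl, Subtitles, Bible, Total.
rows = {
    "Subtotal": [327887, 27616, 24658, 406459, 263864, 489169, 11504, 1551157],
    "cs": [113839, 4351, 2310, 19085, 12908, 50604, 562, 203658],
    "TOTAL": [441725, 31967, 26968, 425543, 276772, 539774, 12066, 1754815],
}


def row_gap(values):
    """Difference between the sum of the seven parts and the stated row total."""
    return sum(values[:-1]) - values[-1]


# For every column: Subtotal + Czech should equal TOTAL, up to rounding.
column_gaps = [
    s + c - t
    for s, c, t in zip(rows["Subtotal"], rows["cs"], rows["TOTAL"])
]

for name, values in rows.items():
    print(name, "row gap:", row_gap(values))
print("column gaps:", column_gaps)
```

Every gap stays within one unit, which is what one would expect if each cell was rounded to thousands independently.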

Morphological annotation

Parts of speech

In UD, part of speech is listed separately from other categories as the value of the upos attribute.
Parts of speech given in upos are the same for all languages.
In addition to upos, most languages provide a language-specific morphological tag, as the value of the xpos attribute. The xpos value is usually identical to a corresponding tag from the other, non-UD-based versions of InterCorp.

upos	gloss
ADJ	adjective
ADP	adposition (incl. preposition)
ADV	adverb
AUX	auxiliary verb
CCONJ	coordinating conjunction
DET	determiner
INTJ	interjection
NOUN	noun
NUM	numeral
PART	particle
PRON	pronoun
PROPN	proper noun
PUNCT	punctuation
SCONJ	subordinating conjunction
SYM	symbol
VERB	verb
X	other

Other categories

Other categories are embedded under the feats attribute. Their choice and values are determined by part of speech and language.
Each category is listed as a “<category name>=<category value>” pair, e.g. Number=Sing.
Identical or comparable morphological categories and their values are called the same in all languages.
A list of such pairs is the value of the feats attribute.
Categories in the feats attribute are separated by “|”, e.g. the Russian form школы /'ʂkolɨ/ 'school' in genitive singular is marked as feats="Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing".
In an advanced query using the CQL query language each category can be specified separately: the Czech form moře 'sea' is one of the answers to the query [upos="NOUN" & feats="Number=Sing"]. The Russian form is found by the query [upos="NOUN" & feats="Gender=Fem" & feats="Case=Gen"]. The order of categories in the query is irrelevant.
The value of feats can also be treated as a string of characters using regular expressions, e.g. [upos="NOUN" & feats=".*Case=Gen.*Gender=Fem.*"]. Here the order of categories in the query should correspond to their order in the corpus. The result is the same in both cases.
Some of the categories in feats are also available outside the list as categorial attributes at the same level as upos. As a result, a query for a singular noun can simply be written as follows: [upos="NOUN" & number="Sing"]. Similarly, the query for the Russian form [upos="NOUN" & gender="Fem" & case="Gen"] gives the same result as the two queries above. Categorial attributes can also be used to generate frequency lists.²⁾ Such attributes appear on the light brown background in Attribute list by language, or in KonText in the lower part of the table shown in View / Corpus-specific settings….
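The difference between separate feats conditions and a single regular expression can be illustrated outside KonText. The following Python sketch only mimics the behaviour described above on one feats string; the helper parse_feats is hypothetical and not part of any corpus tool.

```python
import re


def parse_feats(feats: str) -> dict:
    """Split a feats string such as 'Case=Gen|Number=Sing' into a dictionary."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# The Russian example from the text above.
feats = "Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"
parsed = parse_feats(feats)

# Separate conditions: the order in which they are written does not matter.
by_conditions = parsed.get("Gender") == "Fem" and parsed.get("Case") == "Gen"

# One regular expression over the whole string: the order must follow the corpus.
by_regex = re.fullmatch(r".*Case=Gen.*Gender=Fem.*", feats) is not None
wrong_order = re.fullmatch(r".*Gender=Fem.*Case=Gen.*", feats) is not None

print(parsed)
print(by_conditions, by_regex, wrong_order)
```

The last expression fails to match because Gender precedes Case in the pattern but follows it in the stored value, which is why the separate conditions or the categorial attributes are usually the safer choice.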

category	gloss	example values
Abbr	abbreviation	Yes
Animacy	animacy	Anim, Inan, Hum, Nhum
Aspect	aspect	Imp, Perf, Hab, Iter, Prog, Prosp
Case	case	Nom, Gen, Dat, Acc, Voc, Loc, Ins, …
Definite	definiteness	Ind, Def, …
Degree	degree	Pos, Cmp, Sup, Equ, Abs
Foreign	foreign word	Yes
Gender	gender	Fem, Masc, Neut, Com
Mood	mood	Ind, Imp, Cnd, …
NumType	numeral type	Card, Ord, Mult, Frac, Sets, …
Number	number	Sing, Plur, Dual, Ptan, Coll, …
Person	person	1, 2, 3, …
Polarity	polarity	Neg, Pos
Polite	politeness	Infm, Form, Elev, Humb
Poss	possessiveness	Yes
PronType	type of pronoun etc.	Prs, Rcp, Art, Int, Rel, Exc, Dem, Emp, Tot, Ind
Reflex	reflexiveness	Yes
Tense	tense	Pres, Past, Fut, Pqp, Imp
Typo	typo	Yes
VerbForm	verb form	Fin, Inf, Part, Conv, Ger, Vnoun, Sup
Voice	voice	Act, Pass, Mid, Cau, …

Multi-part tokens

Some tokens, in the UD parlance called fused words, or aggregates in some Czech corpus-related literature, consist of multiple parts. These parts correspond to different nodes in the syntactic structure. In English, such tokens are contractions consisting of a verb and the negative particle, such as isn't or cannot.
The orthographic form of these words is preserved in the corpus; the individual parts are separated only in the annotation, e.g. in the value of the lemma attribute, with the “|” sign as the separator. It is therefore possible to search for them like other words, by typing the full form into the search box in a simple query (e.g. ses in Czech, can't in English or byłbym in Polish), or by giving the same string as the value of the word attribute in an advanced query using the CQL query language.
In some languages, including English and Czech, a part of the fused token has a different form when occurring in a different context as an orthographically separate word. E.g. n't, a part of isn't, corresponds to not; the Czech auxiliary clitic s, a part of ses, corresponds to jsi. Both variants are represented in the annotation: the iword attribute shows the original form is|n't or se|s, while the sword attribute shows the unabbreviated, “reconstructed” version: is|not or se|jsi.³⁾
In addition to the English tokens isn't (is|n't – is|not) or cannot (can|not),⁴⁾ in Czech there are tokens such as abychom (a|bychom – aby|bychom), bylas (byla|s – byla|jsi) or oč (o|č – o|co); in German zur (zu|r – zu|der) or am (a|m – an|dem); in Polish miałam (miała|m), żebyś (że|by|ś) or chciałbym (chciał|by|m); in French des (de|s – de|les), aux (au|x – à|les) or auquel (au|quel – à|lequel).
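The relation between word, iword and sword can be sketched in Python. The values below are taken from the examples above; the function pair_parts is a hypothetical helper for illustration only.

```python
def pair_parts(iword: str, sword: str) -> list:
    """Pair each part of a fused token (iword) with its reconstructed form (sword)."""
    iparts = iword.split("|")
    sparts = sword.split("|")
    if len(iparts) != len(sparts):
        raise ValueError("iword and sword must have the same number of parts")
    return list(zip(iparts, sparts))


# word -> (iword, sword), as listed in the text above
examples = {
    "isn't": ("is|n't", "is|not"),
    "abychom": ("a|bychom", "aby|bychom"),
    "zur": ("zu|r", "zu|der"),
}

for word, (iword, sword) in examples.items():
    # Joining the iword parts gives back the orthographic form of the token.
    rebuilt = "".join(iword.split("|"))
    print(word, rebuilt == word, pair_parts(iword, sword))
```

Only iword is guaranteed to concatenate back to the original token; sword deliberately differs wherever a part has been expanded to its independent form.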

Syntactic annotation

Syntactic functions

Each token specifies its syntactic function, i.e. dependency relation (deprel) and a reference to its syntactic governor (head).
The table below distinguishes four types of syntactic functions by different typeface:
- Common deprels are listed in bold.
- Deprels of function words are listed in bold italics.
- Deprels for representing coordination and similar phenomena in the dependency structure or for a technical purpose are set in italics.
- Deprels not used in English are listed in gray.
In some languages, some deprels may have subtypes. The subtype name follows the colon after the deprel name, e.g. acl:relcl indicates an attribute expressed by a relative clause. The list below contains only subtypes relevant to English and represented in the corpus. Functions with subtypes for all languages are listed at Universal Dependency Relations.
When querying a deprel that may have a subtype, a possible subtype should be taken into account. For example, to find all words with the deprel acl, whether or not the deprel has a subtype, use the expression deprel="acl.*" instead of deprel="acl". To find all auxiliary verbs, use the expression deprel="aux.*" instead of deprel="aux". To find all subjects, use the expression deprel="nsubj.*".
When a queried deprel targets a coordinated structure, only the first conjunct is found. The second and subsequent conjuncts are marked as deprel="conj". The syntactic function of the entire coordination is thus specified by the deprel attribute of the first conjunct, the head of all other conjuncts. To query the “true” deprel of a non-initial conjunct (deprel="conj"), use the p_deprel attribute. See Coordination below for details.
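Why deprel="acl" misses subtypes while deprel="acl.*" finds them can be shown with ordinary regular expressions. The sketch below assumes that a CQL attribute value is a regular expression matched against the whole attribute value, which Python's re.fullmatch imitates; the list of deprels is a small sample taken from the table in this section.

```python
import re

# A small sample of deprel values from the table in this section.
deprels = ["acl", "acl:relcl", "advcl", "aux", "aux:pass", "nsubj", "nsubj:pass"]


def matching(pattern: str, values: list) -> list:
    """Return the values that the pattern matches in full."""
    return [v for v in values if re.fullmatch(pattern, v)]


print(matching("acl", deprels))      # the bare deprel only
print(matching("acl.*", deprels))    # the deprel and its subtypes
print(matching("nsubj.*", deprels))  # all subjects, including passive ones
```

Because the match covers the whole value, acl.* does not pick up advcl, even though advcl ends in the same letters.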

deprel	gloss	example⁵⁾
acl	adnominal clause, finite or non-finite	The convent of the Poor Clares, known as the Minories, was destroyed to make way for storehouses.
acl:relcl	relative adnominal clause	London has always been a vast ocean in which survival is not certain.
advcl	adverbial clause	The country will pay a heavy price if the president’s obsessions prevail for long.
advmod	adverbial modifier	They were all corrupt opportunists. Gorshkov knew where that idea came from.
amod	adjectival modifier	The sustainable future of humanity is at stake.
appos	apposition	They were going to a new home, a house of her choosing.
aux	auxiliary verb	We have made our voice heard by the world. It's going to work. You can't start improvising now.
aux:pass	passive auxiliary	Men like that are born only once. Who else should I get dressed up for if not her?
case	case marking (incl. preposition)	Karpov's own career might hang in the balance
cc	coordinating conjunction	I now invite you all to eat, drink, and make yourselves at home!
cc:preconj	preconjunct	They are poisoning both the water and the soil.
ccomp	clausal complement	I doubt whether the new model is an improvement.
clf	classifier	三个学生 sān gè xuéshēng
compound	compound	In Gondor ten thousand years would not suffice.
compound:prt	phrasal verb particle	He laid out the city’s streets and rebuilt its walls.
conj	non-initial conjunct	You have two parents and you always will have.
cop	copula	Where's the rest of your luggage?
csubj	clausal subject, finite or nonfinite	It's quite easy to clear up these contradictions. But the most important thing is you shouldn't lose too much time.
csubj:pass	clausal subject of passive clause	Taking notes has been banned.
dep	unspecified dependency	By the 1860s, the South was utterly flush with cash. My dad doesn't really not that good.
det	determiner	What way they went I don’t know and no rabbit knows.
det:predet	predeterminer	People get sick all the time.
discourse	discourse element	‘Yes, please,’ said Ron. Oh dear, what a bore!
dislocated	dislocated elements	Dumplings I like.
expl	expletive	There is a ghost in the room.
fixed	non-initial parts of fixed multiword unit	At least there's one of you brave enough! Of course there may be exceptions.
flat	non-initial parts of flat multiword unit	Let's go to San Francisco. What was Miss O'Hara up to?
flat:foreign	non-initial parts of flat multiword unit	During the colonial period it was called the Portal de los Mercaderes.
goeswith	non-initial parts of incorrectly split form	They come here with out legal permission.
iobj	indirect object	He brought us eggs. Can I buy you a drink?
list	non-initial parts of list	Steve Jones tel.: 555-9814 e-mail: jones@abc.edf
mark	marker	I spent the night telling jokes to keep Petrik from falling asleep at the wheel. I just want to know what you are thinking about when you wake up.
nmod	nominal modifier	Did they put some fish near the infant's grave for his journey into the afterlife?
nmod:npmod	noun phrase as adverbial modifier	He was younger then and a lot more agile. It seemed that everyone had trembling hands and tear-filled eyes.
nmod:poss	possessive nominal modifier	Many saw it as a good thing that her show was taken off the air.
nmod:tmod	temporal modifier	In Plenary today I supported the amendment.
nsubj	nominal subject	Those who venture upon its currents look for prosperity or fame, even if they often founder in its depths.
nsubj:pass	nominal subject of passive clause	The horses were adorned with just one red scarf.
nummod	numeric modifier	Dissolution does but give birth to fresh modes of organization, and one death is the parent of a thousand lives.
obj	object	But who can stop the people? What do you mean? I don't know what to do.
obl	oblique nominal	We might bring an avalanche down on ourselves for no good reason.
obl:npmod	noun phrase as oblique nominal	I get fed up a little sometimes.
obl:tmod	temporal modifier	I leave tomorrow. Tell him everything, tonight.
orphan	orphan after elided head	Mary won gold and Peter bronze.
parataxis	parataxis (incl. parentheticals)	“Is that the only reason?” she asked, putting her eyes close to mine.
punct	punctuation	Máte všecko?
reparandum	overridden disfluency	Go to the right- to the left.
root	root	This was not a good moment in the history of English cuisine.
vocative	vocative	See you later, Sam.
xcomp	open clausal complement	Maria saw me standing at the mirror.

References to syntactic heads

In addition to the pointer to its head (head as the word ID of the head, i.e. its word order position within the sentence, or parent as its position relative to the given word), some other attributes of the head are listed for each token: lemma (p_lemma), POS (p_upos), morphological category (p_feats), and syntactic function (p_deprel).
A token may also have attributes that specify the properties of a function word that depends on the token. For example, the lemma of a preposition is shown by the attribute case_lemma, morphological categories of an auxiliary by aux_feats, morphological categories of a copula by cop_feats, part of speech of a determiner by det_upos, lemma of a marker by mark_lemma.
Similar means of representing syntactic structure are used by other syntactically annotated corpora available in the KonText browser (e.g. syn2020).

References to function words

According to UD, function words include auxiliary verbs, adpositions, subordinating conjunctions, conjunctions, determiners, and quantifiers.
Function words depend on the corresponding content words.
Types of function words are specified by their syntactic function, i.e. by the value of the deprel attribute: aux (auxiliaries), case (prepositions), mark (markers), cop (copula), det (determiners), and clf (classifiers).
For each function word the content word governor may include the function word's lemma, upos, feats and a more detailed specification of type, e.g. aux_type="pass" (see passive auxiliary), or det_type="numgov" (see pronominal quantifier governing the case of the noun).
The names of the corresponding content word attributes consist of the function word's deprel and attribute. For example, case_lemma specifies the lemma of the noun or pronoun's preposition, the aux_feats attribute of a content verb specifies morphological categories of its auxiliary.
A single content word can govern multiple function words, e.g. three for the passive present perfect conditional (she would have been pleased). The values of all the auxiliary words, separated by “|”, then appear in the appropriate attribute. The feats attribute values from multiple auxiliary verbs dependent on a single content word are combined into a single value where some categories, such as verb form specifications, may be repeated because they come from more than one form. For example, in the sentence who would have guessed that, the aux_feats of the content verb guessed are composed of the feats of the auxiliary verbs would (Mood=Ind|Person=3|Tense=Past|VerbForm=Fin) and have (VerbForm=Inf).
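How a combined aux_feats value comes to contain a repeated category can be sketched as follows. The sketch assumes that the values of the individual auxiliaries are simply joined with “|” in their sentence order; the exact procedure used when building the corpus is not described here, so combine_aux_feats is a hypothetical illustration.

```python
import re


def combine_aux_feats(aux_feats_in_order: list) -> str:
    """Join the feats of several auxiliaries into one aux_feats-like value."""
    return "|".join(aux_feats_in_order)


# The auxiliaries of "who would have guessed that", as given in the text above.
would = "Mood=Ind|Person=3|Tense=Past|VerbForm=Fin"
have = "VerbForm=Inf"

aux_feats = combine_aux_feats([would, have])
verb_forms = [p.split("=", 1)[1] for p in aux_feats.split("|") if p.startswith("VerbForm=")]

print(aux_feats)
print(verb_forms)

# A regular expression can require both occurrences of the repeated category.
print(re.fullmatch(r".*VerbForm=Fin.*VerbForm=Inf.*", aux_feats) is not None)
```

The repeated VerbForm entries are what makes it possible to distinguish, in a query, a verb with one auxiliary from a verb with two.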

Coordination

The first conjunct depends on the governor of the entire coordination. Its syntactic function determines the syntactic function of the whole coordination.
The second and subsequent conjuncts always depend on the first conjunct. Their syntactic function is specified as conj.
Conjunctions depend on the following conjunct. Their syntactic function is cc.
A reference to the so-called effective head is used to identify the head regardless of whether the token is a conjunct or not, or whether it is in the initial or non-initial conjunct: the e_id attribute refers to its identifier (the sequence number of the token representing the head within the sentence), the eparent attribute to its position relative to the token.
To find all words with a certain syntactic function, including those that are part of a coordination, use the p_deprel attribute. This attribute shows the syntactic function of the token's head. For example, a query for all objects, including coordinated ones, can be formulated using the disjunction operator (|) as follows: [deprel="obj" | deprel="conj" & p_deprel="obj"].
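Since the same pattern recurs for every syntactic function, such queries can be generated mechanically. The Python helper below is a hypothetical convenience for composing the query string; it only produces text to paste into KonText and does not contact the corpus.

```python
def with_conjuncts(deprel: str, extra: str = "") -> str:
    """Build a CQL token query for a deprel that also covers non-initial conjuncts.

    `extra` is an optional additional condition, e.g. 'upos="PROPN"',
    which is applied to both branches of the disjunction.
    """
    tail = f" & {extra}" if extra else ""
    return (
        f'[deprel="{deprel}"{tail}'
        f' | deprel="conj" & p_deprel="{deprel}"{tail}]'
    )


print(with_conjuncts("obj"))
print(with_conjuncts("nsubj", 'upos="PROPN"'))
```

The first call reproduces the query given above; the second adds a part-of-speech restriction to both the first conjunct and the non-initial ones.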

UD and KonText

Corpus Search

Basic query

A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.⁶⁾

Query for a lemma and a morphological tag

As in previous releases of InterCorp, a lemma and a morphological tag can be entered in an advanced query. For most linguistically annotated languages (except be, da, en, fr, hu, no and ru) it is possible to enter a tag from a language-specific set (national tagset), usually identical to the set used in the previous releases of InterCorp for that language. Just use the xpos attribute instead of the tag attribute. E.g. the query on feminine nouns in the vocative singular in Czech can be entered as follows: [xpos = "NNFS5.*"].
According to UD, part of speech and morphological categories are listed separately as values of the attributes upos and feats, respectively. Their values can be entered using the Insert tag function.
Parts of speech (upos) are the same for all languages. E.g. a query for proper names without using the Insert tag function can be specified as follows: [upos = "PROPN"].
Other morphological categories are listed under the feats attribute. Some of them are available separately under categorial attributes. For details see Other categories above.
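The xpos query for Czech shown above relies on a positional tag, where each character position stands for one category. The following Python sketch decodes the first five positions; the position names follow the usual description of the Czech positional tagset, and the sample tags are illustrative assumptions rather than values looked up in the corpus.

```python
import re

# Assumption: the first five positions of a Czech positional tag encode
# part of speech, detailed part of speech, gender, number and case.
POSITIONS = ["pos", "subpos", "gender", "number", "case"]


def decode_prefix(xpos: str) -> dict:
    """Map the first five characters of a positional tag to category names."""
    return dict(zip(POSITIONS, xpos[:5]))


vocative_tag = "NNFS5-----A----"    # illustrative: noun, feminine, singular, case 5 (vocative)
nominative_tag = "NNFS1-----A----"  # illustrative: same noun type, case 1 (nominative)

print(decode_prefix(vocative_tag))

# The query value "NNFS5.*" fixes the first five positions and leaves the rest open.
print(re.fullmatch(r"NNFS5.*", vocative_tag) is not None)
print(re.fullmatch(r"NNFS5.*", nominative_tag) is not None)
```

With upos and feats the same restriction is spelled out by category name instead of by position, which is what makes the UD attributes comparable across languages.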

Query for a part of speech and morphological categories using the menu

When entering an advanced query, you can use the Insert tag function, which lets you select the POS and/or the values of the relevant categories (properties) from the feats list in all linguistically annotated languages. The set of properties offered for a given POS is determined by their actual occurrence in the corpus, so the list may include incorrect combinations that result from annotation errors.

Query for a syntactic function

Syntactic function is specified for each token as the value of the deprel attribute (see Syntactic functions above).
E.g. a query to show the occurrences of the verb run in the function of the governor of an adnominal clause, is entered as [lemma="run" & deprel="acl"]. Results include examples such as Everyone of the rabbits was seized by the instinct to run away, to go underground. Some people have the idea that rabbits spend a good deal of their time running away from foxes.

Query results

Formatted text

After clicking on the keyword and Formatted text in the context box header, a concordance will appear along with the nearest context in a form that is close to the typography of the original text. For example, there are no spaces between the end of a word and punctuation, and paragraphs are separated by a blank line.

Syntactic structure display

After clicking on the syntax tree icon at the beginning of each concordance line, the syntactic structure of the sentence is displayed. For each node, the word form, POS and syntactic function of the word relative to the given token are given. After clicking on the node, other annotation will appear, especially the lemma of the form.
Multi-part tokens (aggregates) are divided into multiple nodes and the word form then corresponds to the relevant part of the token (the iword attribute). After clicking on such a node, in addition to the lemma of the given part of the multi-word token, its full form (as a separate word, the sword attribute) and the word form of the entire token (word) also appear.
When the cursor is placed over a word in the text line above the structure, or over a node in the structure, the corresponding string and node are highlighted in parallel.

Examples of queries

The queries mainly show the possibilities of using syntactic functions in connection with parts of speech and morphological categories, but also include references to syntactic heads and dependent auxiliaries. Most of the queries concern English, but they are also applicable to other languages, although the specific language may require some modifications to the query. Queries can be entered in one language, or in two or more languages in parallel.

Who are the most likely singers?

[deprel="nsubj" & p_lemma="sing"]

This query finds subjects of the verb sing. One of the results is the sentence The birds sing sweetly in these trees.
The most frequent lexemes filling the subject slot of sing can be found from the list of keyword lemmas (in the KonText menu: Frequency / Lemmas).

What birds do most often

[deprel="nsubj" & lemma="bird"]

This query finds occurrences of bird(s) as the subject. The query finds e.g. the sentence A few birds flew off in disgust.
The verbs governing the subject can be listed using the frequency distribution according to the p_lemma attribute (in the KonText menu: Frequency / Custom... / Attribute: p_lemma).

Prepositional cases

[case_lemma="about" & case="Acc"]

Finds accusative nominals, i.e. pronominal forms such as her or themselves, preceded by the preposition about. In English, only such forms are annotated as case="Acc". For nouns, the case attribute is not specified.⁷⁾
To extend the search to all nouns, drop case="Acc". The query [case_lemma="about"] finds all nominals governing the preposition about, i.e. all nominals in prepositional phrases beginning with this preposition, including sentences such as ‘May I ask what this is all about, sir?’ said Bigwig.
The governing verbs can be listed using frequency distribution according to the p_lemma attribute (in the KonText menu: Frequency / Custom... / Attribute: p_lemma).

[deprel="obj" & case="Dat" | deprel="conj" & p_deprel="obj" & case="Dat"]

- Finds dative objects, even non-initial conjuncts. Such a query finds e.g. the sentence In Trump, they have found a shameless frontman and TV personality who will do their bidding.

[deprel="nsubj" & upos="PROPN" | deprel="conj" & p_deprel="nsubj" & upos="PROPN"]

- Finds proper nouns as subjects, even non-initial conjuncts.

[upos="NOUN" & case="Ins" & deprel="obj" & p_feats="VerbForm=Inf"]

- Finds nouns in the instrumental case as objects of an infinitive. The infinitives can be listed using frequency distribution according to the attribute p_lemma.

[feats="Gender=Neut" & feats="Number=Sing" & feats="Tense=Past" & feats="VerbForm=Part" & upos="VERB" & aux_feats="Person=1"]

- Finds l-participles in neuter singular used with an auxiliary verb in the first person. The query for the participle was entered using the function Insert tag. The same result is obtained by the following query, which uses categorial attributes outside the feats list:

[gender="Neut" & number="Sing" & tense="Past" & verb_form="Part" & upos="VERB" & aux_feats="Person=1"]

1:[lemma="vidět|slyšet"] []* 2:[case="Acc" & deprel="obj"] []* 3:[verb_form="Inf" & deprel="xcomp"] & 2.head=1.id & 3.head=1.id within <s/>

- Finds sentences with verbs vidět 'see' or slyšet 'hear' governing an accusative object and an infinitive xcomp. There can be any number of other words between these tokens, but only within the sentence.

[voice="Act" & aux_feats="Mood=Cnd" & aux_feats="Tense=Past"]

– Finds sentences including a verb in the active voice and past conditional mood, e.g. Kdybych si nebyl oholil knír … 'If I hadn't shaved my moustache…'

[voice="Pass" & aux_feats="Mood=Cnd" & aux_feats=".*Tense=Past.*Tense=Past.*"]

– Finds sentences including a verb in the passive voice and past conditional mood, e.g. … aféra by byla bývala ututlána. '… the scandal would have been hushed up.'⁸⁾ ⁹⁾

[feats="VerbForm=Ger" & aux_feats="VerbForm=Fin" & aux_feats="VerbForm=Part"]

– In English: finds sentences including continuous perfect forms (both present and past), e.g. … has been constantly increasing in velocity.

Description of the list of attributes

In Attribute list by language, all attributes used in the corpus are listed.
Columns indicate whether the attribute is used for the language specified by the abbreviation in the header.
Attributes are divided into four categories, distinguished by background color.

Basic attributes

These 12 attributes are on the light purple background.
They consist of the following items: word form, lemma, part of speech, morphological categories, token order in a sentence, head reference and syntactic function.
They are usually taken directly from the output of the tool UDPipe. The format of the output is CoNLL-U.
There are two added attributes: lc and lc_lemma, which repeat word form and lemma without any capital letters.
For languages with multipart tokens (aggregates), there are two additional attributes, sword and iword.
The sword attribute includes the word form of the aggregate split by the “|” character into parts corresponding to syntactic words as they occur outside an aggregate, e.g. for nač and abychom the values of sword equal na|co and aby|bychom.
The iword attribute splits the aggregate into parts without any modification: for the tokens nač and abychom the values of iword equal na|č and a|bychom.
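
As a rough illustration of how the two attributes relate, the following Python sketch splits the values quoted above on the "|" character. The function name split_aggregate is hypothetical and is not part of any corpus tool; the values are the ones given in the text.

```python
def split_aggregate(value: str) -> list[str]:
    """Split a sword or iword value into its parts on the '|' separator."""
    return value.split("|")


# Values quoted in the text for the Czech aggregates nač and abychom.
sword = {"nač": "na|co", "abychom": "aby|bychom"}
iword = {"nač": "na|č", "abychom": "a|bychom"}

for form in sword:
    syntactic_words = split_aggregate(sword[form])
    surface_parts = split_aggregate(iword[form])
    # The iword parts join back into the original token; the sword parts need not.
    assert "".join(surface_parts) == form
    print(form, syntactic_words, surface_parts)
```

The check inside the loop reflects the difference between the two attributes: iword only cuts the original string, while sword restores the forms the words have outside the aggregate.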

Structural attributes

These 7 attributes are on the light blue background.
They extend the reference to the token's syntactic governor (head) by additional attributes, making it easier to identify the head and its properties.
All attributes of this type are available for all languages.

Function word attributes

These attributes are on the light green background.
They are given within the content word in order to specify the essential properties of the dependent function word.
The total number of function word attributes is 20, but no language uses them all.
Attributes refer to 6 types of auxiliary words, determined by their syntactic function in relation to the semantic word.
For each function word, the lemma, part of speech, morphological categories and subtype of the function word can be specified.
An attribute name consists of the name of the function word's syntactic function and the name of its property (attribute).
Unused or uninformative attributes are absent for the given language. There are four possible combinations which do not occur in any language.
Most languages (35) use the attribute case_lemma (lemma of the adposition, most often a preposition), followed by mark_lemma (lemma of the subordinating conjunction, in 33 languages).
The clf_lemma (lemma of classifier) attribute only appears in Chinese.
If there are several auxiliaries of the same type for a content word, their values are separated by the “|” character.
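
The naming rule and the "|" separator described above can be sketched in Python as follows. This is only an illustration: both function names are hypothetical, the attribute name aux_lemma is assumed from the naming rule rather than taken from the attribute list, and the sample value is made up.

```python
def function_word_attribute(function: str, prop: str) -> str:
    """Build an attribute name from a syntactic function and a property name."""
    return f"{function}_{prop}"


def split_values(value: str) -> list[str]:
    """Split a multi-valued attribute on '|', one item per function word."""
    return value.split("|") if value else []


# Attribute names that occur in the text.
print(function_word_attribute("case", "lemma"))  # case_lemma
print(function_word_attribute("mark", "lemma"))  # mark_lemma
print(function_word_attribute("clf", "lemma"))   # clf_lemma

# Assumed name and made-up value for a content word with three auxiliaries.
name = function_word_attribute("aux", "lemma")
print(name, split_values("být|být|bývat"))
```

Note that values of feats-type attributes already contain "|" between individual features, so a simple split like this one is only meaningful for attributes such as lemmas.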

Attributes representing selected categories

On the light brown background, there is a selection of 18 attributes from the feats list.
Only Latvian uses them all, while Maltese uses none. In addition to the language type, their presence or absence also depends on the availability of the category in the UD data.

Errors and shortcomings of linguistic annotation according to UD

POS and morphological categories do not match
Inconsistencies in the application of the principles of uniform classification of phenomena in all languages
Errors and inconsistencies in the given language (e.g. udělals as a unitary token)

The quality of annotation differs between languages, depending mainly on the volume and quality of the training data. It is also affected by the method and tool used for annotation.

We will be grateful for every reported error, discrepancy, deficiency, comment and suggestion at the address CNC user support. Please include the abbreviation “UD” at the beginning of the message subject.

References

Selection of literature about UD

Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): Universal Dependencies. In: Computational Linguistics, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308.

Daniel Zeman (2018): The World of Tokens, Tags and Trees. ISBN 978-80-88132-09-7.

For a complete list, see here.

Tutorials and lectures about UD

Daniel Zeman: Universal Dependencies and the Slavic Languages. Warsaw, 19.11.2018.

Joakim Nivre, Daniel Zeman, Filip Ginter, Francis M. Tyers: Tutorial on Universal Dependencies: Adding a new language to UD

Anna Nedoluzhko, Michal Novak, Martin Popel, Zdenek Zabokrtsky and Daniel Zeman: Coreference meets Universal Dependencies. Prague, 19/04/2021.

Daniel Zeman: Reflexives in Universal Dependencies. Prague, 04/03/2019.

¹⁾

The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/IUDPipe. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.

²⁾

Note that for technical reasons the names of the categorial attributes are all in lower case, including names such as VerbForm (in feats), rendered as verb_form, or NumType, rendered as num_type. The attribute values, such as Fem, retain the initial upper case character, but are enclosed in double quotes, like other non-embedded attributes.
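
The renaming described in this note can be sketched in Python. The rule below (an underscore before each inner capital letter, then lower case) is inferred only from the two examples VerbForm and NumType and may not cover every attribute; the function names are hypothetical.

```python
import re


def categorial_name(feature_name: str) -> str:
    """Convert a UD feature name such as 'VerbForm' to 'verb_form'."""
    # Put an underscore before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", feature_name).lower()


def feat_to_condition(feat: str) -> str:
    """Turn a feats item such as 'VerbForm=Part' into a categorial condition."""
    name, value = feat.split("=", 1)
    # The value keeps its initial capital and is enclosed in double quotes.
    return f'{categorial_name(name)}="{value}"'


print(categorial_name("VerbForm"))
print(categorial_name("NumType"))
print(feat_to_condition("VerbForm=Part"))
```

A condition built this way is only usable if the category is among the attributes selected from the feats list for the given language.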

³⁾

Aggregates are present in the following languages: ar, ca, cs, de, el, en, es, fi, fr, he, it, pl, pt, tr and uk. A list of all aggregates for a given language is displayed as the frequency distribution of word forms following the query [sword = ".|.+"].

⁴⁾

The first form, preceding the dash, is the original form, i.e. the value of the iword attribute; the second form, after the dash, is the reconstructed form, i.e. the value of the sword attribute. If a parenthesis includes just one form, the two options are identical, or the given language does not provide reconstructed forms.

⁵⁾

The constituent performing the given function is highlighted. If the constituent consists of more than one word, the constituent's governor (head word) is underlined. It is this token which is annotated by the given function.

⁶⁾

In a basic query, it is no longer necessary in some languages to separate parts of the aggregate with a space, e.g. był, by, and m of the Polish agglutinated form byłbym or is and n't of the English contraction isn't, even in a longer expression (aren't I). However, a basic query for is or n't will not show concordances including the form isn't.

⁷⁾

In the current release, only very few occurrences of personal pronouns in the subject position (such as she) are annotated as case="Nom".

⁸⁾

The form of the content verb used in the periphrastic passive has an adjectival lemma, e.g. ututlaný 'hushed', the adjectival POS upos=ADJ, and its morphological categories include the features feats="...Variant=Short|VerbForm=Part|Voice=Pass". On the other hand, the reflexive passive, e.g. oholil se '[he] shaved himself', is annotated as feats="...Voice=Act".

⁹⁾

According to the UD guidelines, function words are immediate dependents of the relevant content word. In InterCorp 13ud, the values of the feats attribute of multiple function words dependent on a single content word are concatenated into a single value. As a result, categories such as Tense can occur more than once in the value of such a feats attribute, because it originates in two or more auxiliaries, as in our example from byla '[she] was' and bývala '[she] used to be'. This double occurrence is what the query uses to target the presence of two auxiliaries. If a query looking for passive voice verbs mentioned only [aux_feats="Tense=Past"], the result would also include present conditional forms, where the l-participle (the "Tense=Past" form) occurs just once, as the passive auxiliary (… aféra by byla ututlána. '… the scandal would be hushed up.').
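
The double-occurrence test described in this note can be tried out on plain strings. The Python sketch below applies the regular expression from the query to two made-up aux_feats values. The feature bundles, their order and the "|" used to join them are assumptions, not corpus data, and the function name is hypothetical. The sketch also assumes that the pattern has to match the whole attribute value.

```python
import re

# The regular expression used in the passive past conditional query.
DOUBLE_PAST = re.compile(r".*Tense=Past.*Tense=Past.*")


def has_two_past_auxiliaries(aux_feats: str) -> bool:
    """True if Tense=Past occurs at least twice in the concatenated value."""
    return DOUBLE_PAST.fullmatch(aux_feats) is not None


# Made-up feature bundles for single auxiliaries (assumptions, not corpus data).
BY = "Mood=Cnd"
L_PARTICIPLE = "Number=Sing|Tense=Past|VerbForm=Part"

past_conditional = "|".join([BY, L_PARTICIPLE, L_PARTICIPLE])  # by byla bývala
present_conditional = "|".join([BY, L_PARTICIPLE])             # by byla

print(has_two_past_auxiliaries(past_conditional))     # True
print(has_two_past_auxiliaries(present_conditional))  # False
```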
