Universal Dependencies – UD
Universal Dependencies is an open international project aiming at linguistic annotation that is consistent across different languages. Some recent versions of the InterCorp parallel corpus (13ud and 16ud) have been annotated in terms of morphological categories, syntactic functions and syntactic structure following the UD guidelines and using the tools developed within the UD project.
General guidelines for annotation are provided on the UD project website (UD Guidelines), including a detailed description of:
- word types (Universal POS tags)
- morphological categories (Universal features)
- syntactic functions (Universal Dependency Relations)
Key specifics of the UD annotation as used in InterCorp:
- In other releases of InterCorp, the word class and morphological categories of a word are specified as the value of the `tag` attribute. In the UD annotation, these language-specific tags are retained for most languages, in the `xpos` attribute. However, the UD word class and morphological categories, denoted uniformly for all languages, are listed separately as values of the `upos` and `feats` attributes (see Parts of speech and Other categories below, respectively). Frequently used morphological categories from the `feats` list have been promoted to the status of regular attributes at the same level as `upos`. This applies, for example, to morphological case, number, gender or person (`case`, `number`, `gender`, `person`).
- For use in KonText, fused forms or aggregates, i.e. word forms composed of two or even three syntactic words, are treated as divided tokens. In English this concerns, for example, the forms isn't or cannot. For more details see Multi-part tokens below.
- Each word is assigned its syntactic function (`deprel`, see Syntactic functions) and its syntactic governor in the dependency tree (`head`). To facilitate orientation in the syntactic structure, each word is also annotated with references to important properties of its head (lemma, part of speech and morphological categories), see References to syntactic heads. If a content word occurs with a function word (e.g. preposition, auxiliary verb, subordinating conjunction, determiner), the content word includes some properties of the function word (see References to function words).
- Annotations differ between languages in the number of categorial attributes and in links to function words, see Description of the list of attributes below.
- KonText supports queries by word class and other morphological categories using the `Insert tag` function, which inserts a UD POS (`upos`) and any category from the `feats` list into the query. The `Insert tag` feature is available for all linguistically annotated languages.
Morphological annotation
Parts of speech
- In UD, part of speech is listed separately from other categories as the value of the `upos` attribute.
- Parts of speech given in `upos` are the same for all languages.
- In addition to `upos`, most languages provide a language-specific morphological tag as the value of the `xpos` attribute. The `xpos` value is usually identical to the corresponding tag from the other, non-UD-based versions of InterCorp.
upos | gloss |
---|---|
ADJ | adjective |
ADP | adposition (incl. preposition) |
ADV | adverb |
AUX | auxiliary verb |
CCONJ | coordinating conjunction |
DET | determiner |
INTJ | interjection |
NOUN | noun |
NUM | numeral |
PART | particle |
PRON | pronoun |
PROPN | proper noun |
PUNCT | punctuation |
SCONJ | subordinating conjunction |
SYM | symbol |
VERB | verb |
X | other |
Other categories
- Other categories are embedded under the `feats` attribute. Their choice and values are determined by part of speech and language.
- Each category is listed as a “<category name>=<category value>” pair, e.g. `Number=Sing`.
- Identical or comparable morphological categories and their values have the same names in all languages.
- A list of such pairs is the value of the `feats` attribute.
- Categories in the `feats` attribute are separated by “|”, e.g. the Russian form школы /'ʂkolɨ/ 'school' in genitive singular is marked as `feats="Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"`.
- In an advanced query using the CQL query language, each category can be specified separately: the Czech form moře 'sea' is one of the answers to the query `[upos="NOUN" & feats="Number=Sing"]`. The Russian form is found by the query `[upos="NOUN" & feats="Gender=Fem" & feats="Case=Gen"]`. The order of categories in the query is irrelevant.
- The value of `feats` can also be treated as a string of characters using regular expressions, e.g. `[upos="NOUN" & feats=".*Case=Gen.*Gender=Fem.*"]`. Here the order of categories in the query must correspond to their order in the corpus. The result is the same in both cases.
- Some of the categories in `feats` are also listed outside the list as categorial attributes at the same level as `upos`. As a result, a query for a singular noun can be written simply as `[upos="NOUN" & number="Sing"]`. Similarly, the query for the Russian form `[upos="NOUN" & gender="Fem" & case="Gen"]` gives the same result as the two queries above. Categorial attributes can also be used to generate frequency lists.1) Such attributes appear on the light brown background in Attribute list by language, or in KonText in the lower part of the table shown in `View` / `Corpus-specific settings…`.
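The difference between querying categories separately and querying `feats` as one string can be illustrated outside KonText. The following is only a minimal Python sketch: the helper `parse_feats` is hypothetical, and the use of `re.fullmatch` rests on the assumption that a regular expression in a query has to match the whole attribute value.

```python
import re


def parse_feats(feats: str) -> dict[str, str]:
    """Split a feats value such as 'Case=Gen|Number=Sing' into a dict.

    Hypothetical helper for illustration; not part of KonText.
    """
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# The feats value of the Russian form quoted in the text above.
russian = "Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"
cats = parse_feats(russian)

# Order-independent check, analogous to feats="Gender=Fem" & feats="Case=Gen".
by_category = cats.get("Gender") == "Fem" and cats.get("Case") == "Gen"

# Order-dependent checks, analogous to a regular expression over the string.
# Assumption: the pattern must cover the whole value, hence fullmatch.
in_corpus_order = re.fullmatch(r".*Case=Gen.*Gender=Fem.*", russian) is not None
reversed_order = re.fullmatch(r".*Gender=Fem.*Case=Gen.*", russian) is not None

print(by_category, in_corpus_order, reversed_order)
```

In this sketch the category-based check and the pattern in corpus order both succeed, while the pattern with the categories reversed does not, which is why the order matters only for the regular-expression variant.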
category | gloss | example values |
---|---|---|
Abbr | abbreviation | Yes |
Animacy | animacy | Anim, Inan, Hum, Nhum |
Aspect | aspect | Imp, Perf, Hab, Iter, Prog, Prosp |
Case | case | Nom, Gen, Dat, Acc, Voc, Loc, Ins, … |
Definite | definiteness | Ind, Def, … |
Degree | degree | Pos, Cmp, Sup, Equ, Abs |
Foreign | foreign word | Yes |
Gender | gender | Fem, Masc, Neut, Com |
Mood | mood | Ind, Imp, Cnd, … |
NumType | numeral type | Card, Ord, Mult, Frac, Sets, … |
Number | number | Sing, Plur, Dual, Ptan, Coll, … |
Person | person | 1, 2, 3, … |
Polarity | polarity | Neg, Pos |
Polite | politeness | Infm, Form, Elev, Humb |
Poss | possessiveness | Yes |
PronType | type of pronoun etc. | Prs, Rcp, Art, Int, Rel, Exc, Dem, Emp, Tot, Ind |
Reflex | reflexiveness | Yes |
Tense | tense | Pres, Past, Fut, Pqp, Imp |
Typo | typo | Yes |
VerbForm | verb form | Fin, Inf, Part, Conv, Ger, Vnoun, Sup |
Voice | voice | Act, Pass, Mid, Cau, … |
Multi-part tokens
- Some tokens, in the UD parlance called fused words, or aggregates in some Czech corpus-related literature, consist of multiple parts. These parts correspond to different nodes in the syntactic structure. In English, such tokens represent contractions, consisting of a verb and the negative particle such as isn't or cannot.
- The orthographic form of these words is preserved in the corpus; the individual parts are separated only in the annotation, e.g. in the value of the `lemma` attribute, with the “|” sign as the separator. It is therefore possible to search for them like other words, either by typing the full form into the search box in a simple query (e.g. ses in Czech, can't in English or byłbym in Polish), or by giving the same strings as the value of the `word` attribute in an advanced query using the CQL query language.
- In some languages, including English and Czech, a part of the fused token has a different form when it occurs in a different context as an orthographically separate word. E.g. n't, a part of isn't, corresponds to not; the Czech auxiliary clitic s, a part of ses, corresponds to jsi. Both variants are represented in the annotation: the `iword` attribute shows the original form `is|n't` or `se|s`, while the `sword` attribute shows the unabbreviated, “reconstructed” version: `is|not` or `se|jsi`.2)
- In addition to the English tokens isn't (`is|n't` – `is|not`) or cannot (`can|not`),3) in Czech there are tokens such as abychom (`a|bychom` – `aby|bychom`), bylas (`byla|s` – `byla|jsi`) or oč (`o|č` – `o|co`); in German zur (`zu|r` – `zu|der`) or am (`a|m` – `an|dem`); in Polish miałam (`miała|m`), żebyś (`że|by|ś`) or chciałbym (`chciał|by|m`); in French des (`de|s` – `de|les`), aux (`au|x` – `à|les`) or auquel (`au|quel` – `à|lequel`).
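For readers who process exported annotation, the relation between the `iword` and `sword` values can be sketched in Python. The function name `pair_parts` is hypothetical, and the sketch assumes that both values always contain the same number of “|”-separated parts, which holds for the examples listed above.

```python
def pair_parts(iword: str, sword: str) -> list[tuple[str, str]]:
    """Pair each surface part (iword) with its reconstructed form (sword).

    Hypothetical helper; assumes both values have the same number of parts.
    """
    surface = iword.split("|")
    reconstructed = sword.split("|")
    if len(surface) != len(reconstructed):
        raise ValueError("iword and sword have a different number of parts")
    return list(zip(surface, reconstructed))


# Examples taken from the list above.
print(pair_parts("is|n't", "is|not"))       # English isn't
print(pair_parts("a|bychom", "aby|bychom")) # Czech abychom
print(pair_parts("zu|r", "zu|der"))         # German zur
```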
Syntactic annotation
Syntactic functions
- Each token specifies its syntactic function, i.e. dependency relation (`deprel`), and a reference to its syntactic governor (`head`).
- The table below distinguishes four types of syntactic functions by different typeface:
  - Common deprels are listed in bold.
  - Deprels of function words are listed in bold italics.
  - Deprels for representing coordination and similar phenomena in the dependency structure or for a technical purpose are set in italics.
  - Deprels not used in English are listed in gray.
- In some languages, some deprels may have subtypes. The subtype name follows the colon after the deprel name, e.g. `acl:relcl` indicates an attribute expressed by a relative clause. The list below contains only subtypes relevant to English and represented in the corpus. Functions with subtypes for all languages are listed at Universal Dependency Relations.
- When querying a deprel that may have a subtype, a possible subtype should be taken into account. For example, to find all words with the deprel `acl`, whether or not the deprel has a subtype, use the expression `deprel="acl.*"` instead of `deprel="acl"`. To find all auxiliary verbs, use the expression `deprel="aux.*"` instead of `deprel="aux"`. To find all subjects, use the expression `deprel="nsubj.*"`.
- When a queried deprel targets a coordinated structure, only the first conjunct is found. The second and subsequent conjuncts are marked as `deprel="conj"`. The syntactic function of the entire coordination is thus specified by the `deprel` attribute of the first conjunct, the head of all other conjuncts. To query the “true” deprel of a non-initial conjunct (`deprel="conj"`), use the `p_deprel` attribute. See Coordination below for details.
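Why `deprel="acl"` misses subtyped values can be shown with a small Python sketch. It assumes, in line with the advice above, that the query pattern has to match the whole attribute value, which `re.fullmatch` imitates; the list of values is just a sample from the table below, and `matching` is a hypothetical helper.

```python
import re

# A sample of deprel values from the table of syntactic functions.
deprels = ["acl", "acl:relcl", "advcl", "aux", "aux:pass", "nsubj", "nsubj:pass"]


def matching(pattern: str, values: list[str]) -> list[str]:
    """Return the values that the pattern matches in full (assumed semantics)."""
    return [value for value in values if re.fullmatch(pattern, value)]


print(matching("acl", deprels))      # plain deprel only
print(matching("acl.*", deprels))    # plain deprel and its subtypes
print(matching("nsubj.*", deprels))  # all nominal subjects
```

A stricter pattern such as `acl(:.*)?` would accept only the bare deprel or the deprel followed by a colon and a subtype; whether that extra precision is needed depends on the deprel inventory of the given language.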
deprel | gloss | example4) |
---|---|---|
acl | adnominal clause, finite or non-finite | The convent of the Poor Clares, known as the Minories, was destroyed to make way for storehouses. |
acl:relcl | relative adnominal clause | London has always been a vast ocean in which survival is not certain. |
advcl | adverbial clause | The country will pay a heavy price if the president’s obsessions prevail for long. |
advmod | adverbial modifier | They were all corrupt opportunists. Gorshkov knew where that idea came from . |
amod | adjectival modifier | The sustainable future of humanity is at stake. |
appos | apposition | They were going to a new home, a house of her choosing. |
aux | auxiliary verb | We have made our voice heard by the world. It's going to work. You can't start improvising now. |
aux:pass | passive auxiliary | Men like that are born only once. Who else should I get dressed up for if not her? |
case | case marking (incl. preposition) | Karpov's own career might hang in the balance |
cc | coordinating conjunction | I now invite you all to eat, drink, and make yourselves at home! |
cc:preconj | preconjunct | They are poisoning both the water and the soil. |
ccomp | clausal complement | I doubt whether the new model is an improvement. |
clf | classifier | 三个学生 sān gè xuéshēng |
compound | compound | In Gondor ten thousand years would not suffice. |
compound:prt | phrasal verb particle | He laid out the city’s streets and rebuilt its walls. |
conj | non-initial conjunct | You have two parents and you always will have. |
cop | copula | Where's the rest of your luggage? |
csubj | clausal subject, finite or nonfinite | It's quite easy to clear up these contradictions. But the most important thing is you shouldn't lose too much time. |
csubj:pass | clausal subject of passive clause | Taking notes has been banned. |
dep | unspecified dependency | By the 1860s, the South was utterly flush with cash. My dad doesn't really not that good. |
det | determiner | What way they went I don’t know and no rabbit knows . |
det:predet | predeterminer | People get sick all the time. |
discourse | discourse element | ‘Yes, please,’ said Ron. Oh dear, what a bore! |
dislocated | dislocated elements | Dumplings I like. |
expl | expletive | There is a ghost in the room. |
fixed | non-initial parts of fixed multiword unit | At least there's one of you brave enough! Of course there may be exceptions. |
flat | non-initial parts of flat multiword unit | Let's go to San Francisco. What was Miss O'Hara up to? |
flat:foreign | non-initial parts of flat multiword unit | During the colonial period it was called the Portal de los Mercaderes . |
goeswith | non-initial parts of incorrectly split form | They come here with out legal permission. |
iobj | indirect object | He brought us eggs. Can I buy you a drink? |
list | non-initial parts of list | Steve Jones tel.: 555-9814 e-mail: jones@abc.edf |
mark | marker | I spent the night telling jokes to keep Petrik from falling asleep at the wheel. I just want to know what you are thinking about when you wake up. |
nmod | nominal modifier | Did they put some fish near the infant's grave for his journey into the afterlife ? |
nmod:npmod | noun phrase as adverbial modifier | He was younger then and a lot more agile. It seemed that everyone had trembling hands and tear-filled eyes. |
nmod:poss | possessive nominal modifier | Many saw it as a good thing that her show was taken off the air. |
nmod:tmod | temporal modifier | In Plenary today I supported the amendment. |
nsubj | nominal subject | Those who venture upon its currents look for prosperity or fame, even if they often founder in its depths. |
nsubj:pass | nominal subject of passive clause | The horses were adorned with just one red scarf. |
nummod | numeric modifier | Dissolution does but give birth to fresh modes of organization, and one death is the parent of a thousand lives. |
obj | object | But who can stop the people? What do you mean? I don't know what to do. |
obl | oblique nominal | We might bring an avalanche down on ourselves for no good reason . |
obl:npmod | noun phrase as oblique nominal | I get fed up a little sometimes. |
obl:tmod | temporal modifier | I leave tomorrow. Tell him everything, tonight. |
orphan | orphan after elided head | Mary won gold and Peter bronze. |
parataxis | parataxis (incl. parentheticals) | “Is that the only reason?” she asked, putting her eyes close to mine. |
punct | punctuation | Is that all? |
reparandum | overridden disfluency | Go to the right- to the left. |
root | root | This was not a good moment in the history of English cuisine. |
vocative | vocative | See you later, Sam. |
xcomp | open clausal complement | Maria saw me standing at the mirror. |
References to syntactic heads
- In addition to the pointer to its head (`head` as the word ID of the head, i.e. its word order position within the sentence, or `parent` as its position relative to the given word), some other attributes of the head are listed for each token: lemma (`p_lemma`), POS (`p_upos`), morphological categories (`p_feats`), and syntactic function (`p_deprel`).
- A token may also have attributes that specify the properties of a function word that depends on the token. For example, the lemma of a preposition is shown by the attribute `case_lemma`, morphological categories of an auxiliary by `aux_feats`, morphological categories of a copula by `cop_feats`, part of speech of a determiner by `det_upos`, lemma of a marker by `mark_lemma`.
- Similar means of representing syntactic structure are used by other syntactically annotated corpora available in the KonText browser (e.g. `syn2020`).
References to function words
- According to UD, function words include auxiliary verbs, adpositions, subordinating conjunctions, conjunctions, determiners, and quantifiers.
- Function words depend on the corresponding content words.
- Types of function words are specified by their syntactic function, i.e. by the value of the `deprel` attribute: `aux` (auxiliaries), `case` (prepositions), `mark` (markers), `cop` (copula), `det` (determiners), and `clf` (classifiers).
- For each function word, the content word governor may include the function word's `lemma`, `upos`, `feats` and a more detailed specification of `type`, e.g. `aux_type="pass"` (see passive auxiliary), or `det_type="numgov"` (see pronominal quantifier governing the case of the noun).
- The names of the corresponding content word attributes consist of the function word's `deprel` and attribute. For example, `case_lemma` specifies the lemma of the noun or pronoun's preposition, and the `aux_feats` attribute of a content verb specifies morphological categories of its auxiliary.
- A single content word can govern multiple function words, e.g. three for the passive present perfect conditional (she would have been pleased). The values of all the auxiliary words, separated by “`|`”, then appear in the appropriate attribute. The `feats` attribute values from multiple auxiliary verbs dependent on a single content word are combined into a single value where some categories, such as verb form specifications, may be repeated because they come from more than one form. For example, in the sentence who would have guessed that, the `aux_feats` of the content verb guessed are composed of the feats of the auxiliary verbs would (`Mood=Ind|Person=3|Tense=Past|VerbForm=Fin`) and have (`VerbForm=Inf`).
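The naming scheme and the combination of values can be sketched in Python as follows. Both helper functions are hypothetical, and the exact serialization used in the corpus (in particular, simple joining with “|” in the order of the auxiliaries) is an assumption based on the description above.

```python
def function_word_attribute(deprel: str, attribute: str) -> str:
    """Build the content-word attribute name, e.g. 'aux' + 'feats' -> 'aux_feats'."""
    return f"{deprel}_{attribute}"


def combine_values(values: list[str]) -> str:
    """Join the values of several function words of the same type with '|'.

    Assumption: values are simply concatenated in the order of the auxiliaries.
    """
    return "|".join(value for value in values if value)


# Feats of the auxiliaries in "who would have guessed that", as quoted above.
would = "Mood=Ind|Person=3|Tense=Past|VerbForm=Fin"
have = "VerbForm=Inf"

name = function_word_attribute("aux", "feats")
aux_feats = combine_values([would, have])

print(name, "=", aux_feats)
# VerbForm appears twice because it comes from two different auxiliaries.
print(aux_feats.count("VerbForm="))
```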
Coordination
- The first conjunct depends on the governor of the entire coordination. Its syntactic function determines the syntactic function of the whole coordination.
- The second and subsequent conjuncts always depend on the first conjunct. Their syntactic function is specified as `conj`.
- Conjunctions depend on the following conjunct. Their syntactic function is `cc`.
- A reference to the so-called effective head is used to identify the head regardless of whether the token is a conjunct or not, or whether it is in the initial or non-initial conjunct: the `e_id` attribute refers to its identifier (the sequence number of the token representing the head within the sentence), the `eparent` attribute to its position relative to the token.
- In InterCorp release 16ud, there is an additional `e_deprel` attribute whose value equals `deprel` of the given token, except when the token is a non-initial conjunct, i.e. when its `deprel` equals `conj`. Then the value of `e_deprel` equals the value of `p_deprel`, i.e. shows the syntactic function of the whole coordination.
- The `e_deprel` attribute also has the same value as `p_deprel` when the `deprel` attribute equals `fixed`, `flat`, `compound` or `list`. Tokens within such constructions can thus also be found using the syntactic function of the whole construction, i.e. the `e_deprel` attribute.
- To find all words with a certain syntactic function, including those that are part of a coordination, in InterCorp release 13ud, where the `e_deprel` attribute is not available, the solution is to use the `p_deprel` attribute. This attribute shows the syntactic function of the token's head. For example, a query for all direct objects, including coordinated ones, can be formulated using the disjunction operator (|) as follows: `[deprel="obj" | deprel="conj" & p_deprel="obj"]`.
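The rule for `e_deprel` and the 13ud workaround can be restated as a short Python sketch. The function names and the toy annotation of the coordinated object a shameless frontman and TV personality are illustrative assumptions, and deprel subtypes (e.g. `compound:prt`) are deliberately left out because the description above mentions only the plain values.

```python
# Deprels whose e_deprel is taken over from the head, as described above.
INHERITING = {"conj", "fixed", "flat", "compound", "list"}


def effective_deprel(deprel: str, p_deprel: str) -> str:
    """Sketch of how e_deprel relates to deprel and p_deprel (16ud)."""
    return p_deprel if deprel in INHERITING else deprel


def is_object_13ud(deprel: str, p_deprel: str) -> bool:
    """Mirror of [deprel="obj" | deprel="conj" & p_deprel="obj"] for 13ud."""
    return deprel == "obj" or (deprel == "conj" and p_deprel == "obj")


# Toy (word, deprel, p_deprel) triples; the annotation is assumed.
tokens = [
    ("frontman", "obj", "root"),
    ("and", "cc", "conj"),
    ("personality", "conj", "obj"),
]

for word, deprel, p_deprel in tokens:
    print(word, effective_deprel(deprel, p_deprel), is_object_13ud(deprel, p_deprel))
```

For these toy tokens both approaches pick out the same two words, frontman and personality, which is the equivalence the 13ud query relies on.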
UD and KonText
Corpus Search
Basic query
- A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.5)
Query for a lemma and a morphological tag
- As in previous releases of InterCorp, a lemma and a morphological tag can be entered in an advanced query. For most linguistically annotated languages (except be, da, en, fr, hu, no and ru) it is possible to enter a tag from a language-specific set (national tagset), usually identical to the set used in the previous releases of InterCorp for that language. Just use the `xpos` attribute instead of the `tag` attribute. E.g. the query for feminine nouns in the vocative singular in Czech can be entered as follows: `[xpos = "NNFS5.*"]`.
- According to UD, part of speech and morphological categories are listed separately as values of the attributes `upos` and `feats`, respectively. Their values can be entered using the `Insert tag` function.
- Parts of speech (`upos`) are the same for all languages. E.g. a query for proper names without using the `Insert tag` function can be specified as follows: `[upos = "PROPN"]`.
- Other morphological categories are listed under the `feats` attribute. Some of them are available separately under categorial attributes. For details see Other categories above.
Query for a part of speech and morphological categories using the menu
- When entering an advanced query, you can use the `Insert tag` function, which lets you select the POS and/or the values of the relevant categories (properties) from the `feats` list in all linguistically annotated languages. The properties offered for a given POS are determined by their actual occurrence in the corpus, so the list may include incorrect combinations.
Query for a syntactic function
- Syntactic function is specified for each token as the value of the `deprel` attribute (see Syntactic functions above).
- E.g. a query to show the occurrences of the verb run as the head of an adnominal clause is entered as `[lemma="run" & deprel="acl"]`. Results include examples such as Everyone of the rabbits was seized by the instinct to run away, to go underground. Some people have the idea that rabbits spend a good deal of their time running away from foxes.
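When many similar queries are needed, the token expressions shown in this section can be assembled programmatically. The following Python sketch uses a hypothetical helper, `token_query`, that is not part of KonText; it inserts values verbatim and does no escaping, so it is suitable only for simple values like those used here.

```python
def token_query(conditions: dict[str, str]) -> str:
    """Build a single-token CQL expression from attribute/value pairs.

    Hypothetical helper: values are inserted verbatim, without escaping.
    """
    parts = [f'{attribute}="{value}"' for attribute, value in conditions.items()]
    return "[" + " & ".join(parts) + "]"


print(token_query({"lemma": "run", "deprel": "acl"}))
print(token_query({"upos": "PROPN"}))
print(token_query({"xpos": "NNFS5.*"}))
```

The resulting strings can be pasted into the advanced query field.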
Query results
Formatted text
- After clicking on the keyword and `Formatted text` in the context box header, the concordance appears along with its nearest context in a form that is close to the typography of the original text. For example, there are no spaces between the end of a word and punctuation, and paragraphs are separated by a blank line.
Syntactic structure display
- After clicking on the syntax tree icon at the beginning of each concordance line, the syntactic structure of the sentence is displayed. For each node, the word form, POS and syntactic function of the word relative to the given token are given. After clicking on the node, other annotation will appear, especially the lemma of the form.
- Multi-part tokens (aggregates) are divided into multiple nodes and the word form then corresponds to the relevant part of the token (the `iword` attribute). After clicking on such a node, in addition to the lemma of the given part of the multi-part token, its full form (as a separate word, the `sword` attribute) and the word form of the entire token (`word`) also appear.
- In the text line above the structure and in the structure itself, the strings and nodes under the cursor are highlighted in parallel.
Examples of queries
The queries mainly show the possibilities of using syntactic functions in connection with parts of speech and morphological categories, but also include references to syntactic heads and dependent auxiliaries. Most of the queries concern English, but they are also applicable to other languages, although the specific language may require some modifications to the query. Queries can be entered in one language, or in two or more languages in parallel.
Who are the most likely singers
[deprel="nsubj" & p_lemma="sing"]
- This query finds subjects of the verb sing. One of the results is the sentence The birds sing sweetly in these trees.
- The most frequent lexemes filling the subject slot of sing can be found from the list of keyword lemmas (in the KonText menu: `Frequency / Lemmas`).
What birds do most often
[deprel="nsubj" & lemma="bird"]
- This query finds occurrences of bird(s) as the subject. The query finds e.g. the sentence A few birds flew off in disgust.
- The verbs governing the subject can be listed using the frequency distribution according to the `p_lemma` attribute (in the KonText menu: `Frequency / Custom... / Attribute: p_lemma`).
Nouns following a specific preposition
[case_lemma="about" & case="Acc"]
- This query finds accusative nominals, i.e. pronominal forms such as her or themselves, preceded by the preposition about. In English, only such forms are annotated as `case="Acc"`. For nouns, the `case` attribute is not specified.6)
- To extend the search to all nouns, drop `case="Acc"`. The query `[case_lemma="about"]` finds all nominals governing the preposition about, i.e. all nominals in prepositional phrases beginning with this preposition, including sentences such as ‘May I ask what this is all about, sir?’ said Bigwig.
- The governing verbs can be listed using frequency distribution according to the `p_lemma` attribute (in the KonText menu: `Frequency / Custom... / Attribute: p_lemma`).
- The query does not assume any specific word order: the noun could also precede the preposition, although this would be highly unlikely.
Verbs taking an indirect object
[deprel="iobj"]
- This query finds indirect objects.
- The lemma of the indirect object's head can be listed using frequency distribution according to the attribute `p_lemma`.
- Note that in UD, dative complements in languages such as German or Czech are non-core dependents. As such, they should be labelled as `deprel="obl"` or (preferably but not obligatorily) `deprel="obl:arg"`. For more details see Core Arguments vs. Oblique Modifiers.
Direct or indirect objects, also as conjuncts
[e_deprel="i?obj"]
- This query finds direct or indirect objects, even as non-initial conjuncts, e.g. in the sentence In Trump, they have found a shameless frontman and TV personality who will do their bidding.
- Note that for coordinated constituents, a separate concordance is shown for each conjunct.
[deprel="i?obj" | deprel="conj" & p_deprel="i?obj"]
- This query should be used in 13ud, where the `e_deprel` attribute is not available.
- Either the keyword's `deprel` denotes the direct or indirect object (`deprel="i?obj"`, or equivalently `deprel="obj|iobj"`), or the keyword's `deprel` is `conj` (`deprel="conj"`) and the keyword depends on a direct or indirect object (`p_deprel="i?obj"`), i.e. it is the non-initial conjunct in a coordinated constituent functioning as direct or indirect object.
- In 16ud we get the same result using the `e_deprel` attribute in the simpler query `[e_deprel="i?obj"]`.
Proper nouns as subjects, also as conjuncts
[deprel="nsubj" & upos="PROPN" | deprel="conj" & p_deprel="nsubj" & upos="PROPN"]
- This query finds proper nouns as subjects, including non-initial conjuncts.
- Concordances include sentences such as And what does Crump say? or “I never even saw her,” said Pat.
- In 16ud, the same query can be simplified using the `e_deprel` attribute:
[e_deprel="nsubj" & upos="PROPN"]
Gerunds preceded by "with" as the marker
[verb_form="Ger" & mark_lemma="with"]
- This query finds ing-forms heading non-finite clauses and preceded by with, as in With the Italian front collapsing, they need him elsewhere.7)
- The lemma of the gerund's head (often a finite verb) can be listed using frequency distribution according to the attribute `p_lemma`.
- Its part of speech and other categories can be listed using frequency distribution according to the attributes `p_upos` and `p_feats`.
Verbs of sensing followed by an object and an infinitive
1:[lemma="feel|sense|perceive"] []* 2:[deprel="obj"] []* 3:[verb_form="Inf" & deprel="xcomp"] & 2.head=1.id & 3.head=1.id within <s/>
- This query finds sentences with the verbs feel, sense or perceive governing an object and an infinitive `xcomp`. There can be any number of other words between these tokens, but only within a single sentence, as in Karras felt the pulse rate suddenly drop.
- The query uses two global conditions to make sure that the object and the infinitive depend on the sensing verb, i.e. that both dependents point to the ID of the sensing verb as their head (syntactic governor).
Past conditional passive in Czech
[voice="Pass" & aux_feats="Mood=Cnd" & aux_feats=".*Tense=Past.*Tense=Past.*"]
- This query finds sentences including a verb in the passive voice and past conditional mood, e.g. … aféra by byla bývala ututlána. '… the scandal would have been hushed up.'
- The form of the content verb used in the periphrastic passive has an adjectival lemma, e.g. ututlaný 'hushed', the adjectival POS `upos=ADJ`, and its morphological categories include the features `feats="...Variant=Short|VerbForm=Part|Voice=Pass"`. On the other hand, the reflexive passive, e.g. oholil se '[he] shaved himself', is annotated as `feats="...Voice=Act"`.
- According to the UD guidelines, function words are immediate dependents of the relevant content word. In InterCorp 13ud, values of the `feats` attribute specified in multiple function words dependent on a single content word governor are concatenated into a single value. As a result, categories such as Tense can occur more than once in the value of such a `feats` attribute, because it originates in two or more auxiliaries, as in our example from byla '[she] was' and bývala '[she] used to be'.
- This double occurrence is what the query uses to target the presence of two auxiliaries. If a query looking for passive voice verbs mentioned only `[aux_feats="Tense=Past"]`, the result would also include present conditional forms, where the l-participle (the `"Tense=Past"` form) occurs just once as the passive auxiliary (… aféra by byla ututlána. '… the scandal would be hushed up.').
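The effect of the doubled `Tense=Past` pattern can be checked on simplified data in Python. The feats values below are hypothetical and abbreviated, and the sketch assumes that the pattern is matched against the whole concatenated `aux_feats` value.

```python
import re

# Pattern from the query above: Tense=Past must occur at least twice.
TWO_PAST_AUXILIARIES = r".*Tense=Past.*Tense=Past.*"

# Hypothetical, simplified feats of the Czech auxiliaries.
by = "Mood=Cnd|VerbForm=Fin"
byla = "Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"
byvala = "Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"

# Assumed concatenation of the auxiliaries' feats into one aux_feats value.
past_conditional = "|".join([by, byla, byvala])  # by byla bývala ututlána
present_conditional = "|".join([by, byla])       # by byla ututlána

print(re.fullmatch(TWO_PAST_AUXILIARIES, past_conditional) is not None)
print(re.fullmatch(TWO_PAST_AUXILIARIES, present_conditional) is not None)
# A single Tense=Past condition cannot tell the two apart.
print("Tense=Past" in past_conditional, "Tense=Past" in present_conditional)
```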
Past conditional passive in English
[feats="VerbForm=Part" & aux_feats=".*Tense=Past.*VerbForm=Inf.*Tense=Past.*"]
- This query finds sentences such as He would not have been annihilated but enslaved, and Barad-dûr would not have been destroyed but occupied.
Continuous perfect
[feats="VerbForm=Ger" & aux_feats="VerbForm=Fin" & aux_feats="VerbForm=Part"]
- This query finds sentences including continuous perfect forms (both present and past), e.g. … has been constantly increasing in velocity.
Passive of first person singular continuous
[aux_lemma="be" & aux_feats="Person=1" & aux_feats="Number=Sing" & aux_feats="VerbForm=Ger" & feats="VerbForm=Part"]
- This query finds passive participles preceded by the continuous form of the be auxiliary in first person singular, as in I’m not exactly being given much help.
- Note that the query succeeds only when there are two distinct auxiliaries dependent on the content verb participle, namely a finite form (`aux_feats="Person=1" & aux_feats="Number=Sing"`) and an ing-form (`aux_feats="VerbForm=Ger"`), because a single form cannot be a gerund and in the first person at the same time.
Past perfect
[feats="VerbForm=Part" & aux_lemma="have" & aux_lemma!="be|will|can|may|must" & aux_feats="Mood=Ind" & aux_feats="Tense=Past"]
- This query finds sentences including a verb in past perfect, e.g. Yet the streets were by then safer than they had ever been, but also It was as if the devil had suddenly re-emerged in the roar of the flames.
- The specification `aux_lemma!="be|will|can|may|must"` is necessary to exclude cases where have is not the only auxiliary verb dependent on the participle, as in you would have thought he had been bred up in the lyceum. On the other hand, be should not be included in the stoplist if passive past perfect (the last clause in the example) is also expected in the results.
Description of the list of attributes
- In Attribute list by language in 13ud or Attribute list by language in 16ud all attributes used in the specific version are listed.
- Columns indicate whether the attribute is used for the language specified by the abbreviation in the header.
- Attributes are divided into four categories, distinguished by background color.
- For brevity, only linguistically annotated languages are included. E.g. the list for 16ud omits 14 languages denoted by the language codes bn, br, bs, eo, hs, ka, mk, ml, ms, rn, si, sq, th and tl. Only the `word` and `lc` attributes can be used to query these languages.
Basic attributes
- These 12 attributes are shown on a light purple background.
- They consist of the following items: word form, lemma, part of speech, morphological categories, token order in a sentence, head reference and syntactic function.
- There are two added attributes, lc and lc_lemma, which repeat the word form and lemma without any capital letters.
- For languages with multi-part tokens (aggregates), there are also two additional attributes, sword and iword.
- The sword attribute includes the word form of the aggregate split by the “|” character into parts corresponding to syntactic words as they occur outside an aggregate, e.g. for nač and abychom the values of sword equal na|co and aby|bychom.
- The iword attribute splits the aggregate into parts without any modification; for the tokens nač and abychom the values of iword equal na|č and a|bychom.
Structural attributes
- These attributes are shown on a light blue background.
- They extend the reference to the token's syntactic governor (head) by additional attributes, making it easier to identify the head and its properties.
- All attributes of this type are available for all languages.
Function word attributes
- These attributes are shown on a light green background.
- They are given within the content word in order to specify the essential properties of the dependent function word.
- The total number of function word attributes is 20, but no language uses them all.
- Attributes refer to 6 types of function words, determined by their syntactic function in relation to the content word.
- For each function word, the lemma, part of speech, morphological categories and subtype of the function word can be specified.
- An attribute name consists of the name of the function word's syntactic function and the name of its property (attribute).
- Unused or uninformative attributes are absent for the given language. Four of the possible combinations do not occur in any language.
- The most widely used attribute is case_lemma (lemma of adpositions, most often prepositions), present in 35 languages, followed by mark_lemma (lemma of subordinate conjunctions, in 33 languages).
- The clf_lemma (lemma of classifier) attribute only appears in Chinese.
- If there are several function words of the same type for a content word, their values are separated by the “|” character.
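The naming scheme and the multi-valued format described above can be sketched in Python as follows. The helpers attribute_name and split_values are hypothetical, only function and property names mentioned on this page are used, and the sample value "will|have" is an assumed illustration rather than a value taken from the corpus.

```python
def attribute_name(function, prop):
    """Combine a syntactic function (e.g. "case") with a property (e.g. "lemma")."""
    return f"{function}_{prop}"


def split_values(value):
    """Split a multi-valued attribute on "|"; an empty value yields no items."""
    return value.split("|") if value else []


# Lemma attributes for the function types named in the text.
names = [attribute_name(f, "lemma") for f in ("case", "mark", "clf", "aux")]

# 6 function types x 4 properties give 24 possible names;
# 4 combinations never occur, which leaves the 20 attributes in use.
possible = 6 * 4
used = possible - 4

# Hypothetical aux_lemma value of a participle with two auxiliaries.
auxiliaries = split_values("will|have")

print(names, used, auxiliaries)
```

The arithmetic shows why the total is 20 rather than 24; which four combinations are missing has to be read from the attribute list itself.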
Attributes representing selected categories
- A selection of 18 attributes from the feats list is shown on a light brown background.
- Only Latvian uses them all, while Maltese uses none. In addition to the language type, their presence or absence also depends on the availability of the category in the UD data.
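According to the notes at the end of this page, a feats category such as VerbForm or NumType is rendered as verb_form or num_type, while values such as Fem keep their initial capital and are enclosed in double quotes. A minimal Python sketch of that mapping follows; the function names are hypothetical, and whether a given category is actually available as a separate attribute for a language still has to be checked in the attribute list.

```python
import re


def feature_to_attribute(feature):
    """Convert a UD feature name such as "VerbForm" to "verb_form"."""
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", feature).lower()


def feats_to_conditions(feats):
    """Turn a feats string into attribute="Value" conditions, one per category."""
    conditions = []
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        # Values keep their original capitalisation, e.g. "Fem".
        conditions.append(f'{feature_to_attribute(name)}="{value}"')
    return conditions


conditions = feats_to_conditions("Case=Nom|Gender=Fem|VerbForm=Part")
print(" & ".join(conditions))
```

The sketch covers plain CamelCase feature names only; it makes no claim about how other kinds of feature names are rendered.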
Errors and shortcomings of linguistic annotation according to UD
- POS and morphological categories do not match
- Inconsistencies in the application of the principles of uniform classification of phenomena in all languages
- Errors and inconsistencies in the given language (e.g. udělals as a unitary token)
The quality of annotation differs between languages, depending mainly on the volume and quality of the training data. It is also affected by the method and tool used for annotation.
We will be grateful for every reported error, discrepancy, deficiency, comment and suggestion at the address CNC user support. Please include the abbreviation “UD” at the beginning of the message subject.
References
Selection of literature about UD
Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): Universal Dependencies. In: Computational Linguistics, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308.
Daniel Zeman (2018): The World of Tokens, Tags and Trees. ISBN 978-80-88132-09-7.
For a complete list, see here.
Tutorials and lectures about UD
Daniel Zeman: Universal Dependencies and the Slavic Languages. Warsaw, 19/11/2018.
Joakim Nivre, Daniel Zeman, Filip Ginter, Francis M. Tyers: Tutorial on Universal Dependencies: Adding a new language to UD
Anna Nedoluzhko, Michal Novak, Martin Popel, Zdenek Zabokrtsky and Daniel Zeman: Coreference meets Universal Dependencies. Prague, 19/04/2021.
Daniel Zeman: Reflexives in Universal Dependencies. Prague, 04/03/2019.
About UD-annotated InterCorp
Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. Video, slides
Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024, slides.
Alexandr Rosen (2023): The InterCorp parallel corpus with a uniform annotation for all languages. Jazykovedný časopis, 74(1):254–265. Paper, slides.
VerbForm (in feats), rendered as verb_form, or NumType, rendered as num_type. The attribute values, such as Fem, retain the initial upper case character, but are enclosed in double quotes, like other attribute values outside feats.
iword attribute, the second form, after the dash, is the reconstructed form, i.e. the value of the sword attribute. If a parenthesis includes just one form, the two options are identical, or the given language does not provide reconstructed forms.
case="Nom".