Universal Dependencies is an open international project aiming at linguistic annotation that is consistent across different languages. Some recent versions of the InterCorp parallel corpus (13ud and 16ud) have been annotated for morphological categories, syntactic functions and syntactic structure following the UD guidelines and using the tools developed within the UD project.
General guidelines for annotation are provided on the UD project website (UD Guidelines), including a detailed description of:
Key specifics of the UD annotation as used in InterCorp:
tag attribute. In the UD annotation, these language-specific tags are retained for most languages, in the xpos attribute. However, the UD word class and morphological categories, denoted uniformly for all languages, are listed separately as values of the upos and feats attributes (see Parts of speech and Other categories below, respectively).
Frequently used morphological categories from the feats list have been promoted to the status of regular attributes at the same level as upos. This applies, for example, to morphological case, number, gender or person (case, number, gender, person).
deprel – see Syntactic functions) and its syntactic governor in the dependency tree (head).
To facilitate orientation in the syntactic structure, each word is also annotated with references to important properties of its head (lemma, part of speech and morphological categories), see References to syntactic head.
If a content word occurs with a function word (e.g. preposition, auxiliary verb, subordinating conjunction, determiner), the content word includes some properties of the function word (see References to function words).
Insert tag function, which inserts a UD POS (upos) and any category from the feats list into the query. The Insert tag feature is available for all linguistically annotated languages.
upos attribute.
upos are the same for all languages.
upos, most languages provide a language-specific morphological tag, as the value of the xpos attribute. The xpos value is usually identical to a corresponding tag from the other, non-UD-based versions of InterCorp.

upos | gloss |
---|---|
ADJ | adjective |
ADP | adposition (incl. preposition) |
ADV | adverb |
AUX | auxiliary verb |
CCONJ | coordinating conjunction |
DET | determiner |
INTJ | interjection |
NOUN | noun |
NUM | numeral |
PART | particle |
PRON | pronoun |
PROPN | proper noun |
PUNCT | punctuation |
SCONJ | subordinating conjunction |
SYM | symbol |
VERB | verb |
X | other |
feats attribute. Their choice and values are determined by part of speech and language.
Number=Sing.
feats attribute.
feats attribute are separated by “|”, e.g. the Russian form школы /'ʂkolɨ/ 'school' in genitive singular is marked as feats="Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing".
[upos="NOUN" & feats="Number=Sing"]. The Russian form is found by the query [upos="NOUN" & feats="Gender=Fem" & feats="Case=Gen"]. The order of categories in the query is irrelevant.
feats can also be treated as a string of characters using regular expressions, e.g. [upos="NOUN" & feats=".*Case=Gen.*Gender=Fem.*"]. Here the order of categories in the query must correspond to their order in the corpus. The result is the same in both cases.
feats are also listed outside the list as categorial attributes at the same level as upos. As a result, a query for a singular noun can be simply as follows: [upos="NOUN" & number="Sing"]. Similarly, the query for the Russian form [upos="NOUN" & gender="Fem" & case="Gen"] gives the same result as the two queries above. Categorial attributes can also be used to generate frequency lists.1) Such attributes appear on the light brown background in Attribute list by language or in KonText in the lower part of the table shown in View / Corpus-specific settings….

category | gloss | example values |
---|---|---|
Abbr | abbreviation | Yes |
Animacy | animacy | Anim, Inan, Hum, Nhum |
Aspect | aspect | Imp, Perf, Hab, Iter, Prog, Prosp |
Case | case | Nom, Gen, Dat, Acc, Voc, Loc, Ins, … |
Definite | definiteness | Ind, Def, … |
Degree | degree | Pos, Cmp, Sup, Equ, Abs |
Foreign | foreign word | Yes |
Gender | gender | Fem, Masc, Neut, Com |
Mood | mood | Ind, Imp, Cnd, … |
NumType | numeral type | Card, Ord, Mult, Frac, Sets, … |
Number | number | Sing, Plur, Dual, Ptan, Coll, … |
Person | person | 1, 2, 3, … |
Polarity | polarity | Neg, Pos |
Polite | politeness | Infm, Form, Elev, Humb |
Poss | possessiveness | Yes |
PronType | type of pronoun etc. | Prs, Rcp, Art, Int, Rel, Exc, Dem, Emp, Tot, Ind |
Reflex | reflexiveness | Yes |
Tense | tense | Pres, Past, Fut, Pqp, Imp |
Typo | typo | Yes |
VerbForm | verb form | Fin, Inf, Part, Conv, Ger, Vnoun, Sup |
Voice | voice | Act, Pass, Mid, Cau, … |
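
The two ways of querying feats described above (as separate Category=Value pairs, where order is irrelevant, or as one string matched by a regular expression, where order matters) can be illustrated outside the corpus manager. The following is only a minimal Python sketch: the helper names `parse_feats` and `has_all` are hypothetical, and it assumes that a regular expression has to match the whole attribute value, which is what the `.*` padding in the query above suggests.

```python
import re

def parse_feats(feats):
    """Split a feats string such as 'Case=Gen|Number=Sing' into a dict."""
    return dict(pair.split("=", 1) for pair in feats.split("|") if pair)

def has_all(feats, wanted):
    """Order-independent test: every wanted 'Category=Value' pair is present."""
    present = set(feats.split("|"))
    return all(pair in present for pair in wanted)

# feats value of the Russian form quoted in the text
shkoly = "Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"

parsed = parse_feats(shkoly)

# Pair-wise conditions: the order in which they are listed does not matter.
set_match_a = has_all(shkoly, ["Gender=Fem", "Case=Gen"])
set_match_b = has_all(shkoly, ["Case=Gen", "Gender=Fem"])

# String-based condition: the pattern must follow the order used in the value
# (assumption: the whole value has to match, hence re.fullmatch).
regex_in_order = re.fullmatch(r".*Case=Gen.*Gender=Fem.*", shkoly) is not None
regex_wrong_order = re.fullmatch(r".*Gender=Fem.*Case=Gen.*", shkoly) is not None
```

In this sketch both pair-wise tests succeed, while the regular expression succeeds only when Case precedes Gender, mirroring the order of categories within the feats value.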
lemma attribute, with the “|” sign as the separator. It is therefore possible to search for them like other words, by typing the full form into the search box in a simple query (e.g. ses in Czech, can't in English or byłbym in Polish), or, in an advanced query using the CQL search language, by giving the same strings as the value of the word attribute.
iword attribute shows the original form is|n't or se|s, while the sword attribute shows the unabbreviated, “reconstructed” version: is|not or se|jsi.2)
is|n't – is|not) or cannot (can|not),3) in Czech there are tokens such as abychom (a|bychom – aby|bychom), bylas (byla|s – byla|jsi) or oč (o|č – o|co); in German zur (zu|r – zu|der) or am (a|m – an|dem); in Polish miałam (miała|m), żebyś (że|by|ś) or chciałbym (chciał|by|m); in French des (de|s – de|les), aux (au|x – à|les) or auquel (au|quel – à|lequel).
deprel) and a reference to its syntactic governor (head).
acl:relcl indicates an attribute expressed by a relative clause. The list below contains only subtypes relevant to English and represented in the corpus. Functions with subtypes for all languages are listed at Universal Dependency Relations.
acl, whether or not the deprel has a subtype, use the expression deprel="acl.*" instead of deprel="acl". To find all auxiliary verbs, use the expression deprel="aux.*" instead of deprel="aux". To find all subjects, use the expression deprel="nsubj.*".
deprel="conj". The syntactic function of the entire coordination is thus specified by the deprel attribute of the first conjunct, the head of all other conjuncts. To query the “true” deprel of a non-initial conjunct (deprel="conj"), use the p_deprel attribute. See Coordination below for details.

deprel | gloss | example4) |
---|---|---|
acl | adnominal clause, finite or non-finite | The convent of the Poor Clares, known as the Minories, was destroyed to make way for storehouses. |
acl:relcl | relative adnominal clause | London has always been a vast ocean in which survival is not certain. |
advcl | adverbial clause | The country will pay a heavy price if the president’s obsessions prevail for long. |
advmod | adverbial modifier | They were all corrupt opportunists. Gorshkov knew where that idea came from . |
amod | adjectival modifier | The sustainable future of humanity is at stake. |
appos | apposition | They were going to a new home, a house of her choosing. |
aux | auxiliary verb | We have made our voice heard by the world. It's going to work. You can't start improvising now. |
aux:pass | passive auxiliary | Men like that are born only once. Who else should I get dressed up for if not her? |
case | case marking (incl. preposition) | Karpov's own career might hang in the balance |
cc | coordinating conjunction | I now invite you all to eat, drink, and make yourselves at home! |
cc:preconj | preconjunct | They are poisoning both the water and the soil. |
ccomp | clausal complement | I doubt whether the new model is an improvement. |
clf | classifier | 三个学生 sān gè xuéshēng |
compound | compound | In Gondor ten thousand years would not suffice. |
compound:prt | phrasal verb particle | He laid out the city’s streets and rebuilt its walls. |
conj | non-initial conjunct | You have two parents and you always will have. |
cop | copula | Where's the rest of your luggage? |
csubj | clausal subject, finite or nonfinite | It's quite easy to clear up these contradictions. But the most important thing is you shouldn't lose too much time. |
csubj:pass | clausal subject of passive clause | Taking notes has been banned. |
dep | unspecified dependency | By the 1860s, the South was utterly flush with cash. My dad doesn't really not that good. |
det | determiner | What way they went I don’t know and no rabbit knows . |
det:predet | predeterminer | People get sick all the time. |
discourse | discourse element | ‘Yes, please,’ said Ron. Oh dear, what a bore! |
dislocated | dislocated elements | Dumplings I like. |
expl | expletive | There is a ghost in the room. |
fixed | non-initial parts of fixed multiword unit | At least there's one of you brave enough! Of course there may be exceptions. |
flat | non-initial parts of flat multiword unit | Let's go to San Francisco. What was Miss O'Hara up to? |
flat:foreign | non-initial parts of flat multiword unit | During the colonial period it was called the Portal de los Mercaderes . |
goeswith | non-initial parts of incorrectly split form | They come here with out legal permission. |
iobj | indirect object | He brought us eggs. Can I buy you a drink? |
list | non-initial parts of list | Steve Jones tel.: 555-9814 e-mail: jones@abc.edf |
mark | marker | I spent the night telling jokes to keep Petrik from falling asleep at the wheel. I just want to know what you are thinking about when you wake up. |
nmod | nominal modifier | Did they put some fish near the infant's grave for his journey into the afterlife ? |
nmod:npmod | noun phrase as adverbial modifier | He was younger then and a lot more agile. It seemed that everyone had trembling hands and tear-filled eyes. |
nmod:poss | possessive nominal modifier | Many saw it as a good thing that her show was taken off the air. |
nmod:tmod | temporal modifier | In Plenary today I supported the amendment. |
nsubj | nominal subject | Those who venture upon its currents look for prosperity or fame, even if they often founder in its depths. |
nsubj:pass | nominal subject of passive clause | The horses were adorned with just one red scarf. |
nummod | numeric modifier | Dissolution does but give birth to fresh modes of organization, and one death is the parent of a thousand lives. |
obj | object | But who can stop the people? What do you mean? I don't know what to do. |
obl | oblique nominal | We might bring an avalanche down on ourselves for no good reason . |
obl:npmod | noun phrase as oblique nominal | I get fed up a little sometimes. |
obl:tmod | temporal modifier | I leave tomorrow. Tell him everything, tonight. |
orphan | orphan after elided head | Mary won gold and Peter bronze. |
parataxis | parataxis (incl. parentheticals) | “Is that the only reason?” she asked, putting her eyes close to mine. |
punct | punctuation | Is that all? |
reparandum | overridden disfluency | Go to the right- to the left. |
root | root | This was not a good moment in the history of English cuisine. |
vocative | vocative | See you later, Sam. |
xcomp | open clausal complement | Maria saw me standing at the mirror. |
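
The difference between deprel="acl" and deprel="acl.*" mentioned before the table can be tried out on a handful of labels. This is a small Python sketch, not a description of the corpus manager itself: the function `matching_deprels` is hypothetical, and `re.fullmatch` stands in for the assumption that a pattern must cover the whole attribute value.

```python
import re

def matching_deprels(pattern, deprels):
    """Return the labels that the pattern matches as a whole."""
    return [d for d in deprels if re.fullmatch(pattern, d)]

# A few labels taken from the table above
sample = ["acl", "acl:relcl", "aux", "aux:pass", "nsubj", "nsubj:pass", "csubj"]

only_plain_acl = matching_deprels("acl", sample)    # bare function only
acl_any = matching_deprels("acl.*", sample)         # function with or without subtype
auxiliaries = matching_deprels("aux.*", sample)
subjects = matching_deprels("nsubj.*", sample)
```

Note that nsubj.* leaves out clausal subjects (csubj); a pattern such as .subj.* would be needed to cover both in this sketch.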
head as the word ID of the head, i.e. its word order position within the sentence, or parent as its position relative to the given word), some other attributes of the head are listed for each token: lemma (p_lemma), POS (p_upos), morphological categories (p_feats), and syntactic function (p_deprel).
case_lemma, morphological categories of an auxiliary by aux_feats, morphological categories of a copula by cop_feats, part of speech of a determiner by det_upos, lemma of a marker by mark_lemma.
syn2020).
deprel attribute: aux (auxiliaries), case (prepositions), mark (markers), cop (copula), det (determiners), and clf (classifiers).
lemma, upos, feats and a more detailed specification of type, e.g. aux_type="pass" (see passive auxiliary), or det_type="numgov" (see pronominal quantifier governing the case of the noun).
deprel and attribute. For example, case_lemma specifies the lemma of the noun or pronoun's preposition, the aux_feats attribute of a content verb specifies morphological categories of its auxiliary.
“|”, then appear in the appropriate attribute. The feats attribute values from multiple auxiliary verbs dependent on a single content verb are combined into a single value where some categories, such as verb form specifications, may be repeated because they come from more than one form. For example, in the sentence who would have guessed that, the aux_feats of the content verb guessed are composed of the feats of the auxiliary verbs would (Mood=Ind|Person=3|Tense=Past|VerbForm=Fin) and have (VerbForm=Inf).
conj.
cc.
e_id attribute refers to its identifier (the sequence number of the token representing the head within the sentence), the eparent attribute to its position relative to the token.
e_deprel attribute whose value equals deprel of the given token, except when the token is a non-initial conjunct, i.e. when its deprel equals conj. Then the value of e_deprel equals the value of p_deprel, i.e. shows the syntactic function of the whole coordination.
e_deprel attribute has the same value as p_deprel also when the deprel attribute equals fixed, flat, compound or list. Tokens within such constructions can also be found using the syntactic function of the whole construction, i.e. the e_deprel attribute.
e_deprel attribute is not available, the solution is to use the p_deprel attribute. This attribute shows the syntactic function of the token's head. For example, a query for all direct objects, including coordinated ones, can be formulated using the disjunction operator (|) as follows:
[deprel="obj" | deprel="conj" & p_deprel="obj"]
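
The relation between deprel, p_deprel and e_deprel described above can be restated as a short Python sketch. It is only an illustration under stated assumptions: the function `effective_deprel` and the toy sentence are hypothetical, only the bare labels conj, fixed, flat, compound and list are treated as inheriting (subtypes such as compound:prt are not covered), the value is taken one step up only, and the empty p_deprel of the root is a placeholder.

```python
# Labels for which e_deprel is described as taking over the value of p_deprel
INHERITING = {"conj", "fixed", "flat", "compound", "list"}

def effective_deprel(deprel, p_deprel):
    """Value e_deprel would have according to the description in the text."""
    return p_deprel if deprel in INHERITING else deprel

# Toy annotation of "He reads books and newspapers" (hand-made, not corpus data)
tokens = [
    {"word": "He", "deprel": "nsubj", "p_deprel": "root"},
    {"word": "reads", "deprel": "root", "p_deprel": ""},   # placeholder for the root
    {"word": "books", "deprel": "obj", "p_deprel": "root"},
    {"word": "and", "deprel": "cc", "p_deprel": "conj"},
    {"word": "newspapers", "deprel": "conj", "p_deprel": "obj"},
]

# Counterpart of [e_deprel="obj"]
via_e_deprel = [
    t["word"] for t in tokens
    if effective_deprel(t["deprel"], t["p_deprel"]) == "obj"
]

# Counterpart of [deprel="obj" | deprel="conj" & p_deprel="obj"]
via_disjunction = [
    t["word"] for t in tokens
    if t["deprel"] == "obj" or (t["deprel"] == "conj" and t["p_deprel"] == "obj")
]
```

Both selections return the two coordinated objects, which is why the disjunctive query can stand in for e_deprel where that attribute is missing.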
xpos attribute instead of the tag attribute. E.g. the query for feminine nouns in the vocative singular in Czech can be entered as follows: [xpos = "NNFS5.*"].
upos and feats, respectively. Their values can be entered using the Insert tag function.
upos) are the same for all languages. E.g. a query for proper names without using the Insert tag function can be specified as follows: [upos = "PROPN"].
feats attribute. Some of them are available separately under categorial attributes. For details see Other categories above.
Insert tag function, which lets you select the POS and/or the values of the relevant categories (properties) from the feats list in all linguistically annotated languages. The offer of properties for a given POS is determined by their actual occurrence in the corpus, so the list may reflect incorrect combinations.
deprel attribute (see Syntactic functions above).
Formatted text in the context box header, a concordance will appear along with the nearest context in a form that is close to the typography of the original text. For example, there are no spaces between the end of a word and punctuation, and paragraphs are separated by a blank line.
iword attribute). After clicking on such a node, in addition to the lemma of the given part of the multi-word token, its full form (as a separate word, the sword attribute) and the word form of the entire token (word) also appear.
The queries mainly show the possibilities of using syntactic functions in connection with parts of speech and morphological categories, but also include references to syntactic heads and dependent auxiliaries. Most of the queries concern English, but they are also applicable to other languages, although a specific language may require some modifications to the query. Queries can be entered in one language, or in two or more languages in parallel.
[deprel="nsubj" & p_lemma="sing"]
Frequency / Lemmas).

[deprel="nsubj" & lemma="bird"]
p_lemma attribute (in the KonText menu: Frequency / Custom... / Attribute: p_lemma).

[case_lemma="about" & case="Acc"]
case="Acc". For nouns, the case attribute is not specified.6)
case="Acc". The query [case_lemma="about"] finds all nominals governing the preposition about, i.e. all nominals in prepositional phrases beginning with this preposition, including sentences such as ‘May I ask what this is all about, sir?’ said Bigwig.
p_lemma attribute (in the KonText menu: Frequency / Custom... / Attribute: p_lemma).

[deprel="iobj"]
p_lemma.
deprel="obl" or (preferably but not obligatorily) deprel="obl:arg". For more details see Core Arguments vs. Oblique Modifiers.

[e_deprel="i?obj"]
[deprel="i?obj" | deprel="conj" & p_deprel="i?obj"]
e_deprel attribute is not available.
deprel denotes the direct or indirect object (deprel="i?obj", or – equivalently – deprel="obj|iobj"), or the keyword's deprel is conj (deprel="conj") and depends on a direct or indirect object (p_deprel="i?obj"), i.e. it is the non-initial conjunct in a coordinated constituent functioning as direct or indirect object.
e_deprel attribute in a simpler query:
[deprel="nsubj" & upos="PROPN" | deprel="conj" & p_deprel="nsubj" & upos="PROPN"]
e_deprel attribute:
[e_deprel="nsubj" & upos="PROPN"]
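
When the longer, coordination-aware form of such queries has to be written for several syntactic functions, the repeated conditions are easy to get wrong. The following Python sketch assembles the query text; the helper `coordination_aware_query` is hypothetical and only produces a string to paste into the advanced query box, it does not talk to KonText.

```python
def coordination_aware_query(function, extra=""):
    """Build a CQL token query that also covers non-initial conjuncts.

    function: value (or regular expression) for deprel, e.g. 'nsubj' or 'i?obj'
    extra:    optional further condition repeated in both branches,
              e.g. 'upos="PROPN"'
    """
    suffix = f" & {extra}" if extra else ""
    return (
        f'[deprel="{function}"{suffix} | '
        f'deprel="conj" & p_deprel="{function}"{suffix}]'
    )

proper_name_subjects = coordination_aware_query("nsubj", 'upos="PROPN"')
objects = coordination_aware_query("i?obj")
```

The two generated strings correspond to the queries without e_deprel shown above; the extra condition has to appear in both branches because it applies to the keyword in either case.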
[verb_form="Ger" & mark_lemma="with"]
p_lemma.
p_upos and p_feats.

1:[lemma="feel|sense|perceive"] []* 2:[deprel="obj"] []* 3:[verb_form="Inf" & deprel="xcomp"] & 2.head=1.id & 3.head=1.id within <s/>
xcomp. There can be any number of other words between these tokens, but only within a single sentence, as in Karras felt the pulse rate suddenly drop.

[voice="Pass" & aux_feats="Mood=Cnd" & aux_feats=".*Tense=Past.*Tense=Past.*"]
upos=ADJ and its morphological categories include the features feats="...Variant=Short|VerbForm=Part|Voice=Pass". On the other hand, reflexive passive, e.g. oholil se '[he] shaved himself', is annotated as feats="...Voice=Act".
feats attribute specified in multiple function words dependent on a single content word governor are concatenated into a single value. If so, categories such as Tense can occur more than once in the value of such a feats attribute, because they originate in two or more auxiliaries, as in our example from byla '[she] was' and bývala '[she] used to be'.
[aux_feats="Tense=Past"], the result would also include present conditional forms, where the l-participle (the "Tense=Past" form) occurs just once as the passive auxiliary (… aféra by byla ututlána. 'the scandal would be hushed up.').

[feats="VerbForm=Part" & aux_feats=".*Tense=Past.*VerbForm=Inf.*Tense=Past.*"]
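
The reason why a pattern with a repeated category, such as .*Tense=Past.*Tense=Past.*, singles out forms with two past-tense auxiliaries can be seen in a small Python sketch. Everything in it is an assumption made for illustration: the helper `combine_aux_feats` simply joins the feats of the auxiliaries with “|” in their given order, the Czech feats values are simplified, and `re.fullmatch` stands in for whole-value matching.

```python
import re

def combine_aux_feats(aux_feats_list):
    """Join the feats of several auxiliaries into one aux_feats-like value."""
    return "|".join(aux_feats_list)

# English example from the text: "who would have guessed that"
english = combine_aux_feats(
    ["Mood=Ind|Person=3|Tense=Past|VerbForm=Fin", "VerbForm=Inf"]
)

# Simplified, illustrative feats of Czech auxiliaries:
# by (conditional), byla and bývala (l-participles)
past_conditional = combine_aux_feats(
    ["Mood=Cnd", "Tense=Past|VerbForm=Part", "Tense=Past|VerbForm=Part"]
)
present_conditional = combine_aux_feats(["Mood=Cnd", "Tense=Past|VerbForm=Part"])

TWO_PAST = r".*Tense=Past.*Tense=Past.*"
past_matches = re.fullmatch(TWO_PAST, past_conditional) is not None
present_matches = re.fullmatch(TWO_PAST, present_conditional) is not None
```

Under these assumptions only the value built from two l-participles satisfies the pattern, while a single Tense=Past, as in the present conditional passive, does not.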
[feats="VerbForm=Ger" & aux_feats="VerbForm=Fin" & aux_feats="VerbForm=Part"]
[aux_lemma="be" & aux_feats="Person=1" & aux_feats="Number=Sing" & aux_feats="VerbForm=Ger" & feats="VerbForm=Part"]
aux_feats="Person=1" & aux_feats="Number=Sing" and an ing-form aux_feats="VerbForm=Ger", because a single form cannot be a gerund and in the first person at the same time.

[feats="VerbForm=Part" & aux_lemma="have" & aux_lemma!="be|will|can|may|must" & aux_feats="Mood=Ind" & aux_feats="Tense=Past"]
aux_lemma!="be|will|can|may|must" is necessary to exclude cases where have is not the only auxiliary verb dependent on the participle, as in you would have thought he had been bred up in the lyceum. On the other hand, be should not be included in the stoplist if passive past perfect (the last clause in the example) is also expected in the results.
word and lc attributes can be used to query these languages.
lc and lc_lemma, which repeat word form and lemma without any capital letters.
sword and iword attributes.
sword attribute includes the word form of the aggregate split by the “|” character into parts corresponding to syntactic words as they occur outside an aggregate, e.g. for nač and abychom the values of sword equal na|co and aby|bychom.
iword attribute splits the aggregate into parts without any modification; for the tokens nač and abychom the values of iword equal na|č and a|bychom.
head) by additional attributes, making it easier to identify the head and its properties.
case_lemma (lemma of adposition, most often prepositions), followed by mark_lemma (lemma of subordinating conjunctions, in 33 languages).
clf_lemma (lemma of classifier) attribute only appears in Chinese.
feats list.
The quality of annotation differs between languages, mainly as a result of the volume and quality of the training data. It is also affected by the method and tool used for annotation.
We will be grateful for any report of an error, discrepancy or deficiency, and for any comment or suggestion, sent to CNC user support. Please include the abbreviation “UD” at the beginning of the message subject.
Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): Universal Dependencies. In: Computational Linguistics, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308.
Daniel Zeman (2018): The World of Tokens, Tags and Trees. ISBN 978-80-88132-09-7.
For a complete list, see here.
Daniel Zeman: Universal Dependencies and the Slavic Languages. Warsaw, 19.11.2018.
Joakim Nivre, Daniel Zeman, Filip Ginter, Francis M. Tyers: Tutorial on Universal Dependencies: Adding a new language to UD
Anna Nedoluzhko, Michal Novak, Martin Popel, Zdenek Zabokrtsky and Daniel Zeman: Coreference meets Universal Dependencies. Prague, 19/04/2021.
Daniel Zeman: Reflexives in Universal Dependencies. Prague, 04/03/2019.
Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. Video, slides
Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024, slides.
Alexandr Rosen (2023). The InterCorp parallel corpus with a uniform annotation for all languages. Jazykovedný časopis, 74(1):254–265. Paper, slides.
VerbForm (in feats), rendered as verb_form, or NumType, rendered as num_type. The attribute values, such as Fem, retain the initial upper-case character, but are enclosed in double quotes, like other attribute values outside feats.
iword attribute, the second form, after the dash, is the reconstructed form, i.e. the value of the sword attribute. If a parenthesis includes just one form, the two options are identical, or the given language does not provide reconstructed forms.
case="Nom".