
Differences

This shows you the differences between two versions of the page.

en:pojmy:ud [2024/06/16 21:39] – [Coordination] alexandrrosen
en:pojmy:ud [2024/10/08 21:50] (current) – [About UD-annotated InterCorp] alexandrrosen
Line 1: Line 1:
 ====== Universal Dependencies – UD ======
  
-[[https://universaldependencies.org|Universal Dependencies]] is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. Some recent versions of the [[en:cnk:intercorp|InterCorp]] parallel corpus ([[en:cnk:intercorp:verze13ud|13ud]] and [[en:cnk:intercorp:verze13ud|13ud]]) has been annotated in terms of morphological categories, syntactic functions and syntactic structure following the UD guidelines and using the tools developed within the UD project.
+[[https://universaldependencies.org|Universal Dependencies]] is an open international project aiming at linguistic annotation that is consistent across different languages. Some recent versions of the [[en:cnk:intercorp|InterCorp]] parallel corpus ([[en:cnk:intercorp:verze13ud|13ud]] and [[en:cnk:intercorp:verze16ud|16ud]]) have been annotated in terms of morphological categories, syntactic functions and syntactic structure following the UD guidelines and using the tools developed within the UD project.
  
 General guidelines for annotation are provided on the UD project website ([[https://universaldependencies.org/guidelines.html|UD Guidelines]]), including a detailed description of:
Line 12: Line 12:
   * For use in KonText, **fused forms** or //aggregates//, i.e. word forms composed of two or even three syntactic words, were split into separate tokens. In English, this concerns, for example, the forms //isn't// or //cannot//. For more details see [[en:pojmy:ud#multi-part_tokens|Multi-part tokens]] below.
   * Each word is assigned its **syntactic function** (''deprel'' – see [[en:pojmy:ud#syntactic_functions|Syntactic functions]]) and its syntactic governor in the dependency tree (''head''). To facilitate orientation in the syntactic structure, each word is also annotated with references to important properties of its head (lemma, part of speech and morphological categories), see [[en:pojmy:ud#references_to_syntactic_heads|References to syntactic head]]. If a content word occurs with a **function word** (e.g. preposition, auxiliary verb, subordinate conjunction, determiner), the content word includes some properties of the function word (see [[en:pojmy:ud#references_to_function_words|References to function words]]).
-  * **Annotations between languages differ** in the number of categorial attributes and in links to function words, see {{cnk:intercorp:ud_ic_attributes.pdf | List of attributes by language}}, described below in [[en:pojmy:ud#description_of_the_list_of_attributes|Description of the list of attributes]].
+  * **Annotations between languages differ** in the number of categorial attributes and in links to function words, see [[en:pojmy:ud#description_of_the_list_of_attributes|Description of the list of attributes]] below.
   * KonText supports queries by word class and other morphological categories using the **''Insert tag''** function, which inserts a UD POS (''upos'') and any category from the ''feats'' list into the query. The ''Insert tag'' feature is available for all linguistically annotated languages.
 ===== Morphological annotation =====
Line 175: Line 175:
   * A reference to the so-called effective head is used to identify the head regardless of whether the token is a conjunct or not, or whether it is in the initial or non-initial conjunct: the ''e_id'' attribute refers to its identifier (the sequence number of the token representing the head within the sentence), the ''eparent'' attribute to its position relative to the token.
   * In InterCorp [[en:cnk:intercorp:verze16ud|release 16ud]], there is an additional ''e_deprel'' attribute whose value equals ''deprel'' of the given token, except when the token is a non-initial conjunct, i.e. when its ''deprel'' equals ''conj''. Then the value of ''e_deprel'' equals the value of ''p_deprel'', i.e. shows the syntactic function of the whole coordination.
+  * The ''e_deprel'' attribute also has the same value as ''p_deprel'' when the ''deprel'' attribute equals ''fixed'', ''flat'', ''compound'' or ''list''. Tokens within such constructions can therefore also be found using the syntactic function of the whole construction, i.e. the ''e_deprel'' attribute.
   * To find all words with a certain syntactic function, including those that are part of a coordination, in InterCorp [[en:cnk:intercorp:verze13ud|release 13ud]], where the ''e_deprel'' attribute is not available, the solution is to use the ''p_deprel'' attribute. This attribute shows the syntactic function of the token's head. For example, a query for all direct objects, including coordinated ones, can be formulated using the disjunction operator (%%|%%) as follows: ''%%[deprel="obj" | deprel="conj" & p_deprel="obj"]%%''
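The rule relating ''e_deprel'' to ''deprel'' and ''p_deprel'' can be sketched in a few lines of Python. This is only an illustration of the description above, not the procedure actually used to annotate InterCorp; the function name ''effective_deprel'' and the constant ''INHERITING_RELATIONS'' are hypothetical.

```python
# Relations for which, according to the description above, e_deprel takes
# the value of p_deprel (the function of the whole construction).
# The constant and function names are hypothetical, chosen for illustration.
INHERITING_RELATIONS = {"conj", "fixed", "flat", "compound", "list"}


def effective_deprel(deprel: str, p_deprel: str) -> str:
    """Return the value that e_deprel would have for a token.

    deprel   -- syntactic function of the token itself
    p_deprel -- syntactic function of the token's head
    """
    if deprel in INHERITING_RELATIONS:
        return p_deprel
    return deprel


# A non-initial conjunct of a coordinated direct object.
print(effective_deprel("conj", "obj"))
# An ordinary direct object depending on the root.
print(effective_deprel("obj", "root"))
```

Under this reading, a query on ''e_deprel'' finds a token either by its own function or, for the listed relations, by the function of the construction it belongs to.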
 ===== UD and KonText =====
Line 247: Line 248:
   * [[https://www.korpus.cz/kontext/view?q=~MwKKiaMYIgcg|This query]] finds indirect objects.
   * The lemma of the indirect object's head can be listed using frequency distribution according to the attribute ''p_lemma''.
 +  * Note that in UD, dative complements in languages such as German or Czech are non-core dependents. As such, they should be labelled as ''%%deprel="obl"%%'' or (preferably but not obligatorily) ''%%deprel="obl:arg"%%''. For more details see [[https://universaldependencies.org/u/overview/syntax.html#core-arguments-vs-oblique-modifiers|Core Arguments vs. Oblique Modifiers]].
  
  
 === Direct or indirect objects, also as conjuncts ===
 
-<code>[deprel="i?obj" | deprel="conj" & p_deprel="i?obj"]</code>
+<code>[e_deprel="i?obj"]</code>
 
-  * [[https://www.korpus.cz/kontext/view?q=~TwkkE2u668ya|This query]] finds direct or indirect objects, even as non-initial conjuncts, e.g. in the sentence //In Trump, they have found a shameless **frontman** and TV **personality** who will do their bidding.//
+  * [[https://www.korpus.cz/kontext/view?q=~ROysAM6KwymO|This query]] finds direct or indirect objects, even as non-initial conjuncts, e.g. in the sentence //In Trump, they have found a shameless **frontman** and TV **personality** who will do their bidding.//
   * Note that for coordinated constituents, a separate concordance is shown for each conjunct.
-  * Either the keyword's ''deprel'' denotes the direct or indirect object (''%%deprel="i?obj"%%'', or -- equivalently -- ''%%deprel="obj|iobj"%%''), or the keyword's ''deprel'' is ''conj'' (''%%deprel="conj"%%'') and depends on a direct or indirect object (''%%p_deprel="i?obj"%%''), i.e. it is the non-initial conjunct in a coordinated constituent functioning as direct or indirect object. 
- 
  
 +<code>[deprel="i?obj" | deprel="conj" & p_deprel="i?obj"]</code>
  
 +  * [[https://www.korpus.cz/kontext/view?q=~TwkkE2u668ya|This query]] should be used in 13ud, where the ''e_deprel'' attribute is not available.
 +  * Either the keyword's ''deprel'' denotes the direct or indirect object (''%%deprel="i?obj"%%'', or -- equivalently -- ''%%deprel="obj|iobj"%%''), or the keyword's ''deprel'' is ''conj'' (''%%deprel="conj"%%'') and depends on a direct or indirect object (''%%p_deprel="i?obj"%%''), i.e. it is the non-initial conjunct in a coordinated constituent functioning as direct or indirect object.
 +  * In 16ud we get the same result using the ''e_deprel'' attribute in a simpler query: ''%%[e_deprel="i?obj"]%%''
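The logic of the 13ud query can be modelled outside KonText. The following Python sketch assumes that a CQL attribute pattern has to match the whole attribute value, which is approximated here with ''re.fullmatch''; the helper names are hypothetical and the token values are made up for illustration.

```python
import re


def value_matches(pattern: str, value: str) -> bool:
    """Assumption: a CQL pattern must match the entire attribute value."""
    return re.fullmatch(pattern, value) is not None


def matches_object_query(deprel: str, p_deprel: str) -> bool:
    """Model of [deprel="i?obj" | deprel="conj" & p_deprel="i?obj"]."""
    return value_matches("i?obj", deprel) or (
        value_matches("conj", deprel) and value_matches("i?obj", p_deprel)
    )


# The patterns "i?obj" and "obj|iobj" accept the same deprel values.
for label in ["obj", "iobj", "obl", "conj"]:
    assert value_matches("i?obj", label) == value_matches("obj|iobj", label)

# A direct object, a conjunct within a coordinated indirect object,
# and a conjunct within a coordinated subject.
print(matches_object_query("obj", "root"))
print(matches_object_query("conj", "iobj"))
print(matches_object_query("conj", "nsubj"))
```

The first disjunct covers tokens that are objects themselves; the second covers non-initial conjuncts whose head is an object.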
  
 === Proper nouns as subjects, also as conjuncts ===
Line 266: Line 270:
   * [[https://www.korpus.cz/kontext/view?q=~cuIC4msKMsAW|This query]] finds proper nouns as subjects, including non-initial conjuncts.
   * Concordances include sentences such as //And what does **Crump** say?// or //“I never even saw her,” said **Pat**.//
 +  * In 16ud, the same query can be simplified using the ''e_deprel'' attribute:
 +
 +<code>[e_deprel="nsubj" & upos="PROPN"]</code> 
  
 === Gerunds preceded by "with" as the marker ===
Line 325: Line 332:
 ===== Description of the list of attributes =====
 
-  * In {{cnk:intercorp:ud_ic_attributes.pdf | Attribute list by language}}all attributes used in the corpus are listed.
+  * In {{cnk:intercorp:ud_ic_attributes.pdf | Attribute list by language in 13ud}} or {{cnk:intercorp:ud_ic16ud_attributes.pdf | Attribute list by language in 16ud}} all attributes used in the specific version are listed.
   * Columns indicate whether the attribute is used for the language specified by the abbreviation in the header.
   * Attributes are divided into four categories, distinguished by background color.
 +  * For brevity, only linguistically annotated languages are included. E.g. the list for 16ud omits 14 languages denoted by the language codes bn, br, bs, eo, hs, ka, mk, ml, ms, rn, si, sq, th and tl. Only the ''word'' and ''lc'' attributes can be used to query these languages.
  
 ==== Basic attributes ====
Line 341: Line 349:
 ==== Structural attributes ====
 
-  * These attributes are on the <fc #6495ed>light blue</fc> background.
+  * These attributes are on the <fc #6495ed>light blue</fc> background.
   * They extend the reference to the token's syntactic governor (''head'') by additional attributes, making it easier to identify the head and its properties.
   * All attributes of this type are available for all languages.
Line 393: Line 401:
  
 Daniel Zeman: [[https://lectures.ms.mff.cuni.cz/view.php?rec=421|Reflexives in Universal Dependencies]]. Prague, 04/03/2019.
 +
 +==== About UD-annotated InterCorp ====
 +
 +Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. [[https://www.youtube.com/watch?v=wJrCez_XPQY|Video]], [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/C4%20Nadvornikova%20Analyse%20contrastiv%20e%20de%20la%20complexité%20syntaxique.pdf|slides]]
 +
 +Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024, [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/2024_UDCM_Wwa.pdf|slides]].
 +
 +Alexandr Rosen (2023). The InterCorp parallel corpus with a uniform annotation for all languages. Jazykovedný časopis, 74(1):254–265. [[https://www.juls.savba.sk/ediela/jc/2023/1/jc23-01.pdf|Paper]], [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/rosen-slovko-2023.pdf|slides]].
 +
 +