
Differences

This shows you the differences between two versions of the page.

en:pojmy:ud [2023/04/03 22:27] – [Multi-part tokens] alexandrrosen
en:pojmy:ud [2023/04/04 10:22] – [Corpus Search] alexandrrosen
Line 82:
   * Some tokens, in the UD parlance called **fused words**, or **aggregates** in some Czech corpus-related literature, consist of multiple parts. These parts correspond to different nodes in the syntactic structure. In English, such tokens represent **contractions**, consisting of a verb and the negative particle such as //isn't// or //cannot//
   * The orthographic form of these words is preserved in the corpus, the individual parts are separated only in the annotation - e.g. in the value of the ''lemma'' attribute, with the "|" sign as the separator. It is therefore possible to search for them like other words, by typing the full form into the search box in a simple query (e.g. //ses// in Czech, //can't// in English or //byłbym// in Polish), or in the advanced query using the CQL search language give the same strings as the value of the **''word''** attribute .
-  * In some languages, including English and Czech, a part of the fused token has a different form when occurring in a different context as an orthographically separate word. E.g. //n't//, a part of //isn't//, corresponds to //not//, the Czech auxiliary clitic //s//, a part of //ses//, corresponds to //jsi//. Both variants are represented in the annotation: the **''iword''** attribute shows the original form ''is|n't'' or ''se|s'', while the **''sword''** attribute shows the unabreviated, "reconstructed" version: ''is|not'' or ''se|jsi''.((Aggregates are present in the following languages: ar, ca, cs, de, el, en, es, fi, fr, he, it, pl, pt, tr and uk. A list of all aggregates for a given language is displayed as the frequency distribution of word forms following the query %%[sword = ".|.+"]%%.))
+  * In some languages, including English and Czech, a part of the fused token has a different form when occurring in a different context as an orthographically separate word. E.g. //n't//, a part of //isn't//, corresponds to //not//, the Czech auxiliary clitic //s//, a part of //ses//, corresponds to //jsi//. Both variants are represented in the annotation: the **''iword''** attribute shows the original form ''is|n't'' or ''se|s'', while the **''sword''** attribute shows the unabbreviated, "reconstructed" version: ''is|not'' or ''se|jsi''.((Aggregates are present in the following languages: ar, ca, cs, de, el, en, es, fi, fr, he, it, pl, pt, tr and uk. A list of all aggregates for a given language is displayed as the frequency distribution of word forms following the query %%[sword = ".|.+"]%%.))
   * In addition to the English tokens //isn't// (''is|n't'' – ''is|not'') or //cannot// (''can|not''),((The first form, preceding the dash, is the original form, i.e. the value of the ''iword'' attribute, the second form, after the dash, is the reconstructed form, i.e. the value of the ''sword'' attribute. If a parenthesis includes just one form, the two options are identical, or the given language does not provide reconstructed forms.)) in Czech there are tokens such as //abychom// (''a|bychom'' – ''aby|bychom''), //bylas// (''byla|s'' – ''byla|jsi'') or //oč// (''o|č'' – ''o|co''); in German //zur// (''zu|r'' – ''zu|der'') or //am// (''a|m'' – ''an|dem''); in Polish //miałam// (''miała|m''), //żebyś// (''że|by|ś'') or //chciałbym// (''chciał|by|m''); in French //des// (''de|s'' – ''de|les''), //aux// (''au|x'' – ''à|les'') or //auquel// (''au|quel'' – ''à|lequel'').
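
The relation between the ''iword'' and ''sword'' values described above can be illustrated with a minimal Python sketch. It is not part of the corpus tooling: the helper name ''pair_parts'' is hypothetical, and the sketch assumes that both values use "|" as the separator and contain the same number of parts, as in the examples listed above.

```python
def pair_parts(iword: str, sword: str, sep: str = "|") -> list[tuple[str, str]]:
    """Pair each original part of a fused token with its reconstructed part.

    Hypothetical helper: assumes both values have the same number of parts.
    """
    original = iword.split(sep)
    reconstructed = sword.split(sep)
    if len(original) != len(reconstructed):
        raise ValueError(f"part count differs: {iword!r} vs {sword!r}")
    return list(zip(original, reconstructed))


# (iword, sword) pairs taken from the examples in the list above.
examples = {
    "isn't": ("is|n't", "is|not"),
    "bylas": ("byla|s", "byla|jsi"),
    "zur": ("zu|r", "zu|der"),
    "aux": ("au|x", "à|les"),
}

for token, (iword, sword) in examples.items():
    print(token, pair_parts(iword, sword))
```

For languages without reconstructed forms, such as the Polish examples, the same value could be passed for both arguments.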
  
Line 97:
   * In some languages, some deprels may have **subtypes**. The subtype name follows the colon after the deprel name, e.g. ''acl:relcl'' indicates an attribute expressed by a relative clause. The list below contains only subtypes relevant to English and represented in the corpus. Functions with subtypes for all languages are listed at [[https://universaldependencies.org/u/dep/index.html|Universal Dependency Relations]].
   * When querying a deprel that may have a subtype, a possible subtype should be taken into account. For example, to find all words with the deprel ''acl'', whether or not the deprel has a subtype, use the expression ''%%deprel="acl.*"%%'' instead of ''%%deprel="acl"%%''. To find all auxiliary verbs, use the expression ''%%deprel="aux.*"%%'' instead of ''%%deprel="aux"%%''. To find all subjects, use the expression ''%%deprel="nsubj.*"%%''.
-  * When a queried deprel targets a **coordinated structure**, only the first conjunct is found. The second and subsequent conjuncts are marked as ''%%deprel="conj"%%''. The syntactic function of the entire coordination is thus specified by the ''deprel'' attribute of the first cunjunct, the head of all other conjuncts. To query the "true" deprel of a non-initial conjunct (''%%deprel="conj"%%''), use the ''p_deprel'' attribute. See [[en:pojmy:ud#coordination|Coordination]] below for details.
+  * When a queried deprel targets a **coordinated structure**, only the first conjunct is found. The second and subsequent conjuncts are marked as ''%%deprel="conj"%%''. The syntactic function of the entire coordination is thus specified by the ''deprel'' attribute of the first conjunct, the head of all other conjuncts. To query the "true" deprel of a non-initial conjunct (''%%deprel="conj"%%''), use the ''p_deprel'' attribute. See [[en:pojmy:ud#coordination|Coordination]] below for details.
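
Why ''%%deprel="acl.*"%%'' finds more than ''%%deprel="acl"%%'' can be shown with a small Python sketch. It assumes that the query engine matches the pattern against the whole attribute value, which is emulated here with ''re.fullmatch''; the helper name ''deprel_matches'' and the sample values are illustrative only.

```python
import re


def deprel_matches(pattern: str, deprel: str) -> bool:
    """Return True if the pattern matches the entire deprel value.

    Assumption: the corpus query engine anchors the pattern at both ends.
    """
    return re.fullmatch(pattern, deprel) is not None


# Sample deprel values, with and without subtypes.
deprels = ["acl", "acl:relcl", "aux", "aux:pass", "nsubj", "nsubj:pass", "conj"]

# The bare name matches only the value without a subtype.
print([d for d in deprels if deprel_matches("acl", d)])

# Adding ".*" also matches any value with a subtype.
print([d for d in deprels if deprel_matches("acl.*", d)])
print([d for d in deprels if deprel_matches("nsubj.*", d)])
```

Under the same assumption, a stricter pattern such as ''%%acl(:.*)?%%'' would accept only the bare name or the name followed by a colon and a subtype.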
  
  
Line 181:
 === Basic query ===
  
-  * A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.((In a basic query, it is no longer necessary in some languages to separate parts of the aggregate with a space, eg //był//, //by//, and  //m// of the Polish agglutinated form //byłbym // or //is// and //n't// of the English contraction //isn't//, even in a longer expression (//aren't I//). However, a basic query for //is// or //n't// will not show concordances including the for //isn't//.))
+  * A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.((In a basic query, it is no longer necessary in some languages to separate parts of the aggregate with a space, eg //był//, //by//, and  //m// of the Polish agglutinated form //byłbym // or //is// and //n't// of the English contraction //isn't//, even in a longer expression (//aren't I//). However, a basic query for //is// or //n't// will not show concordances including the form //isn't//.))
  
 === Query for a lemma and a morphological tag ===