
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze13ud [2022/06/13 19:30] – [Corpus Search] Alexandr Rosen
en:cnk:intercorp:verze13ud [2023/04/03 16:42] (current) – [Texts in the corpus] Alexandr Rosen
Line 23: Line 23:
 InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus.  A tutorial is available [[kurz:uvod|in Czech]], for one of the ICNC corpora also [[en:kurz:uvod|in English]] and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary also in English]].
  
-After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[martin.vavrin@ff.cuni.cz|Martin Vavřín]] if you are interested.
+After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[alexandr.rosen@ff.cuni.cz|Alexandr Rosen]] if you are interested.
  
 A new release of InterCorp is usually published once a year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the [[https://universaldependencies.org|Universal Dependencies]] scheme.
Line 30: Line 30:
  
   * In release 13ud, out of the total number of 41 languages (including Czech), **36 are linguistically annotated**; in addition, all such languages are **syntactically annotated**.
-  * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org | Universal Dependencies]]).
+  * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]).
-  * General guidelines for annotation are provided on the UD project website ([[https://universaldependencies.org/guidelines.html|UD Guidelines]]), including a detailed description of
+  * For a detailed description of UD as used in the annotation of InterCorp see [[en:pojmy:ud|Universal Dependencies]].
-    * word types ([[https://universaldependencies.org/u/pos/index.html|Universal POS tags]])
-    * morphological categories ([[https://universaldependencies.org/u/feat/index.html|Universal features]])
-    * syntactic functions ([[https://universaldependencies.org/u/dep/index.html|Universal Dependency Relations]])
   * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, i.e. all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/IUDPipe]]. Annotation of this release used the following models: arabic-padt-ud-2.6-200830,
 belarusian-hse-ud-2.6-200830,
Line 71: Line 68:
 ukrainian-iu-ud-2.6-200830,
 vietnamese-vtb-ud-2.6-200830.))
-  * In other releases of InterCorp, word class and morphological categories of a word are specified as the value of the ''tag'' attribute. For most languages, InterCorp release 13ud retains these language-specific tags in the ''xpos'' attribute. However, the UD **word class** and **morphological categories**, denoted uniformly for all languages, are listed separately as values of the ''upos'' and ''feats'' attributes (see below [[en:cnk:intercorp:verze13ud#parts_of_speech|Parts of speech]], and [[en:cnk:intercorp:verze13ud#other_categories|Other categories]], respectively). Frequently used morphological categories from the ''feats'' list have been promoted to the status of regular attributes at the same level as ''upos''. This applies, for example, to morphological case, number, gender or person (''case'', ''number'', ''gender'', ''person'').  
-  * For use in KonText, **fused forms** or //aggregates//, i.e. word forms composed of two or even three syntactic words, were modified as divided tokens. In English this concerns, for example, the forms //isn't// or //cannot//. For more details see [[en:cnk:intercorp:verze13ud#multi-part_tokens|Multi-part tokens]] below.
-  * Each word is assigned its **syntactic function** (''deprel'' – see [[en:cnk:intercorp:verze13ud#syntactic_functions|Syntactic functions]]) and its syntactic governor in the dependency tree (''head''). To facilitate orientation in the syntactic structure, each word is also annotated with references to important properties of its head (lemma, part of speech and morphological categories), see [[en:cnk:intercorp:verze13ud#references_to_syntactic_heads|References to syntactic head]]. If a content word occurs with a **function word** (e.g. preposition, auxiliary verb, subordinate conjunction, determiner), the content word includes some properties of the function word (see [[en:cnk:intercorp:verze13ud#references_to_function_words|References to function words]]).
-  * **Annotations between languages differ** in the number of categorial attributes and in links to function words, see {{cnk:intercorp:ud_ic_attributes.pdf | List of attributes by language}}, described below in [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#description_of_the_list_of_attributes|Description of the list of attributes]].
-  * KonText supports queries by word class and other morphological categories using the **''Insert tag''** function, which inserts a UD POS (''upos'') and any category from the ''feats'' list into the query. The ''Insert tag'' feature is available for all linguistically annotated languages.
- 
 ===== Texts in the corpus ===== ===== Texts in the corpus =====
  
Line 89: Line 80:
   * Translations of the Bible   * Translations of the Bible
  
-These texts have been aligned automatically: search results may include a higher number of misaligned segments. Morevore, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
+These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
 Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
Line 119: Line 110:
 ^  hr  ^ Croatian |  21,923 |  0 |  0 |  0 |  0 |  19,048 |  571 |  41,543 |
 ^  hu  ^ Hungarian |  6,444 |  0 |  0 |  17,852 |  12,198 |  21,115 |  0 |  57,609 |
-^  is  ^ Icelandic |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 |
+^  //is//  //Icelandic// |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 |
 ^  it  ^ Italian |  14,525 |  1,252 |  2,747 |  23,771 |  15,494 |  14,700 |  684 |  73,174 |
 ^  ja  ^ Japanese |  2,189 |  0 |  0 |  0 |  0 |  477 |  0 |  2,666 |
 ^  lt  ^ Lithuanian |  421 |  0 |  0 |  17,316 |  11,213 |  558 |  471 |  29,979 |
 ^  lv  ^ Latvian |  2,646 |  0 |  0 |  17,522 |  11,682 |  280 |  537 |  32,667 |
-^  mk  ^ Macedonian |  8,881 |  0 |  0 |  0 |  0 |  1,877 |  0 |  10,758 |
+^  //mk//  //Macedonian// |  8,881 |  0 |  0 |  0 |  0 |  1,877 |  0 |  10,758 |
-^  ms  ^ Malay |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 |
+^  //ms//  //Malay// |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 |
 ^  mt  ^ Maltese |  0 |  0 |  0 |  13,935 |  0 |  0 |  0 |  13,935 |
 ^  nl  ^ Dutch |  16,216 |  813 |  2,953 |  23,416 |  15,558 |  29,373 |  717 |  89,045 |
Line 131: Line 122:
 ^  pl  ^ Polish |  26,200 |  0 |  2,380 |  19,604 |  12,817 |  26,576 |  583 |  88,161 |
 ^  pt  ^ Portuguese |  4,981 |  554 |  2,782 |  24,598 |  15,193 |  41,468 |  706 |  90,282 |
-^  rn  ^ Romani |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 |
+^  //rn//  //Romani// |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 |
 ^  ro  ^ Romanian |  4,219 |  0 |  2,738 |  8,092 |  9,446 |  34,128 |  0 |  58,622 |
 ^  ru  ^ Russian |  8,642 |  3,984 |  0 |  0 |  0 |  6,887 |  565 |  20,078 |
 ^  sk  ^ Slovak |  8,543 |  0 |  0 |  18,399 |  12,727 |  5,133 |  561 |  45,363 |
 ^  sl  ^ Slovene |  3,871 |  0 |  0 |  18,528 |  12,251 |  17,061 |  0 |  51,711 |
-^  sq  ^ Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 |
+^  //sq//  //Albanian// |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 |
 ^  sr  ^ Serbian |  11,582 |  0 |  0 |  0 |  0 |  20,727 |  0 |  32,308 |
 ^  sv  ^ Swedish |  15,790 |  0 |  0 |  19,542 |  13,784 |  14,666 |  638 |  64,419 |
Line 147: Line 138:
 ^ **TOTAL**  ^|  441,725 |  31,967 |  26,968 |  425,543 |  276,772 |  539,774 |  12,066 |  1,754,815 |
  
-N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
+N.B. 1: Languages printed in //italics// have no linguistic annotation.
- +
-===== Morphological annotation ===== +
- +
-==== Parts of speech ==== +
- +
-  * In UD, part of speech is listed **separately from other categories** as the value of the **''upos''** attribute.  +
-  * Parts of speech given in ''upos'' are the **same for all languages**. +
-  * In addition to ''upos'', most languages provide a **language-specific morphological tag**, as the value of the **''xpos''** attribute. The ''xpos'' value is usually identical to a corresponding tag from the other, non-UD-based versions of InterCorp.  +
- +
- +
-^ upos ^ gloss ^ +
-| ADJ | [[ https://universaldependencies.org/u/pos/ADJ.html |  adjective ]] | +
-| ADP | [[ https://universaldependencies.org/u/pos/ADP.html |  adposition (incl. preposition) ]] | +
-| ADV | [[ https://universaldependencies.org/u/pos/ADV.html |  adverb ]] | +
-|AUX | [[ https://universaldependencies.org/u/pos/AUX.html |  auxiliary verb ]] | +
-|CCONJ | [[ https://universaldependencies.org/u/pos/CCONJ.html | coordinating conjunction ]] |
-|DET | [[ https://universaldependencies.org/u/pos/DET.html |  determiner ]] | +
-|INTJ | [[ https://universaldependencies.org/u/pos/INTJ.html |  interjection]] | +
-|NOUN | [[ https://universaldependencies.org/u/pos/NOUN.html |  noun ]] | +
-|NUM | [[ https://universaldependencies.org/u/pos/NUM.html |  numeral ]] | +
-|PART | [[ https://universaldependencies.org/u/pos/PART.html |  particle]] | +
-|PRON | [[ https://universaldependencies.org/u/pos/PRON.html |  pronoun ]] | +
-|PROPN | [[ https://universaldependencies.org/u/pos/PROPN.html |  proper noun ]] | +
-|PUNCT | [[ https://universaldependencies.org/u/pos/PUNCT.html |  punctuation]] | +
-|SCONJ | [[ https://universaldependencies.org/u/pos/SCONJ.html |  subordinating conjunction ]] | +
-|SYM | [[ https://universaldependencies.org/u/pos/SYM.html |  symbol ]] | +
-|VERB | [[ https://universaldependencies.org/u/pos/VERB.html |  verb ]] | +
-|X | [[ https://universaldependencies.org/u/pos/X.html |  other]] | +
- +
- +
-==== Other categories ==== +
- +
-  * Other categories are embedded under the **''feats''** attribute. Their choice and values are determined by part of speech and language.  +
-  * Each category is listed as a "<category name>=<category value>" pair, e.g. ''Number=Sing''.
-  * Identical or comparable morphological categories and their values are called the same in all languages.  +
-  * A list of such pairs is the value of the ''feats'' attribute. +
-  * Categories in the ''feats'' attribute are separated by "|", e.g. the Russian form //школы// /'ʂkolɨ/ 'school' in genitive singular is marked as ''feats=%%"Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"%%''.  +
-  * In an advanced query using the CQL query language each category can be specified separately: the Czech form //moře// 'sea' is one of the answers to the query ''%%[upos="NOUN" & feats="Number=Sing"]%%''. The Russian form is found by the query ''[upos=%%"NOUN"%% & feats=%%"Gender=Fem"%% & feats=%%"Case=Gen"%%]''. The order of categories in the query is irrelevant.
-  * The value of ''feats'' can also be treated as a string of characters using regular expressions, e.g. ''[upos=%%"NOUN"%% & feats=%%".*Case=Gen.*Gender=Fem.*"%%]''. Here the order of categories in the query should correspond to their order in the corpus. The result is the same in both cases. +
-  * Some of the categories in ''feats'' are listed also outside the list as **categorial attributes** at the same level as ''upos''. As a result, a query for a singular noun can be simply as follows: ''%%[upos="NOUN" & number="Sing"]%%''. Similarly, the query  for the Russian form ''[upos=%%"NOUN"%% & %%gender="Fem"%% & %%case="Gen"%%]'' gives the same result as the two queries above. Categorial attributes can be also used to generate frequency lists.((Note that for technical reasons the names of the categorial attributes are all in lower case, including names such as ''VerbForm'' (in ''feats''), rendered as ''verb_form'', or ''NumType'', rendered as ''num_type''. The attribute values, such as ''Fem'', retain the initial upper case character, but are enclosed in double quotes, like other non-embedded attributes.))  Such attributes appear on the <fc #f4a460>light brown</fc> background in {{cnk:intercorp:ud_ic_atributy.pdf|Attribute list by language}} or in KonText in the lower part of the table shown in ''View'' / ''Corpus-specific settings...''+
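As a rough illustration of the ''feats'' format described above, the following Python sketch splits a value such as the one given for //школы// into category/value pairs and derives the lower-case names of the categorial attributes. The helper names ''parse_feats'' and ''categorial_name'' are hypothetical and belong neither to KonText nor to any UD tool; the snake_case conversion is an assumption generalised from the two examples ''verb_form'' and ''num_type''.

```python
import re

def parse_feats(feats: str) -> dict:
    """Split a UD feats string such as 'Case=Gen|Number=Sing' into a dict."""
    # Assumption: an empty value or the CoNLL-U placeholder '_' means "no features".
    if feats in ("", "_"):
        return {}
    # Pairs are separated by '|'; each pair is '<category>=<value>'.
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def categorial_name(category: str) -> str:
    """Assumed mapping from a feats category to its lower-case attribute name."""
    # Insert '_' at every lower-case/upper-case boundary, then lower-case the result.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", category).lower()

feats = parse_feats("Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing")
print(feats["Case"], feats["Gender"])
print(categorial_name("VerbForm"), categorial_name("NumType"))
```

Because the pairs end up in a dictionary, the order of categories does not matter for lookup, which mirrors the order-independent query with several ''feats'' conditions rather than the order-sensitive regular expression.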
- +
-^ category ^ gloss ^ example values ^ +
-| Abbr | [[ https://universaldependencies.org/u/feat/Abbr.html |abbreviation]] |Yes | +
-| Animacy | [[ https://universaldependencies.org/u/feat/Animacy.html |animacy]] |Anim, Inan, Hum, Nhum | +
-| Aspect | [[ https://universaldependencies.org/u/feat/Aspect.html |aspect]] |Imp, Perf, Hab, Iter, Prog, Prosp | +
-| Case | [[ https://universaldependencies.org/u/feat/Case.html |case]] |Nom, Gen, Dat, Acc, Voc, Loc, Ins, ... | +
-| Definite | [[ https://universaldependencies.org/u/feat/Definite.html |definiteness]] |Ind, Def, ... | +
-| Degree | [[ https://universaldependencies.org/u/feat/Degree.html |degree]] |Pos, Cmp, Sup, Equ, Abs | +
-| Foreign | [[ https://universaldependencies.org/u/feat/Foreign.html |foreign word]] |Yes | +
-| Gender | [[ https://universaldependencies.org/u/feat/Gender.html |gender]] |Fem, Masc, Neut, Com | +
-| Mood | [[ https://universaldependencies.org/u/feat/Mood.html |mood]] |Ind, Imp, Cnd, ... | +
-| NumType | [[ https://universaldependencies.org/u/feat/NumType.html |numeral type]] |Card, Ord, Mult, Frac, Sets, ... | +
-| Number | [[ https://universaldependencies.org/u/feat/Number.html |number]] |Sing, Plur, Dual, Ptan, Coll, ... | +
-| Person | [[ https://universaldependencies.org/u/feat/Person.html |person]] |1, 2, 3, ... | +
-| Polarity | [[ https://universaldependencies.org/u/feat/Polarity.html |polarity]] |Neg, Pos | +
-| Polite | [[ https://universaldependencies.org/u/feat/Polite.html |politeness]] |Infm, Form, Elev, Humb | +
-| Poss | [[ https://universaldependencies.org/u/feat/Poss.html |possessiveness]] |Yes | +
-| PronType | [[ https://universaldependencies.org/u/feat/PronType.html |type of pronoun etc.]] |Prs, Rcp, Art, Int, Rel, Exc, Dem, Emp, Tot, Ind | +
-| Reflex | [[ https://universaldependencies.org/u/feat/Reflex.html |reflexiveness]] |Yes | +
-| Tense | [[ https://universaldependencies.org/u/feat/Tense.html |tense]] |Pres, Past, Fut, Pqp, Imp | +
-| Typo | [[ https://universaldependencies.org/u/feat/Typo.html |typo]] |Yes | +
-| VerbForm | [[ https://universaldependencies.org/u/feat/VerbForm.html |verb form]] |Fin, Inf, Part, Conv, Ger, Vnoun, Sup | +
-| Voice | [[ https://universaldependencies.org/u/feat/Voice.html |voice]] |Act, Pass, Mid, Cau, ... | +
- +
- +
-==== Multi-part tokens ==== +
- +
-  * Some tokens, in the UD parlance called **fused words**, or **aggregates** in some Czech corpus-related literature, consist of multiple parts. These parts correspond to different nodes in the syntactic structure. In English, such tokens represent **contractions**, consisting of a verb and the negative particle such as //isn't// or //cannot//.  +
-  * The orthographic form of these words is preserved in the corpus; the individual parts are separated only in the annotation, e.g. in the value of the ''lemma'' attribute, with the "|" sign as the separator. It is therefore possible to search for them like other words, either by typing the full form into the search box in a simple query (e.g. //ses// in Czech, //can't// in English or //byłbym// in Polish), or by giving the same string as the value of the **''word''** attribute in an advanced query using the CQL query language.
-  * In some languages, including English and Czech, a part of the fused token has a different form when occurring in a different context as an orthographically separate word. E.g. //n't//, a part of //isn't//, corresponds to //not//, and the Czech auxiliary clitic //s//, a part of //ses//, corresponds to //jsi//. Both variants are represented in the annotation: the **''iword''** attribute shows the original form ''is|n't'' or ''se|s'', while the **''sword''** attribute shows the unabbreviated, "reconstructed" version: ''is|not'' or ''se|jsi''.((Aggregates are present in the following languages: ar, ca, cs, de, el, en, es, fi, fr, he, it, pl, pt, tr and uk. A list of all aggregates for a given language is displayed as the frequency distribution of word forms following the query %%[sword = ".|.+"]%%.))
-  * In addition to the English tokens //isn't// (''is|n't'' – ''is|not'') or //cannot// (''can|not''),((The first form, preceding the dash, is the original form, i.e. the value of the ''iword'' attribute, the second form, after the dash, is the reconstructed form, i.e. the value of the ''sword'' attribute. If a parenthesis includes just one form, the two options are identical, or the given language does not provide reconstructed forms.)) in Czech there are tokens such as //abychom// (''a|bychom'' – ''aby|bychom''), //bylas// (''byla|s'' – ''byla|jsi'') or //oč// (''o|č'' – ''o|co''); in German //zur// (''zu|r'' – ''zu|der'') or //am// (''a|m'' – ''an|dem''); in Polish //miałam// (''miała|m''), //żebyś// (''że|by|ś'') or //chciałbym// (''chciał|by|m''); in French //des// (''de|s'' – ''de|les''), //aux// (''au|x'' – ''à|les'') or //auquel// (''au|quel'' – ''à|lequel''). +
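To make the relation between the two attributes concrete, here is a small Python sketch that pairs the parts of an ''iword'' value with the parts of the corresponding ''sword'' value. The function name ''align_parts'' is made up for this illustration, and the sketch assumes that both values contain the same number of "|"-separated parts, as in the examples above.

```python
def align_parts(iword: str, sword: str) -> list:
    """Pair original and reconstructed parts of a multi-part token."""
    original = iword.split("|")
    reconstructed = sword.split("|")
    # Assumption: both attributes always split into the same number of parts.
    if len(original) != len(reconstructed):
        raise ValueError("iword and sword have different numbers of parts")
    return list(zip(original, reconstructed))

# Example values taken from the text above (English, Czech, German).
for iword, sword in [("is|n't", "is|not"), ("se|s", "se|jsi"), ("zu|r", "zu|der")]:
    print(iword, "->", align_parts(iword, sword))
```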
- +
-===== Syntactic annotation ===== +
- +
-==== Syntactic functions ==== +
- +
-  * Each token specifies its syntactic function, i.e. dependency relation (''deprel'') and a reference to its syntactic governor (''head''). +
-  * The table below distinguishes four types of syntactic functions by different typeface: +
-     * Common deprels are listed in **bold**. +
-     * Deprels of function words are listed in //**bold italics**//+
-     * Deprels for representing coordination and similar phenomena in the dependency structure or for a technical purpose are set in //italics//+
-     * Deprels not used in English are listed <fc #c0c0c0>in gray</fc>.
-  * In some languages, some deprels may have **subtypes**. The subtype name follows the colon after the deprel name, e.g. ''acl:relcl'' indicates an attribute expressed by a relative clause. The list below contains only subtypes relevant to English and represented in the corpus. Functions with subtypes for all languages are listed at [[https://universaldependencies.org/u/dep/index.html|Universal Dependency Relations]]. +
-  * When querying a deprel that may have a subtype, a possible subtype should be taken into account. For example, to find all words with the deprel ''acl'', whether or not the deprel has a subtype, use the expression ''%%deprel="acl.*"%%'' instead of ''%%deprel="acl"%%''. To find all auxiliary verbs, use the expression ''%%deprel="aux.*"%%'' instead of ''%%deprel="aux"%%''. To find all subjects, use the expression ''%%deprel="nsubj.*"%%''+
-  * When a queried deprel targets a **coordinated structure**, only the first conjunct is found. The second and subsequent conjuncts are marked as ''%%deprel="conj"%%''. The syntactic function of the entire coordination is thus specified by the ''deprel'' attribute of the first conjunct, the head of all other conjuncts. To query the "true" deprel of a non-initial conjunct (''%%deprel="conj"%%''), use the ''p_deprel'' attribute. See [[https://wiki.korpus.cz/doku.php/cnk:intercorp:verze13ud#koordinace|Coordination]] below for details.
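The effect of the wildcard in ''%%deprel="acl.*"%%'' can be imitated in Python. This is only a sketch: it assumes that an attribute value must match the regular expression as a whole (hence ''re.fullmatch''), and the short list of deprel values is a hand-picked sample from this page, not corpus output. The function name ''select'' is hypothetical.

```python
import re

# A hand-picked sample of deprel values, with and without subtypes.
deprels = ["acl", "acl:relcl", "advcl", "aux", "aux:pass", "nsubj", "nsubj:pass"]

def select(pattern: str, values: list) -> list:
    """Return the values that the regular expression matches in full."""
    return [v for v in values if re.fullmatch(pattern, v)]

print(select("acl", deprels))      # the bare deprel only
print(select("acl.*", deprels))    # the bare deprel and any subtype
print(select("nsubj.*", deprels))  # all subjects in the sample
```

A stricter pattern such as ''acl(:.*)?'' would avoid matching unrelated labels that merely begin with the same letters, should such labels exist in a given language.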
- +
- +
- +
-^ deprel ^ gloss ^ example((The constituent performing the given function is highlighted. If the constituent consists of more than one word, the constituent's governor (head word) is underlined. It is this token which is annotated by the given function.)) ^ +
-| **acl** | [[https://universaldependencies.org/u/dep/acl.html | adnominal clause, finite or non-finite]] | //The convent of the Poor Clares, **__known__ as the Minories**, was destroyed to make way for storehouses.//+
-| **acl:relcl** | [[https://universaldependencies.org/u/dep/acl-relcl.html | relative adnominal clause ]] | //London has always been a vast ocean **in which survival is not __certain__**.//+
-| **advcl** | [[https://universaldependencies.org/u/dep/advcl.html | adverbial clause ]] | //The country will pay a heavy price **if the president’s obsessions __prevail__ for long**.// | +
-| **advmod** | [[https://universaldependencies.org/u/dep/advmod.html | adverbial modifier ]] | //They were **all** corrupt opportunists. Gorshkov knew **where** that idea came from .// | +
-| **amod** | [[https://universaldependencies.org/u/dep/amod.html | adjectival modifier ]] | //The **sustainable** future of humanity is at stake.// | +
-| **appos** | [[https://universaldependencies.org/cs/dep/appos.html | apposition ]] | //They were going to a new home, **a __house__ of her choosing**.//+
-| //**aux**// | [[https://universaldependencies.org/ru/dep/aux_.html | auxiliary verb ]] | //We **have** made our voice heard by the world. It**'s** going to work. You **can't** start improvising now.// | +
-| //**aux:pass**// | [[https://universaldependencies.org/u/dep/aux-pass.html | passive auxiliary ]] | //Men like that **are** born only once. Who else should I **get** dressed up for if not her?// | +
-| //**case**// | [[https://universaldependencies.org/u/dep/case.html | case marking (incl. preposition) ]] | // Karpov**'s** own career might hang **in** the balance // | +
-| //**cc**// | [[https://universaldependencies.org/u/dep/cc.html | coordinating conjunction ]] | // I now invite you all to eat, drink, **and** make yourselves at home! // | +
-| //**cc:preconj**// | [[https://universaldependencies.org/u/dep/cc-preconj.html | preconjunct ]] | //They are poisoning **both** the water and the soil.// | +
-| **ccomp** | [[https://universaldependencies.org/u/dep/ccomp.html | clausal complement ]] | //I doubt **whether the new model is an __improvement__**.//+
-| <fc #c0c0c0>clf</fc> | [[https://universaldependencies.org/u/dep/clf.html | classifier]] | 三**个**学生 // sān **gè** xuéshēng // | +
-| //compound// | [[https://universaldependencies.org/u/dep/compound.html | compound ]] | //In Gondor **ten** thousand years would not suffice.// | +
-| //compound:prt// | [[https://universaldependencies.org/u/dep/compound-prt.html | phrasal verb particle ]] | //He laid **out** the city’s streets and rebuilt its walls.// | +
-| //conj// | [[https://universaldependencies.org/u/dep/conj.html | non-initial conjunct ]] | // You have two parents and **you always will __have__**.//+
-| //**cop**// | [[https://universaldependencies.org/u/dep/cop.html | copula ]] | //Where**'s** the rest of your luggage?// | +
-| **csubj** | [[https://universaldependencies.org/u/dep/csubj.html | clausal subject, finite or nonfinite ]] | //It's quite easy **to __clear__ up these contradictions**. But the most important thing is **you shouldn't __lose__ too much time**.// | +
-| **csubj:pass** | [[https://universaldependencies.org/u/dep/csubj-pass.html | clausal subject of passive clause]] | //**__Taking__ notes** has been banned.// | +
-| //dep// | [[https://universaldependencies.org/u/dep/dep.html | unspecified dependency ]] | // By the 1860**s**, the South was utterly flush with cash. My dad doesn't really **not that __good__**. // |  +
-| //**det**// | [[https://universaldependencies.org/u/dep/det.html | determiner]] | //**What** way they went I don’t know and **no** rabbit knows .// | +
-| //**det:predet**// | [[https://universaldependencies.org/en/dep/det-predet.html | predeterminer]] | //People get sick **all** the time.// | +
-| **discourse** | [[https://universaldependencies.org/cs/dep/discourse.html | discourse element ]] | // ‘**Yes**, **please**,’ said Ron. **Oh** **dear**, what a bore!// | +
-| **dislocated** | [[https://universaldependencies.org/en/dep/dislocated.html | dislocated elements ]] | // **Dumplings** I like.// | +
-| **expl** | [[https://universaldependencies.org/u/dep/expl.html | expletive ]] | // **There** is a ghost in the room.// |  +
-| //fixed// | [[https://universaldependencies.org/u/dep/fixed.html | non-initial parts of fixed multiword unit]] | // At **least** there's one of you brave enough! Of **course** there may be exceptions.//+
-| //flat// | [[https://universaldependencies.org/u/dep/flat.html | non-initial parts of flat multiword unit ]] | //  Let's go to San **Francisco**. What was Miss **O'Hara** up to? // | +
-| //flat:foreign// | [[https://universaldependencies.org/u/dep/flat-foreign.html | non-initial parts of flat multiword unit ]] | //During the colonial period it was called the Portal **de los Mercaderes** .// | +
-| //goeswith// | [[https://universaldependencies.org/u/dep/goeswith.html | non-initial parts of incorrectly split form ]] | // They come here with **out** legal permission. // | +
-| **iobj** | [[https://universaldependencies.org/u/dep/iobj.html | indirect object ]] | //He brought **us** eggs. Can I buy **you** a drink?// | +
-| //list// | [[https://universaldependencies.org/u/dep/list.html | non-initial parts of list ]] | //Steve Jones **tel.: 555-9814 e-mail: jones@abc.edf**//+
-| //**mark**// | [[https://universaldependencies.org/u/dep/mark.html | marker ]] | //I spent the night telling jokes **to** keep Petrik **from** falling asleep at the wheel. I just want **to** know what you are thinking about **when** you wake up.// | +
-| **nmod** | [[https://universaldependencies.org/u/dep/nmod.html | nominal modifier ]] | //Did they put some fish near **the __infant__'s** grave for his journey **into the __afterlife__** ?// | +
-| **nmod:npmod** | [[https://universaldependencies.org/en/dep/nmod-npmod.html | noun phrase as adverbial modifier ]] | //He was younger then and **a __lot__** more agile. It seemed that everyone had trembling hands and **tear**-filled eyes.//| +
-| **nmod:poss** | [[https://universaldependencies.org/en/dep/nmod-poss.html | possessive nominal modifier ]] | //Many saw it as a good thing that **her** show was taken off the air.// | +
-| **nmod:tmod** | [[https://universaldependencies.org/en/dep/nmod-tmod.html | temporal modifier ]] | //In Plenary **today** I supported the amendment.//+
-| **nsubj** | [[https://universaldependencies.org/u/dep/nsubj.html | nominal subject]] | //**Those** **who** venture upon its currents look for prosperity or fame, even if **they** often founder in its depths.// | +
-| **nsubj:pass** | [[https://universaldependencies.org/u/dep/nsubj-pass.html | nominal subject of passive clause]] | //The **horses** were adorned with just one red scarf.// |  +
-| **nummod** | [[https://universaldependencies.org/cs/dep/nummod.html | numeric modifier ]] | // Dissolution does but give birth to fresh modes of organization, and **one** death is the parent of a **thousand** lives.// | +
-| **obj** | [[https://universaldependencies.org/u/dep/obj.html | object ]] | // But who can stop the **people**? **What** do you mean? I don't know **what** to do. // | +
-| **obl** | [[https://universaldependencies.org/u/dep/obl.html | oblique nominal ]] | //We might bring an avalanche down **on __ourselves__** **for no good __reason__** .// | +
-| **obl:npmod** | [[https://universaldependencies.org/u/dep/obl.html | noun phrase as oblique nominal ]] | //I get fed up **a __little__** sometimes.//+
-| **obl:tmod** | [[https://universaldependencies.org/en/dep/obl-tmod.html | temporal modifier ]] | //I leave **tomorrow**. Tell him everything, **tonight**.//+
-| //orphan// | [[https://universaldependencies.org/u/dep/orphan.html | orphan after elided head ]] | // Mary won gold and Peter **bronze**. // | +
-| //parataxis// | [[https://universaldependencies.org/u/dep/parataxis.html | parataxis (incl. parentheticals) ]] | // "Is that the only reason?" **she __asked__**, putting her eyes close to mine. // | +
-| //punct// | [[https://universaldependencies.org/u/dep/punct.html | punctuation]] | // Máte všecko**?** // | +
-| <fc #c0c0c0>reparandum</fc> | [[https://universaldependencies.org/u/dep/reparandum.html | overridden disfluency ]] | //Go **to the __right-__** to the left.//| +
-| **root** | [[https://universaldependencies.org/u/dep/root.html | root ]] | // This was not a good **moment** in the history of English cuisine. // | +
-| **vocative** | [[https://universaldependencies.org/cs/dep/vocative.html | vocative ]] | // See you later, **Sam**.// | +
-| **xcomp** | [[https://universaldependencies.org/u/dep/xcomp.html | open clausal complement ]] | //Maria saw me **__standing__ at the mirror**.// | +
- +
-==== References to syntactic heads ==== +
- +
-  * In addition to the pointer to its head (''head'' as the word ID of the head, i.e. its word order position within the sentence, or ''parent'' as its position relative to the given word), some other attributes of the head are listed for each token: lemma (''p_lemma''), POS (''p_upos''), morphological category (''p_feats''), and syntactic function (''p_deprel''). +
-  * A token may also have attributes that specify the properties of a function word that depends on the token. For example, the lemma of a preposition is shown by the attribute ''case_lemma'', morphological categories of an auxiliary by ''aux_feats'', morphological categories of a copula by ''cop_feats'', part of speech of a determiner by ''det_upos'', lemma of a marker by ''mark_lemma''+
-  * Similar means of representing syntactic structure are used by other syntactically annotated corpora available in the KonText browser (e.g. ''syn2020''). +
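
The head-related attributes listed above can be pictured as values copied from the head onto each of its dependents. The following Python sketch is only an illustration, not the procedure used to build the corpus: the ''Token'' class and the ''add_head_attributes'' function are hypothetical names, and the sign convention for the relative position (head position minus token position) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    id: int      # word order position within the sentence (1-based)
    word: str
    lemma: str
    upos: str
    feats: str
    head: int    # id of the head, 0 for the root
    deprel: str
    extra: dict = field(default_factory=dict)  # derived head attributes


def add_head_attributes(sentence):
    """Copy selected properties of each token's head onto the token."""
    by_id = {t.id: t for t in sentence}
    for t in sentence:
        h = by_id.get(t.head)
        if h is None:  # the root has no head inside the sentence
            continue
        t.extra = {
            "p_lemma": h.lemma,
            "p_upos": h.upos,
            "p_feats": h.feats,
            "p_deprel": h.deprel,
            # Assumed convention: negative means the head precedes the token.
            "parent": h.id - t.id,
        }
    return sentence


sentence = [
    Token(1, "He", "he", "PRON", "Case=Nom", 2, "nsubj"),
    Token(2, "brought", "bring", "VERB", "Tense=Past", 0, "root"),
    Token(3, "us", "we", "PRON", "Case=Acc", 2, "iobj"),
    Token(4, "eggs", "egg", "NOUN", "Number=Plur", 2, "obj"),
]
add_head_attributes(sentence)
```

With such derived values, a dependent can be filtered by a property of its head without a second lookup, which is the role ''p_lemma'' or ''p_deprel'' play in a query.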
- +
-==== References to function words ====  +
- +
-  * According to UD, function words include auxiliary verbs, adpositions, subordinating conjunctions, conjunctions, determiners, and quantifiers. +
-  * Function words depend on the corresponding content words.   +
-  * Types of function words are specified by their syntactic function, i.e. by the value of the ''deprel'' attribute: ''aux'' (auxiliaries), ''case'' (prepositions), ''mark'' (markers), ''cop'' (copula), ''det'' (determiners), and ''clf'' (classifiers).   +
-  * For each function word the content word governor may include the function word's ''lemma'', ''upos'', ''feats'' and a more detailed specification of ''type'', e.g. ''%%aux_type="pass"%%'' (see [[https://universaldependencies.org/cs/dep/aux-pass.html|passive auxiliary]]), or ''%%det_type="numgov"%%'' (see [[https://universaldependencies.org/cs/dep/det-numgov.html|pronominal quantifier governing the case of the noun]]).  +
-  * The names of the corresponding content word attributes consist of the function word's ''deprel'' and attribute. For example, ''case_lemma'' specifies the lemma of the noun or pronoun's preposition, the ''aux_feats'' attribute of a content verb specifies morphological categories of its auxiliary. +
-  * A single content word can govern multiple function words, e.g. three for the passive present perfect conditional (//she would have been **pleased**//). The values of all the auxiliary words, separated by "''|''", then appear in the appropriate attribute. The ''feats'' attribute values from multiple auxiliary verbs dependent on a single content word are combined into a single value where some categories, such as verb form specifications, may be repeated because they come from more than one form. For example, in the sentence //who would have **guessed** that//, the ''aux_feats'' of the content verb //guessed// are composed of the feats of the auxiliary verbs //would// (''%%Mood=Ind|Person=3|Tense=Past|VerbForm=Fin%%'') and //have// (''%%VerbForm=Inf%%''). +
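
To see why a category can occur more than once in a combined value, the value can be split back into its category/value pairs. The Python sketch below is a hedged illustration: the function name ''collect_features'' is hypothetical, and the example string assumes that the feats of //would// and //have// are simply joined by "''|''" in this order.

```python
from collections import defaultdict


def collect_features(combined):
    """Group values by category in a '|'-separated feats string.

    Repeated categories are kept, so a category contributed by two
    auxiliaries ends up with two values.
    """
    values = defaultdict(list)
    for item in combined.split("|"):
        if "=" in item:
            category, value = item.split("=", 1)
            values[category].append(value)
    return dict(values)


# Assumed combined value for "who would have guessed that".
aux_feats = "Mood=Ind|Person=3|Tense=Past|VerbForm=Fin|VerbForm=Inf"
features = collect_features(aux_feats)
repeated = [cat for cat, vals in features.items() if len(vals) > 1]
```

Here only ''VerbForm'' is repeated, because both auxiliaries specify it. A query can exploit such repetition with a regular expression over the whole attribute value.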
- +
-==== Coordination ==== +
- +
-  * The first conjunct depends on the governor of the entire coordination. Its syntactic function determines the syntactic function of the whole coordination.  +
-  * The second and subsequent conjuncts always depend on the first conjunct. Their syntactic function is specified as ''conj''+
-  * Conjunctions depend on the following conjunct. Their syntactic function is ''cc''+
-  * A reference to the so-called effective head is used to identify the head regardless of whether the token is a conjunct or not, or whether it is in the initial or non-initial conjunct: the ''e_id'' attribute refers to its identifier (the sequence number of the token representing the head within the sentence), the ''eparent'' attribute to its position  relative to the token. +
-  * To find all words with a certain syntactic function, including those that are part of a coordination, use the ''p_deprel'' attribute. This attribute shows the syntactic function of the token's head. For example, a query for all objects, including coordinated ones, can be formulated using the disjunction operator (%%|%%) as follows: ''%%[deprel="obj" | deprel="conj" & p_deprel="obj"]%%''+
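
Because the coordination-aware pattern always has the same shape, it can be generated for any syntactic function. The helper below is a minimal sketch, not part of KonText: ''with_conjuncts'' is a hypothetical name, and it only assembles the query string in the form used on this page, relying on ''&'' binding more tightly than ''|''.

```python
def with_conjuncts(deprel, extra=""):
    """Build a CQL token expression that matches a syntactic function
    either directly or on a non-initial conjunct whose head has it."""
    suffix = f" & {extra}" if extra else ""
    return (
        f'[deprel="{deprel}"{suffix} | '
        f'deprel="conj" & p_deprel="{deprel}"{suffix}]'
    )


plain_query = with_conjuncts("obj")
dative_query = with_conjuncts("obj", 'case="Dat"')
print(plain_query)
print(dative_query)
```

Any additional condition has to be repeated in both branches of the disjunction, which is why the helper appends it twice.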
-===== UD and KonText ===== +
- +
-==== Corpus Search ==== +
- +
-=== Basic query === +
- +
-  * A basic query for a word form or phrase is entered in the same way as in previous releases of InterCorp.((In a basic query, it is no longer necessary in some languages to separate parts of the aggregate with a space, e.g. //był//, //by//, and  //m// of the Polish agglutinated form //byłbym // or //is// and //n't// of the English contraction //isn't//, even in a longer expression (//aren't I//). However, a basic query for //is// or //n't// will not show concordances including the form //isn't//.)) +
- +
-=== Query for a lemma and a morphological tag === +
- +
-  * As in previous releases of InterCorp, a lemma and a morphological tag can be entered in an advanced query. For most linguistically annotated languages (except be, da, en, fr, hu, no and ru) it is possible to enter a tag from a language-specific set (national tagset), usually identical to the set used in the previous releases of InterCorp for that language. Just use the ''xpos'' attribute instead of the ''tag'' attribute. E.g. a query for feminine nouns in the vocative singular in Czech can be entered as follows: %%[xpos = "NNFS5.*"]%%. +
-  * According to UD, part of speech and morphological categories are listed separately as values of the attributes ''upos'' and ''feats'', respectively. Their values can be entered using the ''Insert tag'' function.  +
-  * Parts of speech (''upos'') are the same for all languages. E.g. a query for proper names without using the ''Insert tag'' function can be specified as follows: %%[upos = "PROPN"]%%. +
-  * Other morphological categories are listed under the ''feats'' attribute. Some of them are available separately under categorial attributes. For details see [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#other_categories | Other categories]] above.  +
- +
-=== Query for a part of speech and morphological categories using the menu === +
- +
-   * When entering an advanced query, you can use the ''Insert tag'' function, which lets you select the POS and/or the values of the relevant categories (properties) from the ''feats'' list in all linguistically annotated languages. The offer of properties for a given POS is determined by their actual occurrence in the corpus, so the list may reflect incorrect combinations. +
- +
-=== Query for a syntactic function === +
- +
-   * Syntactic function is specified for each token as the value of the ''deprel'' attribute. +
-   * E.g. a query to show the occurrences of the verb //běhat// 'run' in the function of the governor of an adnominal clause is entered as %%[lemma="běhat" & deprel="acl"]%%. +
- +
- +
-==== Query results ==== +
- +
-=== Formatted text === +
- +
-  * After clicking on the keyword and ''Formatted text'' in the context box header, a concordance will appear along with the nearest context in a form that is close to the typography of the original text. For example, there are no spaces between the end of a word and punctuation, and paragraphs are separated by a blank line. +
- +
-=== Syntactic structure display === +
- +
-  * After clicking on the syntax tree icon at the beginning of each concordance line, the syntactic structure of the sentence is displayed. For each node, the word form, POS and syntactic function of the word relative to the given token are given. After clicking on the node, other annotation will appear, especially the lemma of the form. +
-  * Multi-part tokens (aggregates) are divided into multiple nodes and the word form then corresponds to the relevant part of the token (the ''iword'' attribute). After clicking on such a node, in addition to the lemma of the given part of the multi-word token, its full form (as a separate word, the ''sword'' attribute) and the word form of the entire token (''word'') also appear. +
-  * In the text line above the structure and in the structure, under the cursor the relevant strings and nodes are highlighted in parallel. +
- +
-==== Examples of queries ==== +
- +
-  * The queries assume the Czech subcorpus, except when stated otherwise. +
- +
-<code>[case_lemma="o" & case="Acc"]</code> +
-- Finds accusative nominals with the preposition //o//. The governing verbs can be listed using frequency distribution according to the attribute ''p_lemma''+
- +
-<code>[deprel="obj" & case="Dat" | deprel="conj" & p_deprel="obj" & case="Dat"]</code> +
-- Finds dative objects, even non-initial conjuncts. +
- +
-<code>[deprel="nsubj" & upos="PROPN" | deprel="conj" & p_deprel="nsubj" & upos="PROPN"]</code>  +
-- Finds proper nouns as subjects, even non-initial conjuncts. +
- +
-<code>[upos="NOUN" & case="Ins" & deprel="obj" & p_feats="VerbForm=Inf"]</code> +
-- Finds nouns in the instrumental case as objects of an infinitive. The infinitives can be listed using frequency distribution according to the attribute ''p_lemma''+
- +
-<code>[feats="Gender=Neut" & feats="Number=Sing" & feats="Tense=Past" & feats="VerbForm=Part" & upos="VERB" & aux_feats="Person=1"]</code> +
-- Finds l-participles in neuter singular used with an auxiliary verb in the first person. The query for the participle was entered using the function ''Insert tag''. The same result is obtained by the following query, which uses categorial attributes outside the ''feats'' list: +
-<code>[gender="Neut" & number="Sing" & tense="Past" & verb_form="Part" & upos="VERB" & aux_feats="Person=1"]</code> +
- +
-<code>1:[lemma="vidět|slyšet"] []* 2:[case="Acc" & deprel="obj"] []* 3:[verb_form="Inf" & deprel="xcomp"] & 2.head=1.id & 3.head=1.id within <s/></code> +
-- Finds sentences with verbs //vidět// 'see' or //slyšet// 'hear' governing an accusative object and an infinitive ''xcomp''. There can be any number of other words between these tokens, but only within the sentence. +
- +
-<code>[voice="Act" & aux_feats="Mood=Cnd" & aux_feats="Tense=Past"]</code> +
-– Finds sentences including a verb in the active voice and past conditional mood, e.g. //Kdybych si nebyl oholil knír ...// 'If I hadn't shaved my moustache...' +
- +
-<code>[voice="Pass" & aux_feats="Mood=Cnd" & aux_feats=".*Tense=Past.*Tense=Past.*"]</code> +
-– Finds sentences including a verb in the passive voice and past conditional mood, e.g. //... aféra by byla bývala ututlána.// '... the scandal would have been hushed up.'((The form of the content verb used in the periphrastic passive has an adjectival lemma, e.g. //ututlaný// 'hushed', the adjectival POS ''upos=ADJ'' and its morphological categories include the features ''%%feats="...Variant=Short|VerbForm=Part|Voice=Pass"%%''. On the other hand, reflexive passive, e.g. //oholil se// '[he] shaved himself', is annotated as ''%%feats="...Voice=Act"%%''.)) ((According to the UD guidelines, function words are immediate dependents of the relevant content word. In InterCorp 13ud, values of the ''feats'' attribute specified in multiple function words dependent on a single content word governor are concatenated into a single value. As a result, categories such as Tense can occur more than once in the value of such a ''feats'' attribute, because they originate in two or more auxiliaries, as in our example from //byla// '[she] was' and //bývala// '[she] used to be'. This double occurrence is what the query uses to target the presence of two auxiliaries. If a query looking for passive voice verbs mentioned only ''%%[aux_feats="Tense=Past"]%%'', the result would also include present conditional forms, where the l-participle (the ''%%"Tense=Past"%%'' form) occurs just once as the passive auxiliary (//... aféra by byla ututlána.// 'the scandal would be hushed up.'). ))  +
- +
-<code>[feats="VerbForm=Ger" & aux_feats="VerbForm=Fin" & aux_feats="VerbForm=Part"]</code> +
-– In English: finds sentences including continuous perfect forms (both present and past), e.g. //... has been constantly increasing in velocity//. +
- +
-===== Description of the list of attributes =====+
  
-  * In {{cnk:intercorp:ud_ic_attributes.pdf | Attribute list by language}}, all attributes used in the corpus are listed. +N.B2: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
-  * Columns indicate whether the attribute is used for the language specified by the abbreviation in the header. +
-  * Attributes are divided into four categories, distinguished by background color.+
  
-==== Basic attributes ==== 
  
-  * These 12 attributes are on the <fc #dda0dd>light purple</fc> background. +===== Acknowledgements =====
-  * They consist of the following items: word form, lemma, part of speech, morphological categories, token order in a sentence, head reference and syntactic function. +
-  * They are usually taken directly from the output of the tool [[https://ufal.mff.cuni.cz/udpipe|UDPipe]]. The format of the output is [[https://universaldependencies.org/format.html|CoNLL-U]]. +
-  * There are two added attributes: ''lc'' and ''lc_lemma'', which repeat word form and lemma without any capital letters. +
-  * For languages with multipart tokens (aggregates), there are also two additional attributes, ''sword'' and ''iword''. +
-  * The ''sword'' attribute includes the word form of the aggregate split by the "|" character into parts corresponding to syntactic words as they occur outside an aggregate, e.g. for //nač// and //abychom// the values of ''sword'' equal ''na|co'' and ''aby|bychom''+
-  * The ''iword'' attribute splits the aggregate into parts without any modification: for the tokens //nač// and //abychom//, the values of ''iword'' equal ''na|č'' and ''a|bychom''+
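
The relation between the two attributes can be made explicit by pairing their parts. The Python sketch below is an illustration under the assumption that ''sword'' and ''iword'' of one aggregate always contain the same number of "|"-separated parts; ''split_aggregate'' is a hypothetical helper name.

```python
def split_aggregate(sword, iword):
    """Pair each surface part of an aggregate (from iword) with the
    corresponding standalone syntactic word (from sword)."""
    s_parts = sword.split("|")
    i_parts = iword.split("|")
    if len(s_parts) != len(i_parts):
        raise ValueError("sword and iword must have the same number of parts")
    return list(zip(i_parts, s_parts))


nac = split_aggregate("na|co", "na|č")
abychom = split_aggregate("aby|bychom", "a|bychom")
```

For //nač//, the second surface part //č// is paired with the standalone form //co//; for //abychom//, the first surface part //a// is paired with //aby//.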
-  +
-==== Structural attributes ====+
  
-  * These 7 attributes are on the <fc #6495ed>light blue</fc> background. +We are grateful for the possibility to use the following texts and software:
-  * They extend the reference to the token's syntactic governor (''head'') by additional attributes, making it easier to identify the head and its properties. +
-  * All attributes of this type are available for all languages.+
  
-==== Function word attributes ====+==== Texts: ====
  
-  * These attributes are on the <fc #9acd32>light green</fc> background. +  * The latest (13th corrected) issue of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš
-  * They are given within the content word in order to specify the essential properties of the dependent function word. +  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen 
-  * The total number of function word attributes is 20, but no language uses them all. +  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]] 
-  * Attributes refer to 6 types of function words, determined by their syntactic function in relation to the content word. +  * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server 
-  * For each function word, the lemma, part of speech, morphological categories and subtype of the function word can be specified. +  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus 
-  * An attribute name consists of the name of the function word's syntactic function and the name of its property (attribute). +  * Proceedings of the European Parliament from the [[http://www.statmt.org/europarl/|EuroParl]] corpus 
-  * Unused or uninformative attributes are absent for the given language. There are four possible combinations which do not occur in any language. +  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]  
-  * Most languages (35) use the attribute ''case_lemma'' (lemma of an adposition, most often a preposition), followed by ''mark_lemma'' (lemma of a subordinating conjunction, in 33 languages). +  * Short stories in a number of languages [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]] from [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]  
-  * The ''clf_lemma'' (lemma of classifier) attribute only appears in Chinese. +  * A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness 
-  * If there are several function words of the same type for a content word, their values are separated by the "|" character.+  * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus  
 +  * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus  
 +  * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]] 
 +  * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]] 
  
-==== Attributes representing selected categories ====+==== Pre-processing ====
  
-  * On the <fc #f4a460>light brown</fc> background, there is a selection of 18 attributes from the ''feats'' list. +  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička 
-  * Only Latvian uses them all, while Maltese uses none. In addition to the language type, their presence or absence also depends on the availability of the category in the UD data.+  * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]] 
 +  * Sentence splitter for Czech by Pavel Květoň 
 +  * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička 
 +  * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]]
  
-===== Errors and shortcomings of linguistic annotation according to UD =====+==== Linguistic annotation ====
  
-   * POS and morphological categories do not match +[[http://ufal.mff.cuni.cz/udpipe|UDPipe]] (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)
-   * Inconsistencies in the application of the principles of uniform classification of phenomena in all languages +
-   * Errors and inconsistencies in the given language (e.g. //udělals// as a unitary token)+
  
-The quality of annotations in different languages differs mainly in the volume and quality of training data. It is also affected by the method and tool used for annotation. 
  
-We will be grateful for every reported error, discrepancy, deficiency, comment and suggestion at the address [[https://podpora.korpus.cz/projects/paralelni-korpus-intercorp|CNC user support]]. +===== How to cite =====
-Please include the abbreviation "UD" at the beginning of the message subject.+
  
-===== References =====+If you publish results based on InterCorp we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper: 
  
-==== Selection of literature about UD ====+<WRAP round info 50%> 
 +Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 
 +([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]). 
  
-Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel Zeman (2021): [[https://doi.org/10.1162/coli_a_00402|Universal Dependencies]]. In: //Computational Linguistics//, ISSN 1530-9312, vol. 47, no. 2, pp. 255-308. +For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.
  
-Daniel Zeman (2018): [[https://ufal.mff.cuni.cz/books/2018-zeman|The World of Tokens, Tags and Trees]]. ISBN 978-80-88132-09-7. +When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
  
-For a complete list, see [[https://universaldependencies.org/introduction.html#ud-related-publications|here]]. +Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 13ud of 22 December 2021//. Institute of the Czech National Corpus, Charles University, Prague 2021. Available on-line: https://kontext.korpus.cz/
  
-==== Tutorials and lectures about UD ====+</WRAP>
  
-Daniel Zeman: [[https://www.youtube.com/watch?v=xUmZ8Mxcmg0|Universal Dependencies and the Slavic Languages]]. Warsaw, 19.11.2018. 
  
-Joakim Nivre, Daniel Zeman, Filip Ginter, Francis M. Tyers: [[http://universaldependencies.org/eacl17tutorial/adding.pdf|Tutorial on Universal Dependencies: Adding a new language to UD]] 
  
-Anna Nedoluzhko, Michal Novak, Martin Popel, Zdenek Zabokrtsky and Daniel Zeman: [[https://lectures.ms.mff.cuni.cz/view.php?rec=475|Coreference meets Universal Dependencies]]. Prague, 19/04/2021. 
  
-Daniel Zeman: [[https://lectures.ms.mff.cuni.cz/view.php?rec=421|Reflexives in Universal Dependencies]]. Prague, 04/03/2019.