
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze9 [2016/07/11 11:08] – [Texts in the corpus] Alexandr Rosenen:cnk:intercorp:verze9 [2019/10/06 20:43] (current) – [Taggers/lemmatizers:] Michal Škrabal
Line 25: Line 25:
 After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus. After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
  
-InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial in Czech is available [[kurz:uvod|here]].+InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].
  
 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
  
 A new release of InterCorp is usually published once a year. With each new release, the corpus may grow in size and possibly also in the number of languages and in the extent and quality of annotation. Previous versions, starting with release 6, remain available. A new release of InterCorp is usually published once a year. With each new release, the corpus may grow in size and possibly also in the number of languages and in the extent and quality of annotation. Previous versions, starting with release 6, remain available.
- 
  
 ===== References ===== ===== References =====
Line 56: Line 55:
   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
  
-These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.+These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
  
  
Line 70: Line 69:
  
 ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^
-|  ar  | Arabic | 34 |  0 |  0 |  0 |  0 |  0 |  34 | +|  ar  | Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  34 | 
-|  be  | Belarusian |  3 025 |  0 |  0 |  0 |  0 |  0 |  3 025 | +|  be  | Belarusian |  3,025 |  0 |  0 |  0 |  0 |  0 |  3,025 | 
-|  bg  | Bulgarian |  6 007 |  0 |  0 |  13 816 |  9 083 |  0 |  28 907 | +|  bg  | Bulgarian |  6,007 |  0 |  0 |  13,816 |  9,083 |  0 |  28,907 | 
-|  ca  | Catalan |  4 632 |  0 |  0 |  0 |  0 |  0 |  4 632 | +|  ca  | Catalan |  4,632 |  0 |  0 |  0 |  0 |  0 |  4,632 | 
-|  da  | Danish |  3 556 |  0 |  0 |  21 679 |  13 915 |  14 429 |  53 581 | +|  da  | Danish |  3,556 |  0 |  0 |  21,679 |  13,915 |  14,429 |  53,581 | 
-|  de  | German |  31 168 |  3 725 |  2 482 |  21 723 |  13 089 |  8 366 |  80 556 | +|  de  | German |  31,168 |  3,725 |  2,482 |  21,723 |  13,089 |  8,366 |  80,556 | 
-|  el  | Greek |  0 |  0 |  0 |  25 069 |  15 403 |  23 714 |  64 187 | +|  el  | Greek |  0 |  0 |  0 |  25,069 |  15,403 |  23,714 |  64,187 | 
-|  en  | English |  21 208 |  3 818 |  2 670 |  24 207 |  15 580 |  52 101 |  119 586 | +|  en  | English |  21,208 |  3,818 |  2,670 |  24,207 |  15,580 |  52,101 |  119,586 | 
-|  es  | Spanish |  19 310 |  4 324 |  2 816 |  27 001 |  15 885 |  36 378 |  105 716 | +|  es  | Spanish |  19,310 |  4,324 |  2,816 |  27,001 |  15,885 |  36,378 |  105,716 | 
-|  et  | Estonian |  0 |  0 |  0 |  15 962 |  10 899 |  10 296 |  37 158 | +|  et  | Estonian |  0 |  0 |  0 |  15,962 |  10,899 |  10,296 |  37,158 | 
-|  fi  | Finnish |  3 645 |  0 |  0 |  16 455 |  10 175 |  15 097 |  45 373 | +|  fi  | Finnish |  3,645 |  0 |  0 |  16,455 |  10,175 |  15,097 |  45,373 | 
-|  fr  | French |  12 406 |  4 393 |  2 928 |  27 351 |  17 178 |  25 961 |  90 219 | +|  fr  | French |  12,406 |  4,393 |  2,928 |  27,351 |  17,178 |  25,961 |  90,219 | 
-|  he  | Hebrew |  0 |  0 |  0 |  0 |  0 |  16 221 |  16 221 |+|  he  | Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  16,221 |
 |  hi  | Hindi |  408 |  0 |  0 |  0 |  0 |  0 |  408 | |  hi  | Hindi |  408 |  0 |  0 |  0 |  0 |  0 |  408 |
-|  hr  | Croatian |  19 980 |  0 |  0 |  0 |  0 |  19 042 |  39 023 | +|  hr  | Croatian |  19,980 |  0 |  0 |  0 |  0 |  19,042 |  39,023 | 
-|  hu  | Hungarian |  5 818 |  0 |  0 |  19 176 |  12 306 |  21 239 |  58 541 | +|  hu  | Hungarian |  5,818 |  0 |  0 |  19,176 |  12,306 |  21,239 |  58,541 | 
-|  is  | Icelandic |  0 |  0 |  0 |  0 |  0 |  1 584 |  1 584 | +|  is  | Icelandic |  0 |  0 |  0 |  0 |  0 |  1,584 |  1,584 | 
-|  it  | Italian |  8 694 |  651 |  2 707 |  24 849 |  15 489 |  14 653 |  67 046 |+|  it  | Italian |  8,694 |  651 |  2,707 |  24,849 |  15,489 |  14,653 |  67,046 |
 |  ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 | |  ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 |
-|  lt  | Lithuanian |  358 |  0 |  0 |  18 392 |  11 212 |  557 |  30 521 | +|  lt  | Lithuanian |  358 |  0 |  0 |  18,392 |  11,212 |  557 |  30,521 | 
-|  lv  | Latvian |  1 336 |  0 |  0 |  18 709 |  11 682 |  279 |  32 007 | +|  lv  | Latvian |  1,666 |  0 |  0 |  24,667 |  13,895 |  381 |  40,609 | 
-|  mk  | Macedonian |  4 663 |  0 |  0 |  0 |  0 |  1 877 |  6 540 | +|  mk  | Macedonian |  4,663 |  0 |  0 |  0 |  0 |  1,877 |  6,540 | 
-|  ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3 520 |  3 520 | +|  ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3,520 |  3,520 | 
-|  mt  | Maltese |  0 |  0 |  0 |  14 133 |  0 |  0 |  14 133 | +|  mt  | Maltese |  0 |  0 |  0 |  14,133 |  0 |  0 |  14,133 | 
-|  nl  | Dutch |  11 444 |  314 |  2 955 |  24 746 |  15 563 |  29 362 |  84 386 | +|  nl  | Dutch |  11,444 |  314 |  2,955 |  24,746 |  15,563 |  29,362 |  84,386 | 
-|  no  | Norwegian |  4 965 |  0 |  0 |  0 |  0 |  0 |  4 965 | +|  no  | Norwegian |  4,965 |  0 |  0 |  0 |  0 |  0 |  4,965 | 
-|  pl  | Polish |  21 433 |  0 |  2 378 |  20 627 |  12 811 |  26 572 |  83 822 | +|  pl  | Polish |  21,433 |  0 |  2,378 |  20,627 |  12,811 |  26,572 |  83,822 | 
-|  pt  | Portuguese |  2 605 |  369 |  2 999 |  28 602 |  16 484 |  43 391 |  94 454 |+|  pt  | Portuguese |  2,605 |  369 |  2,999 |  28,602 |  16,484 |  43,391 |  94,454 |
 |  rn  | Romani |  5 |  0 |  0 |  0 |  0 |  0 |  5 | |  rn  | Romani |  5 |  0 |  0 |  0 |  0 |  0 |  5 |
-|  ro  | Romanian |  3 432 |  0 |  2 737 |  8 199 |  9 446 |  34 128 |  57 944 | +|  ro  | Romanian |  3,432 |  0 |  2,737 |  8,199 |  9,446 |  34,128 |  57,944 | 
-|  ru  | Russian |  4 788 |  3 174 |  0 |  0 |  0 |  6 885 |  14 848 | +|  ru  | Russian |  4,788 |  3,174 |  0 |  0 |  0 |  6,885 |  14,848 | 
-|  sk  | Slovak |  8 066 |  0 |  0 |  19 222 |  12 734 |  5 134 |  45 158 | +|  sk  | Slovak |  8,066 |  0 |  0 |  19,222 |  12,734 |  5,134 |  45,158 | 
-|  sl  | Slovenian |  2 057 |  0 |  0 |  19 645 |  12 240 |  17 024 |  50 968 | +|  sl  | Slovenian |  2,057 |  0 |  0 |  19,645 |  12,240 |  17,024 |  50,968 | 
-|  sq  | Albanian |  0 |  0 |  0 |  0 |  0 |  2 003 |  2 003 | +|  sq  | Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  2,003 | 
-|  sr  | Serbian |  9 886 |  0 |  0 |  0 |  0 |  20 720 |  30 607 | +|  sr  | Serbian |  9,886 |  0 |  0 |  0 |  0 |  20,720 |  30,607 | 
-|  sv  | Swedish |  8 959 |  0 |  0 |  20 585 |  13 840 |  14 693 |  58 079 | +|  sv  | Swedish |  8,959 |  0 |  0 |  20,585 |  13,840 |  14,693 |  58,079 | 
-|  tr  | Turkish |  0 |  0 |  0 |  0 |  0 |  21 190 |  21 190 | +|  tr  | Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  21,190 | 
-|  uk  | Ukrainian |  7 597 |  0 |  0 |  0 |  0 |  246 |  7 843 | +|  uk  | Ukrainian |  7,597 |  0 |  0 |  0 |  0 |  246 |  7,843 | 
-|  vi  | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1 473 |  1 473 | +|  vi  | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,473 |  1,473 | 
-| **Subtotal** |  |  231 501 |  20 769 |  24 676 |  430 160 |  265 022 |  488 266 |  1 460 397 | +| **Subtotal** |  |  231,501 |  20,769 |  24,676 |  430,160 |  265,022 |  488,266 |  1,460,397 | 
-|  cs  | Czech |  96 956 |  3 416 |  2 315 |  20 303 |  12 922 |  50 688 |  186 602 | +|  cs  | Czech |  96,956 |  3,416 |  2,315 |  20,303 |  12,922 |  50,688 |  186,602 | 
-| **TOTAL** |  |  328 458 |  24 186 |  26 991 |  450 463 |  277 945 |  538 954 |  1 647 000 |+| **TOTAL** |  |  328,458 |  24,186 |  26,991 |  450,463 |  277,945 |  538,954 |  1,647,000 |
  
 N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
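
As a quick consistency check on the table, the following Python sketch compares the stated TOTAL row with the sum of the Subtotal and Czech rows. The figures are copied from those three rows as given above. Reading them as thousands of words is an assumption based on the totals quoted in the text, and the variable and function names are illustrative only.

```python
# Figures copied from the Subtotal, Czech and TOTAL rows of the table
# (assumed to be in thousands of words).
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis", "Europarl", "Subtitles", "Total"]
SUBTOTAL = [231_501, 20_769, 24_676, 430_160, 265_022, 488_266, 1_460_397]
CZECH = [96_956, 3_416, 2_315, 20_303, 12_922, 50_688, 186_602]
TOTAL = [328_458, 24_186, 26_991, 450_463, 277_945, 538_954, 1_647_000]


def column_gaps(subtotal, czech, total):
    """Return, per column, the stated total minus (subtotal + Czech)."""
    return [t - (s + c) for s, c, t in zip(subtotal, czech, total)]


# Map each column name to its gap so that deviations are easy to inspect.
gaps = dict(zip(COLUMNS, column_gaps(SUBTOTAL, CZECH, TOTAL)))

for name, gap in gaps.items():
    print(f"{name:<10} {gap:+d}")
```

With these figures no column differs by more than 1, which would be consistent with the rows having been rounded independently.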
Line 123: Line 122:
 ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   | ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
 ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  | ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
-^ Dutch |  ✔  |         |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |+^ Dutch |  ✔  |       [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ English |  ✔  |  ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ English |  ✔  |  ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|Enlish]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |+^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|English]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |
 ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]] %%**%%)  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |+^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]((Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.))  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
 ^ Hungarian |  ✔  |          [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  | ^ Hungarian |  ✔  |          [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  |
 ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  | ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  |
Line 143: Line 142:
 ^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  | ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  |
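
The footnote to the Finnish row gives two examples of how a condensed tag corresponds to explicit category labels. A minimal Python sketch of that correspondence follows. The function name is hypothetical, and the category lists are taken only from the two footnote examples. The order of categories for other parts of speech is not documented here, so the caller has to supply it.

```python
def expand_condensed_tag(tag, categories):
    """Pair each colon-separated value of a condensed tag with a category name.

    `categories` must be supplied by the caller in the order used by the tag;
    this sketch does not know the tagset itself.
    """
    values = tag.split(":")
    if len(values) != len(categories):
        raise ValueError(
            f"{tag!r} has {len(values)} values, expected {len(categories)}"
        )
    # Values are upper-cased to match the notation used in the footnote.
    return " ".join(f"[{cat}={val.upper()}]" for cat, val in zip(categories, values))


# The two examples from the footnote.
verb = expand_condensed_tag(
    "V:Sg:Nom:Act:PrfPrc:Pos",
    ["POS", "NUM", "CASE", "VOICE", "PCP", "CMP"],
)
pron = expand_condensed_tag(
    "Pron:Pers:Sg:Ade:Up",
    ["POS", "SUBCAT", "NUM", "CASE", "CASECHANGE"],
)
print(verb)
print(pron)
```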
 +
  
 Queries for contracted forms in tagged or lemmatized texts may fail. This applies to forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//), each with its own lemma and tag. The same holds for the Polish forms //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be typed in as a **Phrase**, with the split parts separated by a space, because only the individual parts of the contracted form are assigned a tag and a lemma. Queries for contracted forms in tagged or lemmatized texts may fail. This applies to forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//), each with its own lemma and tag. The same holds for the Polish forms //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be typed in as a **Phrase**, with the split parts separated by a space, because only the individual parts of the contracted form are assigned a tag and a lemma.
  
 Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$". Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$".
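
The two query notes above can be sketched in Python. This is only an illustration: the helper names are hypothetical and are not part of KonText, and the set of characters treated as special is an assumption covering common regular-expression metacharacters.

```python
import re

# Characters assumed to have a special meaning in regular expressions.
_SPECIAL = re.compile(r'([\\.^$*+?()\[\]{}|])')


def escape_tag(tag):
    """Put a backslash before every regex metacharacter in a literal tag."""
    return _SPECIAL.sub(r'\\\1', tag)


def tag_query(tag):
    """Build a tag="..." condition that matches the tag literally."""
    return f'tag="{escape_tag(tag)}"'


def contracted_phrase(parts):
    """Join the parts of a split contracted form for a Phrase query."""
    return " ".join(parts)


print(tag_query("wp$"))                   # tag="wp\$"
print(contracted_phrase(["ca", "n't"]))   # ca n't
```

The escaping is written out explicitly rather than delegated to ''re.escape'', which also escapes characters such as the hyphen and the space that do not need a backslash here.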
- 
- 
 ====Structural attributes==== ====Structural attributes====
  
Line 195: Line 193:
 ==== Pre-processing ==== ==== Pre-processing ====
  
-  * parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička+  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
   * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]   * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
   * Sentence splitter for Czech by Pavel Květoň   * Sentence splitter for Czech by Pavel Květoň
Line 214: Line 212:
   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)
   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
 +  * [[https://github.com/uzh/reldi/tree/master/tools/tagger|ReLDI tagger]] for Croatian and Serbian (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]]) 
 +  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)