
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze9 [2016/06/30 16:56] – created Adrian Zasina
en:cnk:intercorp:verze9 [2019/10/06 20:43] (current) – [Taggers/lemmatizers:] Michal Škrabal
Line 6: Line 6:
 <WRAP right> <WRAP right>
 ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
-^ Positions ^ Number of tokens |  120 443 181 |  117 981 673 |  278 445 878 |  1 556 840 965 | +^ Positions ^ Number of tokens |  120,443,181 |  117,981,673 |  278,445,878 |  1,556,840,965 | 
-^ ::: ^ Number of word forms |  96 956 714 |  89 645 545 |  231 501 606 |  1 228 896 294 | +^ ::: ^ Number of word forms |  96,956,714 |  89,645,545 |  231,501,606 |  1,228,896,294 | 
-^ Structural attributes ^ Number of documents |  1430 |  5 |  2 934 |  89 | +^ Structural attributes ^ Number of documents |  1,430 |  5 |  2,934 |  89 | 
-^ ::: ^ Number of div |  1 430 |  111 263 |  2 934 |  1 849 184 | +^ ::: ^ Number of div |  1,430 |  111,263 |  2,934 |  1,849,184 | 
-^ ::: ^ Number of sentences |  8 308 814 |  13 588 082 |  17 210 601 |  143 478 514 |+^ ::: ^ Number of sentences |  8,308,814 |  13,588,082 |  17,210,601 |  143,478,514 |
 ^ Further information ^ reference |  YES   ^^^^ ^ Further information ^ reference |  YES   ^^^^
 ^ ::: ^ representative |  NO  ^^^^ ^ ::: ^ representative |  NO  ^^^^
Line 25: Line 25:
 After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus. After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
  
-InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial in Czech is available [[kurz:uvod|here]].+InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].
  
 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
  
 A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).
- 
  
 ===== References ===== ===== References =====
Line 44: Line 43:
 When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
  
-Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), version 7 from 19 Dec 2014//. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz+Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), version 7 of 19 Dec 2014//. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz
  
 </WRAP> </WRAP>
Line 51: Line 50:
 The **core** of InterCorp consists mostly of manually aligned fiction. InterCorp also offers a selection of fully automatically processed texts, the so-called **collections**. The selection in the present release includes: The **core** of InterCorp consists mostly of manually aligned fiction. InterCorp also offers a selection of fully automatically processed texts, the so-called **collections**. The selection in the present release includes:
  
-  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.presseurop.eu/|Presseurop]]+  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
   * A package of legal texts of the European Union from the [[http://langtech.jrc.it/JRC-Acquis.html|Acquis Communautaire]] corpus   * A package of legal texts of the European Union from the [[http://langtech.jrc.it/JRC-Acquis.html|Acquis Communautaire]] corpus
   * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus   * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
  
-These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.+These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
  
  
Line 66: Line 65:
  
 [{{:cnk:intercorp_wordcounts3_v9.png|Setup of the parallel corpus – collections}}] [{{:cnk:intercorp_wordcounts3_v9.png|Setup of the parallel corpus – collections}}]
- 
  
 ===== Corpus size in thousands of words ===== ===== Corpus size in thousands of words =====
  
 ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^
-|  ar  | Arabic | 34 |  0 |  0 |  0 |  0 |  0 |  34 | +|  ar  | Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  34 | 
-|  be  | Belarusian |  3 025 |  0 |  0 |  0 |  0 |  0 |  3 025 | +|  be  | Belarusian |  3,025 |  0 |  0 |  0 |  0 |  0 |  3,025 | 
-|  bg  | Bulgarian |  6 007 |  0 |  0 |  13 816 |  9 083 |  0 |  28 907 | +|  bg  | Bulgarian |  6,007 |  0 |  0 |  13,816 |  9,083 |  0 |  28,907 | 
-|  ca  | Catalan |  4 632 |  0 |  0 |  0 |  0 |  0 |  4 632 | +|  ca  | Catalan |  4,632 |  0 |  0 |  0 |  0 |  0 |  4,632 | 
-|  da  | Danish |  3 556 |  0 |  0 |  21 679 |  13 915 |  14 429 |  53 581 | +|  da  | Danish |  3,556 |  0 |  0 |  21,679 |  13,915 |  14,429 |  53,581 | 
-|  de  | German |  31 168 |  3 725 |  2 482 |  21 723 |  13 089 |  8 366 |  80 556 | +|  de  | German |  31,168 |  3,725 |  2,482 |  21,723 |  13,089 |  8,366 |  80,556 | 
-|  el  | Greek |  0 |  0 |  0 |  25 069 |  15 403 |  23 714 |  64 187 | +|  el  | Greek |  0 |  0 |  0 |  25,069 |  15,403 |  23,714 |  64,187 | 
-|  en  | English |  21 208 |  3 818 |  2 670 |  24 207 |  15 580 |  52 101 |  119 586 | +|  en  | English |  21,208 |  3,818 |  2,670 |  24,207 |  15,580 |  52,101 |  119,586 | 
-|  es  | Spanish |  19 310 |  4 324 |  2 816 |  27 001 |  15 885 |  36 378 |  105 716 | +|  es  | Spanish |  19,310 |  4,324 |  2,816 |  27,001 |  15,885 |  36,378 |  105,716 | 
-|  et  | Estonian |  0 |  0 |  0 |  15 962 |  10 899 |  10 296 |  37 158 | +|  et  | Estonian |  0 |  0 |  0 |  15,962 |  10,899 |  10,296 |  37,158 | 
-|  fi  | Finnish |  3 645 |  0 |  0 |  16 455 |  10 175 |  15 097 |  45 373 | +|  fi  | Finnish |  3,645 |  0 |  0 |  16,455 |  10,175 |  15,097 |  45,373 | 
-|  fr  | French |  12 406 |  4 393 |  2 928 |  27 351 |  17 178 |  25 961 |  90 219 | +|  fr  | French |  12,406 |  4,393 |  2,928 |  27,351 |  17,178 |  25,961 |  90,219 | 
-|  he  | Hebrew |  0 |  0 |  0 |  0 |  0 |  16 221 |  16 221 |+|  he  | Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  16,221 |
 |  hi  | Hindi |  408 |  0 |  0 |  0 |  0 |  0 |  408 | |  hi  | Hindi |  408 |  0 |  0 |  0 |  0 |  0 |  408 |
-|  hr  | Croatian |  19 980 |  0 |  0 |  0 |  0 |  19 042 |  39 023 | +|  hr  | Croatian |  19,980 |  0 |  0 |  0 |  0 |  19,042 |  39,023 | 
-|  hu  | Hungarian |  5 818 |  0 |  0 |  19 176 |  12 306 |  21 239 |  58 541 | +|  hu  | Hungarian |  5,818 |  0 |  0 |  19,176 |  12,306 |  21,239 |  58,541 | 
-|  is  | Icelandic |  0 |  0 |  0 |  0 |  0 |  1 584 |  1 584 | +|  is  | Icelandic |  0 |  0 |  0 |  0 |  0 |  1,584 |  1,584 | 
-|  it  | Italian |  8 694 |  651 |  2 707 |  24 849 |  15 489 |  14 653 |  67 046 |+|  it  | Italian |  8,694 |  651 |  2,707 |  24,849 |  15,489 |  14,653 |  67,046 |
 |  ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 | |  ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 |
-|  lt  | Lithuanian |  358 |  0 |  0 |  18 392 |  11 212 |  557 |  30 521 | +|  lt  | Lithuanian |  358 |  0 |  0 |  18,392 |  11,212 |  557 |  30,521 | 
-|  lv  | Latvian |  1 336 |  0 |  0 |  18 709 |  11 682 |  279 |  32 007 +|  lv  | Latvian |  1,666 |  0 |  0 |  24,667 |  13,895 |  381 |  40,609 
-|  mk  | Macedonian |  4 663 |  0 |  0 |  0 |  0 |  1 877 |  6 540 | +|  mk  | Macedonian |  4,663 |  0 |  0 |  0 |  0 |  1,877 |  6,540 | 
-|  ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3 520 |  3 520 | +|  ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3,520 |  3,520 | 
-|  mt  | Maltese |  0 |  0 |  0 |  14 133 |  0 |  0 |  14 133 | +|  mt  | Maltese |  0 |  0 |  0 |  14,133 |  0 |  0 |  14,133 | 
-|  nl  | Dutch |  11 444 |  314 |  2 955 |  24 746 |  15 563 |  29 362 |  84 386 | +|  nl  | Dutch |  11,444 |  314 |  2,955 |  24,746 |  15,563 |  29,362 |  84,386 | 
-|  no  | Norwegian |  4 965 |  0 |  0 |  0 |  0 |  0 |  4 965 | +|  no  | Norwegian |  4,965 |  0 |  0 |  0 |  0 |  0 |  4,965 | 
-|  pl  | Polish |  21 433 |  0 |  2 378 |  20 627 |  12 811 |  26 572 |  83 822 | +|  pl  | Polish |  21,433 |  0 |  2,378 |  20,627 |  12,811 |  26,572 |  83,822 | 
-|  pt  | Portuguese |  2 605 |  369 |  2 999 |  28 602 |  16 484 |  43 391 |  94 454 |+|  pt  | Portuguese |  2,605 |  369 |  2,999 |  28,602 |  16,484 |  43,391 |  94,454 |
 |  rn  | Romani |  5 |  0 |  0 |  0 |  0 |  0 |  5 | |  rn  | Romani |  5 |  0 |  0 |  0 |  0 |  0 |  5 |
-|  ro  | Romanian |  3 432 |  0 |  2 737 |  8 199 |  9 446 |  34 128 |  57 944 | +|  ro  | Romanian |  3,432 |  0 |  2,737 |  8,199 |  9,446 |  34,128 |  57,944 | 
-|  ru  | Russian |  4 788 |  3 174 |  0 |  0 |  0 |  6 885 |  14 848 | +|  ru  | Russian |  4,788 |  3,174 |  0 |  0 |  0 |  6,885 |  14,848 | 
-|  sk  | Slovak |  8 066 |  0 |  0 |  19 222 |  12 734 |  5 134 |  45 158 | +|  sk  | Slovak |  8,066 |  0 |  0 |  19,222 |  12,734 |  5,134 |  45,158 | 
-|  sl  | Slovenian |  2 057 |  0 |  0 |  19 645 |  12 240 |  17 024 |  50 968 | +|  sl  | Slovenian |  2,057 |  0 |  0 |  19,645 |  12,240 |  17,024 |  50,968 | 
-|  sq  | Albanian |  0 |  0 |  0 |  0 |  0 |  2 003 |  2 003 | +|  sq  | Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  2,003 | 
-|  sr  | Serbian |  9 886 |  0 |  0 |  0 |  0 |  20 720 |  30 607 | +|  sr  | Serbian |  9,886 |  0 |  0 |  0 |  0 |  20,720 |  30,607 | 
-|  sv  | Swedish |  8 959 |  0 |  0 |  20 585 |  13 840 |  14 693 |  58 079 | +|  sv  | Swedish |  8,959 |  0 |  0 |  20,585 |  13,840 |  14,693 |  58,079 | 
-|  tr  | Turkish |  0 |  0 |  0 |  0 |  0 |  21 190 |  21 190 | +|  tr  | Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  21,190 | 
-|  uk  | Ukrainian |  7 597 |  0 |  0 |  0 |  0 |  246 |  7 843 | +|  uk  | Ukrainian |  7,597 |  0 |  0 |  0 |  0 |  246 |  7,843 | 
-|  vi  | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1 473 |  1 473 | +|  vi  | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,473 |  1,473 | 
-| **Subtotal** |  |  231 501 |  20 769 |  24 676 |  430 160 |  265 022 |  488 266 |  1 460 397 | +| **Subtotal** |  |  231,501 |  20,769 |  24,676 |  430,160 |  265,022 |  488,266 |  1,460,397 | 
-|  cs  | Czech |  96 956 |  3 416 |  2 315 |  20 303 |  12 922 |  50 688 |  186 602 | +|  cs  | Czech |  96,956 |  3,416 |  2,315 |  20,303 |  12,922 |  50,688 |  186,602 | 
-| **TOTAL** |  |  328 458 |  24 186 |  26 991 |  450 463 |  277 945 |  538 954 |  1 647 000 |+| **TOTAL** |  |  328,458 |  24,186 |  26,991 |  450,463 |  277,945 |  538,954 |  1,647,000 |
  
 N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
Line 124: Line 122:
 ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   | ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
 ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  | ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
-^ Dutch |  ✔  |         |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |+^ Dutch |  ✔  |       [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ English |  ✔    ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ English |  ✔    ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|Enlish]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |+^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|English]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |
 ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]] %%**%%)  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |+^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]((Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.))  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
 ^ Hungarian |  ✔  |          [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  | ^ Hungarian |  ✔  |          [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  |
 ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  | ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  |
Line 144: Line 142:
 ^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  | ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  |
 +
  
 Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//), each with its own lemma and tag. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be entered as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma. Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//), each with its own lemma and tag. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be entered as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.
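
The split parts of a contracted form can also be written out as an explicit sequence of token conditions. The following is a minimal Python sketch, assuming the CQL-style notation ''[word="…"]'' for one corpus position; the helper name ''phrase_to_cql'' is hypothetical and not part of KonText.

```python
def phrase_to_cql(parts):
    """Return one [word="..."] condition per split part, in the given order.

    Assumption: each part is a literal word form containing no double quotes.
    """
    return " ".join('[word="{}"]'.format(part) for part in parts)


# "can't" is tokenized as ca + n't, so the query has to cover two positions.
print(phrase_to_cql(["ca", "n't"]))
# Polish "byłam" is tokenized as była + m.
print(phrase_to_cql(["była", "m"]))
```

The sketch only illustrates why a contracted form occupies more than one position; typing the parts separated by a space as a **Phrase** should have the same effect.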
  
 Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$". Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$".
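
The escaping rule can be sketched as a small helper. This is a minimal Python illustration, assuming a conservative set of regular-expression metacharacters; the names ''SPECIAL'', ''escape_tag'' and ''tag_query'' are hypothetical and not part of any corpus tool.

```python
# Characters treated here as regex metacharacters (an assumed, conservative set).
SPECIAL = set(r"\.^$*+?()[]{}|")


def escape_tag(tag):
    """Prefix every metacharacter in a literal tag with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in tag)


def tag_query(tag):
    """Build a tag condition that matches the tag literally."""
    return 'tag="{}"'.format(escape_tag(tag))


print(tag_query("wp$"))
```

Escaping is appropriate only when the tag is meant literally. A deliberate pattern such as ''N.*'' must be left unescaped.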
- 
- 
 ====Structural attributes==== ====Structural attributes====
  
Line 196: Line 193:
 ==== Pre-processing ==== ==== Pre-processing ====
  
-  * parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička+  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
   * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]   * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
   * Sentence splitter for Czech by Pavel Květoň   * Sentence splitter for Czech by Pavel Květoň
Line 215: Line 212:
   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)
   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
 +  * [[https://github.com/uzh/reldi/tree/master/tools/tagger|ReLDI tagger]] for Croatian and Serbian (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]]) 
 +  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)