
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze13 [2020/10/25 21:04] – [See also] Alexandr Rosen
en:cnk:intercorp:verze13 [2021/09/18 12:24] (current) – [Texts in the corpus] Alexandr Rosen
Line 36: Line 36:
 When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
  
-Rosen, A., Vavřín, M., Zasina, A. J. (2019). //The InterCorp Corpus – Czech((Insert actually used languages.)), version 12 of 19 December 2019//. Institute of the Czech National Corpus, Charles University, Prague 2019. Available on-line: https://kontext.korpus.cz/
+Rosen, A., Vavřín, M., Zasina, A. J. (2020). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 13 of 1 November 2020//. Institute of the Czech National Corpus, Charles University, Prague 2020. Available on-line: https://kontext.korpus.cz/
  
 </WRAP> </WRAP>
Line 51: Line 51:
 These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 from December 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
  
  
-[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}]
+[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}] \\
  
-[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}]
+[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}] \\
  
 [{{:cnk:intercorp:intercorp_wordcounts3_v13.png|Setup of the parallel corpus – collections}}]
Line 62: Line 62:
 ===== Corpus size in thousands of words =====
  
-^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ + Language  ^^  Core   Syndicate   Presseurop   Acquis   Europarl   Subtitles   Bible   Total  
-ar  Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  0 |  34 | +^  ar  Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  0 |  34 | 
-be  Balarusian |  5,718 |  0 |  0 |  0 |  0 |  0 |  0 |  5,718 | +^  be  ^ Belarusian |  5,718 |  0 |  0 |  0 |  0 |  0 |  0 |  5,718 | 
- bg  Bulgarian |  7,068 |  0 |  0 |  13,577 |  9,083 |  0 |  0 |  29,728 | + bg  Bulgarian |  7,068 |  0 |  0 |  13,577 |  9,083 |  0 |  0 |  29,728 | 
- ca  Catalan |  7,938 |  0 |  0 |  0 |  0 |  0 |  736 |  8,674 | + ca  Catalan |  7,938 |  0 |  0 |  0 |  0 |  0 |  736 |  8,674 | 
- da  Danish |  7,136 |  0 |  0 |  20,313 |  13,916 |  14,429 |  657 |  56,451 | + da  Danish |  7,136 |  0 |  0 |  20,313 |  13,916 |  14,429 |  657 |  56,451 | 
- de  German |  37,633 |  4,704 |  2,483 |  20,610 |  13,088 |  8,392 |  724 |  87,634 | + de  German |  37,633 |  4,704 |  2,483 |  20,610 |  13,088 |  8,392 |  724 |  87,634 | 
- el  Greek |  0 |  0 |  0 |  23,853 |  15,404 |  23,709 |  0 |  62,966 | + el  Greek |  0 |  0 |  0 |  23,853 |  15,404 |  23,709 |  0 |  62,966 | 
- en  English |  33,569 |  4,856 |  2,670 |  22,902 |  15,576 |  52,106 |  730 |  132,409 | + en  English |  33,569 |  4,856 |  2,670 |  22,902 |  15,576 |  52,106 |  730 |  132,409 | 
- es  Spanish |  26,554 |  5,614 |  2,859 |  26,262 |  16,249 |  36,650 |  0 |  114,187 | + es  Spanish |  26,554 |  5,614 |  2,859 |  26,262 |  16,249 |  36,650 |  0 |  114,187 | 
- et  Estonian |  0 |  0 |  0 |  14,896 |  10,899 |  10,298 |  0 |  36,093 | + et  Estonian |  0 |  0 |  0 |  14,896 |  10,899 |  10,298 |  0 |  36,093 | 
- fi  Finnish |  5,656 |  0 |  0 |  15,269 |  10,108 |  15,047 |  543 |  46,622 | + fi  Finnish |  5,656 |  0 |  0 |  15,269 |  10,108 |  15,047 |  543 |  46,622 | 
- fr  French |  19,773 |  5,600 |  3,046 |  26,200 |  17,179 |  25,986 |  764 |  98,547 | + fr  French |  19,773 |  5,600 |  3,046 |  26,200 |  17,179 |  25,986 |  764 |  98,547 | 
- he  Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  0 |  16,221 | + he  Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  0 |  16,221 | 
- hi  Hindi |  409 |  0 |  0 |  0 |  0 |  0 |  0 |  409 | + hi  Hindi |  409 |  0 |  0 |  0 |  0 |  0 |  0 |  409 | 
- hr  Croatian |  21,923 |  0 |  0 |  0 |  0 |  19,048 |  571 |  41,543 | + hr  Croatian |  21,923 |  0 |  0 |  0 |  0 |  19,048 |  571 |  41,543 | 
- hu  Hungarian |  6,444 |  0 |  0 |  17,852 |  12,198 |  21,115 |  0 |  57,609 | + hu  Hungarian |  6,444 |  0 |  0 |  17,852 |  12,198 |  21,115 |  0 |  57,609 | 
- is  Icelandic |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 | + is  Icelandic |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 | 
- it  Italian |  14,525 |  1,252 |  2,747 |  23,771 |  15,494 |  14,700 |  684 |  73,174 | + it  Italian |  14,525 |  1,252 |  2,747 |  23,771 |  15,494 |  14,700 |  684 |  73,174 | 
- ja  Japanese |  2,189 |  0 |  0 |  0 |  0 |  477 |  0 |  2,666 | + ja  Japanese |  2,189 |  0 |  0 |  0 |  0 |  477 |  0 |  2,666 | 
- lt  Lithuanian |  421 |  0 |  0 |  17,316 |  11,213 |  558 |  471 |  29,979 | + lt  Lithuanian |  421 |  0 |  0 |  17,316 |  11,213 |  558 |  471 |  29,979 | 
- lv  Latvian |  2,646 |  0 |  0 |  17,522 |  11,682 |  280 |  537 |  32,667 | + lv  Latvian |  2,646 |  0 |  0 |  17,522 |  11,682 |  280 |  537 |  32,667 | 
- mk  Macedonian |  8,881 |  0 |  0 |  0 |  0 |  1,877 |  0 |  10,758 | + mk  Macedonian |  8,881 |  0 |  0 |  0 |  0 |  1,877 |  0 |  10,758 | 
- ms  Malay |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 | + ms  Malay |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 | 
- mt  Maltese |  0 |  0 |  0 |  13,935 |  0 |  0 |  0 |  13,935 | + mt  Maltese |  0 |  0 |  0 |  13,935 |  0 |  0 |  0 |  13,935 | 
- nl  Dutch |  16,216 |  813 |  2,953 |  23,416 |  15,558 |  29,373 |  717 |  89,045 | + nl  Dutch |  16,216 |  813 |  2,953 |  23,416 |  15,558 |  29,373 |  717 |  89,045 | 
- no  Norwegian |  7,727 |  0 |  0 |  0 |  0 |  0 |  722 |  8,449 | + no  Norwegian |  7,727 |  0 |  0 |  0 |  0 |  0 |  722 |  8,449 | 
- pl  Polish |  26,200 |  0 |  2,380 |  19,604 |  12,817 |  26,576 |  583 |  88,161 | + pl  Polish |  26,200 |  0 |  2,380 |  19,604 |  12,817 |  26,576 |  583 |  88,161 | 
- pt  Portuguese |  4,981 |  554 |  2,782 |  24,598 |  15,193 |  41,468 |  706 |  90,282 | + pt  Portuguese |  4,981 |  554 |  2,782 |  24,598 |  15,193 |  41,468 |  706 |  90,282 | 
- rn  Romani |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 | + rn  Romani |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 | 
- ro  Romanian |  4,219 |  0 |  2,738 |  8,092 |  9,446 |  34,128 |  0 |  58,622 | + ro  Romanian |  4,219 |  0 |  2,738 |  8,092 |  9,446 |  34,128 |  0 |  58,622 | 
- ru  Russian |  8,642 |  3,984 |  0 |  0 |  0 |  6,887 |  565 |  20,078 | + ru  Russian |  8,642 |  3,984 |  0 |  0 |  0 |  6,887 |  565 |  20,078 | 
- sk  Slovak |  8,543 |  0 |  0 |  18,399 |  12,727 |  5,133 |  561 |  45,363 | + sk  Slovak |  8,543 |  0 |  0 |  18,399 |  12,727 |  5,133 |  561 |  45,363 | 
- sl  | Slovenian |  3,871 |  0 |  0 |  18,528 |  12,251 |  17,061 |  0 |  51,711 | + sl  ^ Slovene |  3,871 |  0 |  0 |  18,528 |  12,251 |  17,061 |  0 |  51,711 | 
- sq  Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 | + sq  Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 | 
- sr  Serbian |  11,582 |  0 |  0 |  0 |  0 |  20,727 |  0 |  32,308 | + sr  Serbian |  11,582 |  0 |  0 |  0 |  0 |  20,727 |  0 |  32,308 | 
- sv  Swedish |  15,790 |  0 |  0 |  19,542 |  13,784 |  14,666 |  638 |  64,419 | + sv  Swedish |  15,790 |  0 |  0 |  19,542 |  13,784 |  14,666 |  638 |  64,419 | 
- tr  Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  0 |  21,190 | + tr  Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  0 |  21,190 | 
- uk  Ukrainian |  11,459 |  0 |  0 |  0 |  0 |  244 |  596 |  12,299 | + uk  Ukrainian |  11,459 |  0 |  0 |  0 |  0 |  244 |  596 |  12,299 | 
- vi  Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,474 |  0 |  1,474 | + vi  Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,474 |  0 |  1,474 | 
- zh  Chinese |  127 |  240 |  0 |  0 |  0 |  2,247 |  0 |  2,614 | + zh  Chinese |  127 |  240 |  0 |  0 |  0 |  2,247 |  0 |  2,614 | 
-**Subtotal** |   |  327,887 |  27,616 |  24,658 |  406,459 |  263,864 |  489,169 |  11,504 |  1,551,157 | +**Subtotal**  ^|  327,887 |  27,616 |  24,658 |  406,459 |  263,864 |  489,169 |  11,504 |  1,551,157 | 
- cs  |   |  113,839 |  4,351 |  2,310 |  19,085 |  12,908 |  50,604 |  562 |  203,658 | + cs  ^ Czech |  113,839 |  4,351 |  2,310 |  19,085 |  12,908 |  50,604 |  562 |  203,658 | 
-**TOTAL** |   |  441,725 |  31,967 |  26,968 |  425,543 |  276,772 |  539,774 |  12,066 |  1,754,815 |+**TOTAL**  ^|  441,725 |  31,967 |  26,968 |  425,543 |  276,772 |  539,774 |  12,066 |  1,754,815 |
  
 N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
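The summary rows of the table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the figures are copied from the **Subtotal**, //cs// and **TOTAL** rows (in thousands of words), while the function names and the tolerance of one unit are assumptions made here to absorb rounding in the published figures.

```python
# Figures in thousands of words, copied from the Subtotal, cs and TOTAL rows.
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis",
           "Europarl", "Subtitles", "Bible", "Total"]
SUBTOTAL = [327_887, 27_616, 24_658, 406_459, 263_864, 489_169, 11_504, 1_551_157]
CZECH = [113_839, 4_351, 2_310, 19_085, 12_908, 50_604, 562, 203_658]
TOTAL = [441_725, 31_967, 26_968, 425_543, 276_772, 539_774, 12_066, 1_754_815]


def column_mismatches(tolerance=1):
    """Return the columns where Subtotal + cs differs from TOTAL
    by more than `tolerance` (hypothetical allowance for rounding)."""
    return [
        name
        for name, sub, cs, total in zip(COLUMNS, SUBTOTAL, CZECH, TOTAL)
        if abs(sub + cs - total) > tolerance
    ]


def split_core_collections(row):
    """Return (core, collections) in millions of words, rounded.

    The first value of a row is the core, the last is the row total,
    and everything in between is a collection.
    """
    core = row[0]
    collections = sum(row[1:-1])
    return round(core / 1000), round(collections / 1000)


if __name__ == "__main__":
    print(column_mismatches())               # columns off by more than 1
    print(split_core_collections(SUBTOTAL))  # foreign-language texts
    print(split_core_collections(CZECH))     # Czech texts
```

With a tolerance of one unit every column is consistent; with a tolerance of zero, the Core, Acquis and Subtitles columns differ by exactly one thousand words, which looks like rounding rather than an error.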
Line 112: Line 112:
 Texts in the following languages have received some morphosyntactic annotation. The format and often even the meaning of categories encoded in the morphosyntactic tags differ in most languages. Thus for each tagged language we provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface.
  
-^  Language  ^  Tags  ^  Lemmas  ^  Brief description  ^  Detailed description  Tool  ^ +^  Language  ^  Tags  ^  Lemmas  ^  Brief description  ^  Detailed description Tags in the corpus ^ Tool  ^ 
-^ Belarusian |  ✔  |   ✔   |     |  [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)  |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]] +^ Belarusian |  ✔  |   ✔    [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)  |  [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%)  |   [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_be&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] 
-^ Bulgarian |  ✔  |   ✔    |  [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]]  |  [[http://bultreebank.org/en/resources/short-description-dependency-part-bultreebank-bultreebank-dp/btb-tr03-2/|in English]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Bulgarian |  ✔  |   ✔   |  [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]]   |  [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]]   [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_bg&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Catalan |  ✔  |  ✔  |  [[http://clic.ub.edu/corpus/webfm_send/18|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Catalan |  ✔  |  ✔  |  [[http://clic.ub.edu/corpus/webfm_send/18|in English]]  |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ca&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Chinese |  ✔  |    |  [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]]  |  [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]]  |  [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] +^ Chinese |  ✔  |    |  [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]]  |  [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_zh&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] 
-^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]     |  [[https://github.com/uzh/reldi|ReLDI Tagger]]   | +^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]]   |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_hr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]   | 
-^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]]  |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] +^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_cs&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] 
-^ Dutch |  ✔  |   ✔    |   [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |    [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Dutch |  ✔  |  ✔    |   [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |   |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_nl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ English |  ✔    ✔  |  [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] +^ English |  ✔    ✔  |  [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_en&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]]  |       [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_et&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Finnish |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/finntreebank/|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] + [[https://code.google.com/archive/p/hunpos/|HunPOS]] +^ Finnish |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%)  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_fi&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] 
-^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ French |  ✔  |  ✔  |  [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]]  |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_fr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ German |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]]%%**%%  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] +^ German |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_de&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] 
-^ Hungarian |  ✔  |   |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v12_hu&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=|List]]  |  [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]]  |   [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] +^ Hungarian |  ✔  |     |    |  [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_hu&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] 
-^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]       |  [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] +^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]    [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_is&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] 
-^ Italian |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Italian |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]]   |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_it&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Japanese |  ✔  |  ✔  |  [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]]        [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] +^ Japanese |  ✔  |  ✔  |  [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]]       |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ja&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] 
-^ Latvian |  ✔  |  ✔  |   [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]]  |      [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] +^ Latvian |  ✔  |  ✔  |   [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]]  |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_lv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] 
-^ Norwegian |  ✔  |  ✔  | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]]      [[https://visl.sdu.dk/remoting.html|VISL]]  | +^ Norwegian |  ✔  |  ✔  |  [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]]       [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_no&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/noklesta/The-Oslo-Bergen-Tagger|Oslo-Bergen Tagger]]  | 
-^ Polish |  ✔  |  ✔  |  [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]]  |  [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]]  |  [[http://sgjp.pl/morfeusz/|Morfeusz]] [[https://github.com/kwrobel-nlp/krnnt|KRNNT]]   +^ Polish |  ✔  |  ✔  |  [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]]  |  [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_pl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://sgjp.pl/morfeusz/|Morfeusz]][[https://github.com/kwrobel-nlp/krnnt|KRNNT]]  
-^ Portuguese |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Portuguese |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]]  |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_pt&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]]%%***%%  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%% |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ru&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Slovak |  ✔  |  ✔  |  [[https://korpus.sk/morpho_en.html/|in English]]  |  [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]]  |  [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] +^ Slovak |  ✔  |  ✔  |  [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]]  |  [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] 
-^ Slovene |  ✔  |  ✔  |  [[https://www.sketchengine.eu/slovene-tagset-multext-east-v3/|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.html|in English]]  |  [[http://nl2.ijs.si/analyze/|ToTaLe]]  | +^ Slovene |  ✔  |  ✔  |   |  [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]  | 
-^ Serbian |  ✔  |  ✔  |  [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]]  |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]  |  [[https://github.com/uzh/reldi|ReLDI Tagger]]   | +^ Serbian |  ✔  |  ✔  |  [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]]  |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]   |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]   | 
-^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] +^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |     |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_es&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] 
-^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] +^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]       |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] 
-^ Ukrainian |  ✔  |  ✔  |  |  [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)   |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]]  |+^ Ukrainian |  ✔  |  ✔  |  |  [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_uk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]]  |
  
  
Line 151: Line 151:
 <wrap lo>%%****%%) The tag is in the UD (Universal Dependencies) format; components of the tag are separated by a vertical bar (|), e.g. the form школы in genitive singular is tagged as: ''NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing''. The query can be specified in the same way as for other languages, treating the tag as a string, i.e. ''[tag=%%"NOUN.*Case=Gen\|Gender=Fem.*"%%]'', or the tag components can be specified separately: ''[tag=%%"Case=Gen"%% & tag=%%"NOUN"%% & tag=%%"Gender=Fem"%%]'' (the order of categories is not significant). The result is identical in either case.</wrap>
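The two query styles from the note on UD tags can be imitated locally. This is a minimal sketch, not a description of how KonText evaluates queries: it assumes that the string-style query behaves like a regular expression matched against the whole tag, and that the component-style query tests each bar-separated component on its own. The function names are hypothetical.

```python
import re

# Example tag quoted in the note on UD tags.
EXAMPLE_TAG = "NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"


def matches_string_query(tag, pattern=r"NOUN.*Case=Gen\|Gender=Fem.*"):
    """Treat the tag as one string and match the whole of it (assumption)."""
    return re.fullmatch(pattern, tag) is not None


def matches_component_query(tag, required=("Case=Gen", "NOUN", "Gender=Fem")):
    """Check every required component separately; order does not matter."""
    components = set(tag.split("|"))
    return all(item in components for item in required)


if __name__ == "__main__":
    for tag in (EXAMPLE_TAG, "NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing"):
        print(tag, matches_string_query(tag), matches_component_query(tag))
```

In this model the string form only matches when ''Case=Gen'' is immediately followed by ''Gender=Fem'', so it depends on the order of features in the tag, while the component form does not.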
    
-Tag formats specified in tagset descriptions differ from those actually used in the corpus also in some other languages. Please check the tag format before making a tag query if you are not sure. In a page displaying results open the **View/Corpus-specific settings...** menu to check the //tag// option in the **Positional attributes** box and choose the //for each token// option in the **Viewing options** box.
+Tag formats specified in tagset descriptions differ from those actually used in the corpus also in some other languages. Please check the tag format before making a tag query if you are not sure. You can have all tags used in the corpus for a given language listed – see the column **Tags in the corpus** in the table above. Or in a page displaying results open the **View/Corpus-specific settings...** menu to check the //tag// option in the **Positional attributes** box and choose the //for each token// option in the **Viewing options** box.
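The links in the **Tags in the corpus** column all follow one pattern and differ only in the language code inside the corpus name. The sketch below rebuilds such a link; the parameter names and values are copied from the links in the table, while the function name and the ''version'' argument are assumptions for illustration. Whether KonText keeps accepting these parameters is not guaranteed.

```python
from urllib.parse import urlencode

BASE_URL = "https://kontext.korpus.cz/wordlist/result"


def tag_list_url(lang, version=13):
    """Build the word-list URL that lists all tags for one language.

    `lang` is the two-letter code used in the corpus name, e.g. "be".
    """
    params = [
        ("wlnums", "frq"),
        ("wlpat", ".*"),
        ("blhash", ""),
        ("include_nonwords", "0"),
        ("wlsort", "f"),
        ("corpname", f"intercorp_v{version}_{lang}"),
        ("wlattr", "tag"),
        ("usesubcorp", ""),
        ("wlminfreq", "1"),
        ("wlhash", ""),
        ("wlpage", "1"),
    ]
    # Keep "*" unescaped so the result looks like the links in the table.
    return BASE_URL + "?" + urlencode(params, safe="*")


if __name__ == "__main__":
    print(tag_list_url("be"))
```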
  
 Queries for contracted forms in tagged or lemmatized texts may fail. This includes forms such as //can't// or //I'm//, which are split by the tagger into two parts (//ca//+//n't// and //I//+//'m//) with corresponding lemmas and tags. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be typed in as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.
Line 225:
   * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík)
   * [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička)
-  * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (thanks to Tomaž Erjavec)
+  * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (until Release 11, thanks to Tomaž Erjavec)
   * [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] for German
   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)
   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
-  *  [[https://github.com/uzh/reldi/tree/master/tools/tagger|RelDI tagger]] for Croatian and Serbian (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]])
+  *  [[https://github.com/clarinsi/reldi-tagger|RelDI tagger]] for Croatian, Serbian((Ljubešić, N., Klubička, F., Agić, Ž., and Jazbec, I.-P. (2016). New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Calzolari, N. et al., editors, //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)//, Paris, France. European Language Resources Association (ELRA).)) and Slovene((Ljubešić, N. and Erjavec, T. (2016). Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene. In Calzolari, N. et al., editors, //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)//, Paris, France. European Language Resources Association (ELRA).)) (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]])
   * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)
   * [[http://ufal.mff.cuni.cz/udpipe|UD Pipe]] for Belarusian and Ukrainian (thanks to Bohdan Moskalevskyi)