en:cnk:intercorp:verze15 [2022/11/23 13:25] – [Texts in the corpus] alexandrrosen | en:cnk:intercorp:verze15 [2024/04/18 13:47] (current) – [Morphosyntactic annotation] jankocek |
---|
| |
^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ | ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ |
^ ar ^ Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | | ^ ar ^ Arabic | 34 | 384 | 0 | 0 | 0 | 0 | 0 | 418 | |
^ be ^ Belarusian| 6 094 | 0 | 0 | 0 | 0 | 0 | 0 | 6 094 | | ^ be ^ Belarusian | 6 524 | 0 | 0 | 0 | 0 | 0 | 0 | 6 524 | |
^ bg ^ Bulgarian | 7 068 | 0 | 0 | 13 577 | 9 083 | 0 | 0 | 29 728 | | ^ bg ^ Bulgarian | 7 068 | 0 | 0 | 13 577 | 9 083 | 0 | 0 | 29 728 | |
^ ca ^ Catalan | 8 920 | 0 | 0 | 0 | 0 | 0 | 736 | 9 656 | | ^ ca ^ Catalan | 8 920 | 0 | 0 | 0 | 0 | 0 | 736 | 9 656 | |
^ da ^ Danish | 7 576 | 0 | 0 | 20 313 | 13 916 | 14 429 | 657 | 56 891 | | ^ da ^ Danish | 8 456 | 0 | 0 | 20 313 | 13 916 | 14 429 | 657 | 57 770 | |
^ de ^ German | 38 475 | 4 704 | 2 483 | 20 610 | 13 088 | 8 392 | 724 | 88 476 | | ^ de ^ German | 39 412 | 5 067 | 2 483 | 20 610 | 13 088 | 8 392 | 724 | 89 776 | |
^ el ^ Greek | 0 | 0 | 0 | 23 853 | 15 404 | 23 709 | 0 | 62 966 | | ^ el ^ Greek | 0 | 0 | 0 | 23 853 | 15 404 | 23 709 | 0 | 62 966 | |
^ en ^ English | 36 198 | 4 856 | 2 670 | 22 902 | 15 576 | 52 106 | 730 | 135 038 | | ^ en ^ English | 38 706 | 5 273 | 2 670 | 22 902 | 15 576 | 52 106 | 730 | 137 964 | |
^ es ^ Spanish | 28 115 | 5 614 | 2 859 | 26 262 | 16 249 | 36 650 | 0 | 115 748 | | ^ es ^ Spanish | 29 145 | 6 074 | 2 859 | 26 262 | 16 249 | 36 650 | 0 | 117 239 | |
^ et ^ Estonian | 0 | 0 | 0 | 14 896 | 10 899 | 10 298 | 0 | 36 093 | | ^ et ^ Estonian | 0 | 0 | 0 | 14 896 | 10 899 | 10 298 | 0 | 36 093 | |
^ fi ^ Finnish | 6 226 | 0 | 0 | 15 269 | 10 108 | 15 047 | 543 | 47 192 | | ^ fi ^ Finnish | 6 674 | 0 | 0 | 15 269 | 10 108 | 15 047 | 543 | 47 641 | |
^ fr ^ French | 21 279 | 5 600 | 3 046 | 26 200 | 17 179 | 25 986 | 764 | 100 054 | | ^ fr ^ French | 21 996 | 5 896 | 3 046 | 26 200 | 17 179 | 25 986 | 764 | 101 067 | |
^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 0 | 16 221 | | ^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 0 | 16 221 | |
^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 | | ^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 | |
^ hr ^ Croatian | 22 736 | 0 | 0 | 0 | 0 | 19 048 | 571 | 42 356 | | ^ hr ^ Croatian | 23 351 | 0 | 0 | 0 | 0 | 19 048 | 571 | 42 971 | |
^ hs ^ Upper Sorbian | 110 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | | ^ hs ^ Upper Sorbian | 128 | 0 | 0 | 0 | 0 | 0 | 0 | 128 | |
^ hu ^ Hungarian | 6 444 | 0 | 0 | 17 852 | 12 198 | 21 115 | 0 | 57 609 | | ^ hu ^ Hungarian | 6 922 | 8 | 0 | 17 852 | 12 198 | 21 115 | 0 | 58 095 | |
^ is ^ Icelandic| 0 | 0 | 0 | 0 | 0 | 1 581 | 0 | 1 581 | | ^ is ^ Icelandic | 0 | 0 | 0 | 0 | 0 | 1 581 | 0 | 1 581 | |
^ it ^ Italian | 15 741 | 1 252 | 2 747 | 23 771 | 15 494 | 14 700 | 684 | 74 389 | | ^ it ^ Italian | 16 384 | 1 389 | 2 747 | 23 771 | 15 494 | 14 700 | 684 | 75 169 | |
^ ja ^ Japanese | 3 147 | 0 | 0 | 0 | 0 | 477 | 0 | 3 624 | | ^ ja ^ Japanese | 3 491 | 2 | 0 | 0 | 0 | 477 | 0 | 3 970 | |
^ lt ^ Lithuanian| 502 | 0 | 0 | 17 316 | 11 213 | 558 | 471 | 30 059 | | ^ lt ^ Lithuanian | 502 | 0 | 0 | 17 316 | 11 213 | 558 | 471 | 30 059 | |
^ lv ^ Latvian | 3 031 | 0 | 0 | 17 522 | 11 682 | 280 | 537 | 33 052 | | ^ lv ^ Latvian | 3 437 | 0 | 0 | 17 522 | 11 682 | 280 | 537 | 33 458 | |
^ mk ^ Macedonian | 8 881 | 0 | 0 | 0 | 0 | 1 877 | 0 | 10 758 | | ^ mk ^ Macedonian | 8 881 | 0 | 0 | 0 | 0 | 1 877 | 0 | 10 758 | |
^ ms ^ Malay | 0 | 0 | 0 | 0 | 0 | 3 521 | 0 | 3 521 | | ^ ms ^ Malay | 0 | 0 | 0 | 0 | 0 | 3 521 | 0 | 3 521 | |
^ mt ^ Maltese | 0 | 0 | 0 | 13 935 | 0 | 0 | 0 | 13 935 | | ^ mt ^ Maltese | 0 | 0 | 0 | 13 935 | 0 | 0 | 0 | 13 935 | |
^ nl ^ Dutch | 16 691 | 813 | 2 953 | 23 416 | 15 558 | 29 373 | 717 | 89 520 | | ^ nl ^ Dutch | 17 769 | 812 | 2 953 | 23 416 | 15 558 | 29 373 | 717 | 90 598 | |
^ no ^ Norwegian | 7 818 | 0 | 0 | 0 | 0 | 0 | 722 | 8 540 | | ^ no ^ Norwegian | 7 851 | 0 | 0 | 0 | 0 | 0 | 724 | 8 575 | |
^ pl ^ Polish | 27 669 | 0 | 2 380 | 19 604 | 12 817 | 26 576 | 583 | 89 630 | | ^ pl ^ Polish | 28 112 | 0 | 2 380 | 19 604 | 12 817 | 26 576 | 583 | 90 072 | |
^ pt ^ Portuguese | 6 245 | 554 | 2 782 | 24 598 | 15 193 | 41 468 | 706 | 91 546 | | ^ pt ^ Portuguese | 6 943 | 739 | 2 782 | 24 598 | 15 193 | 41 468 | 706 | 92 429 | |
^ rn ^ Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | | ^ rn ^ Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | |
^ ro ^ Romanian | 4 219 | 0 | 2 738 | 8 092 | 9 446 | 34 128 | 0 | 58 622 | | ^ ro ^ Romanian | 4 219 | 0 | 2 738 | 8 092 | 9 446 | 34 128 | 0 | 58 622 | |
^ ru ^ Russian | 10 510 | 3 984 | 0 | 0 | 0 | 6 887 | 565 | 21 946 | | ^ ru ^ Russian | 10 549 | 4 302 | 0 | 0 | 0 | 6 887 | 565 | 22 303 | |
^ sk ^ Slovak | 8 543 | 0 | 0 | 18 399 | 12 727 | 5 133 | 561 | 45 363 | | ^ sk ^ Slovak | 8 596 | 0 | 0 | 18 399 | 12 727 | 5 133 | 561 | 45 416 | |
^ sl ^ Slovene | 4 097 | 0 | 0 | 18 515 | 12 241 | 17 035 | 0 | 51 888 | | ^ sl ^ Slovene | 4 354 | 0 | 0 | 18 515 | 12 241 | 17 035 | 0 | 52 144 | |
^ sq ^ Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 0 | 2 003 | | ^ sq ^ Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 0 | 2 003 | |
^ sr ^ Serbian | 12 014 | 0 | 0 | 0 | 0 | 20 727 | 0 | 32 741 | | ^ sr ^ Serbian | 12 356 | 0 | 0 | 0 | 0 | 20 727 | 0 | 33 082 | |
^ sv ^ Swedish | 17 590 | 0 | 0 | 19 542 | 13 784 | 14 666 | 638 | 66 220 | | ^ sv ^ Swedish | 17 877 | 0 | 0 | 19 542 | 13 784 | 14 666 | 638 | 66 507 | |
^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 0 | 21 190 | | ^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 0 | 21 190 | |
^ uk ^ Ukrainian | 12 172 | 0 | 0 | 0 | 0 | 244 | 596 | 13 011 | | ^ uk ^ Ukrainian | 12 712 | 0 | 0 | 0 | 0 | 244 | 596 | 13 551 | |
^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 474 | 0 | 1 474 | | ^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 474 | 0 | 1 474 | |
^ zh ^ Chinese | 202 | 240 | 0 | 0 | 0 | 2 247 | 0 | 2 689 | | ^ zh ^ Chinese | 202 | 604 | 0 | 0 | 0 | 2 247 | 0 | 3 054 | |
^ **Subtotal** ^ | 348 770 | 27 617 | 24 658 | 406 444 | 263 855 | 489 146 | 11 505 | 1 571 991 | | ^ **Subtotal** ^ | 361 991 | 30 552 | 24 658 | 406 445 | 263 854 | 489 143 | 11 507 | 1 588 151 | |
^ cs ^ Czech | 117 606 | 4 351 | 2 310 | 19 085 | 12 908 | 50 604 | 562 | 207 426 | | ^ cs ^ Czech | 119 933 | 4 712 | 2 310 | 19 085 | 12 908 | 50 604 | 562 | 210 114 | |
^ **TOTAL** ^ | 466 376 | 31 968 | 26 968 | 425 529 | 276 763 | 539 750 | 12 067 | 1 779 417 | | ^ **TOTAL** ^ | 481 925 | 35 264 | 26 968 | 425 530 | 276 763 | 539 747 | 12 069 | 1 798 266 | |
| |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. |
| |
^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tags in the corpus ^ Tool ^ | ^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tags in the corpus ^ Tool ^ |
^ Belarusian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_be&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | | ^ Belarusian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~ju0ayEyoeIOi|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | |
^ Bulgarian | ✔ | ✔ | [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]] | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_bg&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Bulgarian | ✔ | ✔ | [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]] | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~b6IUUoMyUs8O|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_ca&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~cOI6eWQG0c8O|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Chinese | ✔ | | [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]] | [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_zh&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] | | ^ Chinese | ✔ | | [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]] | [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~uwCay4cSYSy2|list]] | [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] | |
^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_hr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~CeqE4wiqmIoA|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_cs&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | | ^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~wK68uwI0uWiW|list]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | |
^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_nl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~KSoiyk0CuCCc|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ English | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_en&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ English | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~SYU20meuus0a|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_et&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~mWSCSIKm8OcY|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_fi&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | | ^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~6iw6q2e06KcI|list]] |[[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | |
^ French | ✔ | ✔ | [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_fr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ French | ✔ | ✔ | [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~m6aC4MMkssms|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_de&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~u4ISOKym04am|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Hungarian | ✔ | | | [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_hu&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ Hungarian | ✔ | | | [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~jSyOE2A2KKsQ|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_is&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | | ^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~bEoEKqasyiEe|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | |
^ Italian | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_it&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Italian | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~fmIIwaQqWGqm|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Japanese | ✔ | ✔ | [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_ja&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] | | ^ Japanese | ✔ | ✔ | [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~hIOk8CYaIMqm|list]] | [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] | |
^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_lv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | | ^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~GeQ8SSOCouq0|list]] | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | |
^ Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_no&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/noklesta/The-Oslo-Bergen-Tagger|Oslo-Bergen Tagger]] | | ^ Norwegian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/no/index.html#morphology|in English]]%%****%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~EcIww4ecGgOG|list]] | [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]] | |
^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_pl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://sgjp.pl/morfeusz/|Morfeusz]], [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] | | ^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~McUUoI6EwKaC|list]] |[[http://sgjp.pl/morfeusz/|Morfeusz]], [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] | |
^ Portuguese | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_pt&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Portuguese | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~Fis6w6WSYqYg|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_ru&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~Ymey666Kk0qe|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]] | [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_sk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] | | ^ Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]] | [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~mKMiKqM6CqO2|list]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] | |
^ Slovene | ✔ | ✔ | | [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_sl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Slovene | ✔ | ✔ | | [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~FkkKukIsmeue|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Serbian | ✔ | ✔ | [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_sr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Serbian | ✔ | ✔ | [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~bGMCy2o2EwOM|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_es&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~mQYWIgi6yIK4|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_sv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | | ^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~tcGEoMWww0oC|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | |
^ Ukrainian | ✔ | ✔ | | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v14_uk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | | ^ Ukrainian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~IKEKEIm2Auug|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | |
| |
| |
Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//) with corresponding lemmas and tags. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. To find the whole contracted form, enter the query as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma. | Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//) with corresponding lemmas and tags. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. To find the whole contracted form, enter the query as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma. |
| |
Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$". | Morphological tags including characters with a special meaning in regular expressions, e.g. ''$'' in the English tag ''wp%%$%%'', must be preceded in queries by a backslash: ''tag=%%"wp\$"%%''. |
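The escaping rule above can be sketched in a few lines of Python. This is only an illustration, not part of KonText: the helper names ''escape_tag'' and ''tag_query'' are hypothetical, and the set of characters treated as special is an assumption that may need adjusting for the query engine actually used.

```python
# Characters assumed to have a special meaning in regular expressions.
# This set is an assumption for illustration, not taken from KonText.
REGEX_SPECIAL = set(".^$*+?()[]{}|\\")


def escape_tag(tag: str) -> str:
    """Put a backslash before every regex-special character in a tag."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in tag)


def tag_query(tag: str) -> str:
    """Build a tag="..." condition with the tag value escaped (hypothetical helper)."""
    return f'tag="{escape_tag(tag)}"'


if __name__ == "__main__":
    # The English tag wp$ from the note above
    print(tag_query("wp$"))  # tag="wp\$"
```

Tags without special characters pass through unchanged, so the helper can be applied to any tag before it is placed in a query.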
=====Structural attributes===== | =====Structural attributes===== |
| |
* [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages | * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages |
* [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík) | * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík) |
* [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička) | |
* [[http://nl2.ijs.si/analyze/|totale]] for Slovene (until Release 11, thanks to Tomaž Erjavec) | * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (until Release 11, thanks to Tomaž Erjavec) |
* [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] for German | * [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] for German |
When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: | When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: |
| |
Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 14 of 31 January 2022//. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/ | Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 15 of 11 November 2022//. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/ |
| |
</WRAP> | </WRAP> |