en:cnk:intercorp:verze14 [2022/01/14 15:16] – created alexandrrosen | en:cnk:intercorp:verze14 [2024/04/18 16:00] (current) – [Morphosyntactic annotation] michalkren |
---|
~~NOTOC~~ | |
====== InterCorp Release 14 ====== | ====== InterCorp Release 14 ====== |
| |
numbers: TODO! | |
| |
^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ | ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ |
^ Positions ^ Number of tokens | 141,032,521 | 116,673,043 | 394,042,551 | 1,550,071,364 | | ^ Positions ^ Number of tokens | 145,640,866 | 116,673,038 | 418,967,492 | 1,548,425,287 | |
^ ::: ^ Number of word forms | 113,838,505 | 89,819,773 | 327,968,369 | 1,223,270,610 | | ^ ::: ^ Number of word forms | 117,606,467 | 89,819,772 | 348,771,933 | 1,223,221,264 | |
^ Structural attributes ^ Number of documents | 1,657 | 30 | 3,993 | 282 | | ^ Structural attributes ^ Number of documents | 1,708 | 30 | 4,220 | 282 | |
^ ::: ^ Number of texts | 1,657 | 111,951 | 3,993 | 1,843,528 | | ^ ::: ^ Number of texts | 1,708 | 111,951 | 4,220 | 1,843,528 | |
^ ::: ^ Number of sentences | 9,782,001 | 13,606,183 | 24,305,621 | 143,195,566 | | ^ ::: ^ Number of sentences | 10,095,074 | 13,606,183 | 25,872,393 | 143,195,566 | |
^ Further information ^ reference | YES ^^^^ | ^ Further information ^ reference | YES ^^^^ |
^ ::: ^ representative | NO ^^^^ | ^ ::: ^ representative | NO ^^^^ |
| |
A new release of InterCorp is usually published once a year. With each new release, the corpus may grow in size, in the number of languages, and in the extent and quality of annotation. Previous versions remain available (starting with release 6). | A new release of InterCorp is usually published once a year. With each new release, the corpus may grow in size, in the number of languages, and in the extent and quality of annotation. Previous versions remain available (starting with release 6). |
===== References ===== | |
| |
If you publish results based on InterCorp we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper: | |
| |
<WRAP round info 50%> | |
Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 | |
([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]). | |
| |
For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details. | |
| |
When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: | |
| |
Rosen, A., Vavřín, M., Zasina, A. J. (2020). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 13 of 1 November 2020//. Institute of the Czech National Corpus, Charles University, Prague 2020. Available on-line: https://kontext.korpus.cz/ | |
| |
</WRAP> | |
===== Texts in the corpus ===== | ===== Texts in the corpus ===== |
| |
These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. | These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. |
| |
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words. | Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 14 published in January 2022 is 349 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 118 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words. |
| |
| |
[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}] \\ | [{{:cnk:intercorp:intercorp_wordcounts_v14.png|Setup of the parallel corpus – the core and collections}}] \\ |
| |
[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}] \\ | [{{:cnk:intercorp:intercorp_wordcounts2_v14.png|Setup of the parallel corpus – the core}}] \\ |
| |
[{{:cnk:intercorp:intercorp_wordcounts3_v13.png|Setup of the parallel corpus – collections}}] | [{{:cnk:intercorp:intercorp_wordcounts3_v14.png|Setup of the parallel corpus – collections}}] |
| |
===== Corpus size in thousands of words ===== | ===== Corpus size in thousands of words ===== |
^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ | ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ |
^ ar ^ Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | | ^ ar ^ Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | |
^ be ^ Belarusian | 5,718 | 0 | 0 | 0 | 0 | 0 | 0 | 5,718 | | ^ be ^ Belarusian| 6 094 | 0 | 0 | 0 | 0 | 0 | 0 | 6 094 | |
^ bg ^ Bulgarian | 7,068 | 0 | 0 | 13,577 | 9,083 | 0 | 0 | 29,728 | | ^ bg ^ Bulgarian | 7 068 | 0 | 0 | 13 577 | 9 083 | 0 | 0 | 29 728 | |
^ ca ^ Catalan | 7,938 | 0 | 0 | 0 | 0 | 0 | 736 | 8,674 | | ^ ca ^ Catalan | 8 920 | 0 | 0 | 0 | 0 | 0 | 736 | 9 656 | |
^ da ^ Danish | 7,136 | 0 | 0 | 20,313 | 13,916 | 14,429 | 657 | 56,451 | | ^ da ^ Danish | 7 576 | 0 | 0 | 20 313 | 13 916 | 14 429 | 657 | 56 891 | |
^ de ^ German | 37,633 | 4,704 | 2,483 | 20,610 | 13,088 | 8,392 | 724 | 87,634 | | ^ de ^ German | 38 475 | 4 704 | 2 483 | 20 610 | 13 088 | 8 392 | 724 | 88 476 | |
^ el ^ Greek | 0 | 0 | 0 | 23,853 | 15,404 | 23,709 | 0 | 62,966 | | ^ el ^ Greek | 0 | 0 | 0 | 23 853 | 15 404 | 23 709 | 0 | 62 966 | |
^ en ^ English | 33,569 | 4,856 | 2,670 | 22,902 | 15,576 | 52,106 | 730 | 132,409 | | ^ en ^ English | 36 198 | 4 856 | 2 670 | 22 902 | 15 576 | 52 106 | 730 | 135 038 | |
^ es ^ Spanish | 26,554 | 5,614 | 2,859 | 26,262 | 16,249 | 36,650 | 0 | 114,187 | | ^ es ^ Spanish | 28 115 | 5 614 | 2 859 | 26 262 | 16 249 | 36 650 | 0 | 115 748 | |
^ et ^ Estonian | 0 | 0 | 0 | 14,896 | 10,899 | 10,298 | 0 | 36,093 | | ^ et ^ Estonian | 0 | 0 | 0 | 14 896 | 10 899 | 10 298 | 0 | 36 093 | |
^ fi ^ Finnish | 5,656 | 0 | 0 | 15,269 | 10,108 | 15,047 | 543 | 46,622 | | ^ fi ^ Finnish | 6 226 | 0 | 0 | 15 269 | 10 108 | 15 047 | 543 | 47 192 | |
^ fr ^ French | 19,773 | 5,600 | 3,046 | 26,200 | 17,179 | 25,986 | 764 | 98,547 | | ^ fr ^ French | 21 279 | 5 600 | 3 046 | 26 200 | 17 179 | 25 986 | 764 | 100 054 | |
^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221 | 0 | 16,221 | | ^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 0 | 16 221 | |
^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 | | ^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 | |
^ hr ^ Croatian | 21,923 | 0 | 0 | 0 | 0 | 19,048 | 571 | 41,543 | | ^ hr ^ Croatian | 22 736 | 0 | 0 | 0 | 0 | 19 048 | 571 | 42 356 | |
^ hu ^ Hungarian | 6,444 | 0 | 0 | 17,852 | 12,198 | 21,115 | 0 | 57,609 | | ^ hs ^ Upper Sorbian | 110 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | |
^ is ^ Icelandic | 0 | 0 | 0 | 0 | 0 | 1,581 | 0 | 1,581 | | ^ hu ^ Hungarian | 6 444 | 0 | 0 | 17 852 | 12 198 | 21 115 | 0 | 57 609 | |
^ it ^ Italian | 14,525 | 1,252 | 2,747 | 23,771 | 15,494 | 14,700 | 684 | 73,174 | | ^ is ^ Icelandic| 0 | 0 | 0 | 0 | 0 | 1 581 | 0 | 1 581 | |
^ ja ^ Japanese | 2,189 | 0 | 0 | 0 | 0 | 477 | 0 | 2,666 | | ^ it ^ Italian | 15 741 | 1 252 | 2 747 | 23 771 | 15 494 | 14 700 | 684 | 74 389 | |
^ lt ^ Lithuanian | 421 | 0 | 0 | 17,316 | 11,213 | 558 | 471 | 29,979 | | ^ ja ^ Japanese | 3 147 | 0 | 0 | 0 | 0 | 477 | 0 | 3 624 | |
^ lv ^ Latvian | 2,646 | 0 | 0 | 17,522 | 11,682 | 280 | 537 | 32,667 | | ^ lt ^ Lithuanian| 502 | 0 | 0 | 17 316 | 11 213 | 558 | 471 | 30 059 | |
^ mk ^ Macedonian | 8,881 | 0 | 0 | 0 | 0 | 1,877 | 0 | 10,758 | | ^ lv ^ Latvian | 3 031 | 0 | 0 | 17 522 | 11 682 | 280 | 537 | 33 052 | |
^ ms ^ Malay | 0 | 0 | 0 | 0 | 0 | 3,521 | 0 | 3,521 | | ^ mk ^ Macedonian | 8 881 | 0 | 0 | 0 | 0 | 1 877 | 0 | 10 758 | |
^ mt ^ Maltese | 0 | 0 | 0 | 13,935 | 0 | 0 | 0 | 13,935 | | ^ ms ^ Malay | 0 | 0 | 0 | 0 | 0 | 3 521 | 0 | 3 521 | |
^ nl ^ Dutch | 16,216 | 813 | 2,953 | 23,416 | 15,558 | 29,373 | 717 | 89,045 | | ^ mt ^ Maltese | 0 | 0 | 0 | 13 935 | 0 | 0 | 0 | 13 935 | |
^ no ^ Norwegian | 7,727 | 0 | 0 | 0 | 0 | 0 | 722 | 8,449 | | ^ nl ^ Dutch | 16 691 | 813 | 2 953 | 23 416 | 15 558 | 29 373 | 717 | 89 520 | |
^ pl ^ Polish | 26,200 | 0 | 2,380 | 19,604 | 12,817 | 26,576 | 583 | 88,161 | | ^ no ^ Norwegian | 7 818 | 0 | 0 | 0 | 0 | 0 | 722 | 8 540 | |
^ pt ^ Portuguese | 4,981 | 554 | 2,782 | 24,598 | 15,193 | 41,468 | 706 | 90,282 | | ^ pl ^ Polish | 27 669 | 0 | 2 380 | 19 604 | 12 817 | 26 576 | 583 | 89 630 | |
| ^ pt ^ Portuguese | 6 245 | 554 | 2 782 | 24 598 | 15 193 | 41 468 | 706 | 91 546 | |
^ rn ^ Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | | ^ rn ^ Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | |
^ ro ^ Romanian | 4,219 | 0 | 2,738 | 8,092 | 9,446 | 34,128 | 0 | 58,622 | | ^ ro ^ Romanian | 4 219 | 0 | 2 738 | 8 092 | 9 446 | 34 128 | 0 | 58 622 | |
^ ru ^ Russian | 8,642 | 3,984 | 0 | 0 | 0 | 6,887 | 565 | 20,078 | | ^ ru ^ Russian | 10 510 | 3 984 | 0 | 0 | 0 | 6 887 | 565 | 21 946 | |
^ sk ^ Slovak | 8,543 | 0 | 0 | 18,399 | 12,727 | 5,133 | 561 | 45,363 | | ^ sk ^ Slovak | 8 543 | 0 | 0 | 18 399 | 12 727 | 5 133 | 561 | 45 363 | |
^ sl ^ Slovene | 3,871 | 0 | 0 | 18,528 | 12,251 | 17,061 | 0 | 51,711 | | ^ sl ^ Slovene | 4 097 | 0 | 0 | 18 515 | 12 241 | 17 035 | 0 | 51 888 | |
^ sq ^ Albanian | 0 | 0 | 0 | 0 | 0 | 2,003 | 0 | 2,003 | | ^ sq ^ Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 0 | 2 003 | |
^ sr ^ Serbian | 11,582 | 0 | 0 | 0 | 0 | 20,727 | 0 | 32,308 | | ^ sr ^ Serbian | 12 014 | 0 | 0 | 0 | 0 | 20 727 | 0 | 32 741 | |
^ sv ^ Swedish | 15,790 | 0 | 0 | 19,542 | 13,784 | 14,666 | 638 | 64,419 | | ^ sv ^ Swedish | 17 590 | 0 | 0 | 19 542 | 13 784 | 14 666 | 638 | 66 220 | |
^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 21,190 | 0 | 21,190 | | ^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 0 | 21 190 | |
^ uk ^ Ukrainian | 11,459 | 0 | 0 | 0 | 0 | 244 | 596 | 12,299 | | ^ uk ^ Ukrainian | 12 172 | 0 | 0 | 0 | 0 | 244 | 596 | 13 011 | |
^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,474 | 0 | 1,474 | | ^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 474 | 0 | 1 474 | |
^ zh ^ Chinese | 127 | 240 | 0 | 0 | 0 | 2,247 | 0 | 2,614 | | ^ zh ^ Chinese | 202 | 240 | 0 | 0 | 0 | 2 247 | 0 | 2 689 | |
^ **Subtotal** ^| 327,887 | 27,616 | 24,658 | 406,459 | 263,864 | 489,169 | 11,504 | 1,551,157 | | ^ **Subtotal** ^ | 348 770 | 27 617 | 24 658 | 406 444 | 263 855 | 489 146 | 11 505 | 1 571 991 | |
^ cs ^ Czech | 113,839 | 4,351 | 2,310 | 19,085 | 12,908 | 50,604 | 562 | 203,658 | | ^ cs ^ Czech | 117 606 | 4 351 | 2 310 | 19 085 | 12 908 | 50 604 | 562 | 207 426 | |
^ **TOTAL** ^| 441,725 | 31,967 | 26,968 | 425,543 | 276,772 | 539,774 | 12,066 | 1,754,815 | | ^ **TOTAL** ^ | 466 376 | 31 968 | 26 968 | 425 529 | 276 763 | 539 750 | 12 067 | 1 779 417 | |
| |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. |
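The TOTAL row of the table is the Subtotal row (all foreign languages) plus the Czech row, with each Czech text counted once. The following is a minimal sketch of that arithmetic using the release 14 figures copied from the table; the variable and function names are illustrative only and are not part of any InterCorp tool.

```python
# Release 14 figures in thousands of words, copied from the table above.
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis",
           "Europarl", "Subtitles", "Bible", "Total"]
subtotal = [348770, 27617, 24658, 406444, 263855, 489146, 11505, 1571991]
czech = [117606, 4351, 2310, 19085, 12908, 50604, 562, 207426]
total = [466376, 31968, 26968, 425529, 276763, 539750, 12067, 1779417]


def column_sums(a, b):
    """Add two rows of the table column by column."""
    return [x + y for x, y in zip(a, b)]


# Names of columns where Subtotal + Czech does not equal the TOTAL row.
mismatches = [
    name
    for name, s, t in zip(COLUMNS, column_sums(subtotal, czech), total)
    if s != t
]
print(mismatches)
```

An empty list means every column of the TOTAL row agrees with the two rows above it.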
Texts in the following languages have received some morphosyntactic annotation. The format, and often even the meaning, of the categories encoded in the morphosyntactic tags differ from language to language. For each tagged language we therefore provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface. | Texts in the following languages have received some morphosyntactic annotation. The format, and often even the meaning, of the categories encoded in the morphosyntactic tags differ from language to language. For each tagged language we therefore provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface. |
| |
^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tags in the corpus ^ Tool ^ | ^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tool ^ |
^ Belarusian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_be&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | | ^ Belarusian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%) | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | |
^ Bulgarian | ✔ | ✔ | [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]] | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_bg&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Bulgarian | ✔ | ✔ | [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]] | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ca&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Chinese | ✔ | | [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]] | [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_zh&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] | | ^ Chinese | ✔ | | [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]] | [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]] | [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] | |
^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_hr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_cs&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | | ^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | |
^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_nl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ English | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_en&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ English | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_et&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_fi&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | | ^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | |
^ French | ✔ | ✔ | [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_fr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ French | ✔ | ✔ | [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]] | |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_de&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Hungarian | ✔ | | | [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_hu&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ Hungarian | ✔ | | | [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_is&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | | ^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | |
^ Italian | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_it&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Italian | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]] | |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Japanese | ✔ | ✔ | [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ja&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] | | ^ Japanese | ✔ | ✔ | [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]] | | [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] | |
^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_lv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | | ^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | |
^ Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_no&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/noklesta/The-Oslo-Bergen-Tagger|Oslo-Bergen Tagger]] | | ^ Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] | | [[https://github.com/noklesta/The-Oslo-Bergen-Tagger|Oslo-Bergen Tagger]] | |
^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_pl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://sgjp.pl/morfeusz/|Morfeusz]], [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] | | ^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] |[[http://sgjp.pl/morfeusz/|Morfeusz]], [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] | |
^ Portuguese | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_pt&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Portuguese | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ru&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%) |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]] | [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] | | ^ Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]] | [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] | |
^ Slovene | ✔ | ✔ | | [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Slovene | ✔ | ✔ | | [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Serbian | ✔ | ✔ | [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]] | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | | ^ Serbian | ✔ | ✔ | [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | |
^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_es&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sv&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | | ^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | |
^ Ukrainian | ✔ | ✔ | | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_uk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | | ^ Ukrainian | ✔ | ✔ | | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] | |
| |
| |
| |
Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$". | Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$". |
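When tag conditions are built by a script rather than typed by hand, the escaping described above can be automated. The sketch below uses Python's standard `re.escape` to insert the backslashes; the helper name `tag_condition` is hypothetical, and it is an assumption that Python's notion of a special character is close enough to the one used by the KonText query engine (Python may escape a few characters that would not strictly need it).

```python
import re


def tag_condition(tag: str) -> str:
    """Return a condition of the form tag="..." matching the tag literally."""
    # re.escape puts a backslash before regex metacharacters such as "$".
    return 'tag="{}"'.format(re.escape(tag))


print(tag_condition("wp$"))
```

For the English tag from the text, this prints `tag="wp\$"`; a tag without special characters is returned unchanged inside the quotes.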
====Structural attributes==== | =====Structural attributes===== |
| |
^Structure^Attribute^Description^Values^ | ^Structure^Attribute^Description^Values^ |
* [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar]] for Chinese (thanks to Vlastimil Dobečka) | * [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar]] for Chinese (thanks to Vlastimil Dobečka) |
| |
| ===== How to cite ===== |
| |
| If you publish results based on InterCorp we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper: |
| |
| <WRAP round info 50%> |
| Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 |
| ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]). |
| |
| For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details. |
| |
| When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: |
| |
| Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 14 of 31 January 2022//. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/ |
| |
| </WRAP> |
| |
====== See also ====== | ===== See also ===== |
| |
<WRAP round box 51%> | <WRAP round box 51%> |
[[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze12|Version 12]] • [[en:cnk:intercorp:verze11|Version 11]] • [[en:cnk:intercorp:verze10|Version 10]] • [[en:cnk:intercorp:verze9|Version 9]] • [[en:cnk:intercorp:verze8|Version 8]] • [[en:cnk:intercorp:verze7|Version 7]] • [[en:cnk:intercorp:verze6|Version 6]] • [[en:cnk:intercorp:verze5|Version 5]] • [[en:cnk:intercorp:verze4|Version 4]] • [[en:cnk:intercorp:verze3|Version 3]] • [[en:cnk:intercorp:historie|Version history]] | [[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze13ud|Version 13ud]] • [[en:cnk:intercorp:verze13|Version 13]] • [[en:cnk:intercorp:verze12|Version 12]] • [[en:cnk:intercorp:verze11|Version 11]] • [[en:cnk:intercorp:verze10|Version 10]] • [[en:cnk:intercorp:verze9|Version 9]] • [[en:cnk:intercorp:verze8|Version 8]] • [[en:cnk:intercorp:verze7|Version 7]] • [[en:cnk:intercorp:verze6|Version 6]] • [[en:cnk:intercorp:verze5|Version 5]] • [[en:cnk:intercorp:verze4|Version 4]] • [[en:cnk:intercorp:verze3|Version 3]] • [[en:cnk:intercorp:historie|Version history]] |
| |
See [[https://intercorp.korpus.cz/?lang=en|the original InterCorp site in English]]. | See [[https://intercorp.korpus.cz/?lang=en|the original InterCorp site in English]]. |
</WRAP> | </WRAP> |
| |