====== InterCorp Release 16 ======

^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
^ Positions ^ Number of tokens | 154 512 254 | 363 685 460 | 464 653 933 | 5 840 602 221 |
^ ::: ^ Number of word forms | 124 679 582 | 272 862 335 | 386 728 679 | 4 505 550 764 |
^ Structural attributes ^ Number of documents | 1 812 | 33 | 4 643 | 338 |
^ ::: ^ Number of texts | 1 812 | 162 612 | 4 643 | 2 662 665 |
^ ::: ^ Number of sentences | 10 691 339 | 50 729 559 | 28 684 678 | 790 046 584 |
^ Further information ^ reference | YES ^^^^
^ ::: ^ representative | NO ^^^^
^ ::: ^ publication date | 2023 ^^^^
^ ::: ^ foreign languages | 61 ^^^^
^ ::: ^ tagged languages | 27 ^^^^
^ ::: ^ lemmatized languages | 25 ^^^^

===== Access to the texts =====

After [[https://www.korpus.cz/signup|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]]; for one of the ICNC corpora there is also a tutorial [[en:kurz:uvod|in English]], and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary in English]].

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files containing shuffled pairs of sentences. Please contact [[alexandr.rosen@ff.cuni.cz|Alexandr Rosen]] if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).
===== Texts in the corpus =====

The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called **collections**. The choice in the present release includes:

  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus
  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
  * Translations of the Bible

These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size from the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of the available part of InterCorp in release 16, published in October 2023, is 387 mil. words in the aligned foreign-language texts in the core part and 4 506 mil. words in the collections. The number of words in the Czech texts is 125 mil.
in the core part and 273 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words. [{{:cnk:intercorp:intercorp_wordcounts_v16.png?1000|Setup of the parallel corpus – the core and collections}}] \\ [{{:cnk:intercorp:intercorp_wordcounts2_v16.png?1000|Setup of the parallel corpus – the core}}] \\ [{{:cnk:intercorp:intercorp_wordcounts3_v16.png?1000|Setup of the parallel corpus – collections}}] ===== Corpus size in thousands of words ===== ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ ^ af ^ Afrikaans | 0 | 0 | 0 | 0 | 0 | 136 | 0 | 136 | ^ ar ^ Arabic | 34 | 384 | 0 | 0 | 0 | 126 157 | 0 | 126 576 | ^ be ^ Belarusian | 7 131 | 0 | 0 | 0 | 0 | 0 | 0 | 7 131 | ^ bg ^ Bulgarian | 7 068 | 0 | 0 | 13 577 | 9 083 | 165 092 | 0 | 194 820 | ^ bn ^ Bengali | 0 | 0 | 0 | 0 | 0 | 1 554 | 0 | 1 554 | ^ br ^ Breton | 0 | 0 | 0 | 0 | 0 | 98 | 0 | 98 | ^ bs ^ Bosnian | 0 | 0 | 0 | 0 | 0 | 58 758 | 0 | 58 758 | ^ ca ^ Catalan | 10 112 | 0 | 0 | 0 | 0 | 2 735 | 736 | 13 582 | ^ cs ^ Czech | 124 680 | 4 717 | 2 312 | 19 214 | 12 917 | 233 139 | 563 | 397 542 | ^ da ^ Danish | 9 548 | 0 | 0 | 20 313 | 13 916 | 71 825 | 657 | 116 259 | ^ de ^ German | 40 679 | 5 067 | 2 483 | 20 610 | 13 089 | 98 566 | 724 | 181 219 | ^ el ^ Greek | 0 | 0 | 0 | 23 853 | 15 404 | 162 561 | 0 | 201 818 | ^ en ^ English | 42 395 | 5 273 | 2 670 | 22 902 | 15 576 | 280 335 | 730 | 369 882 | ^ eo ^ Esperanto | 0 | 0 | 0 | 0 | 0 | 226 | 0 | 226 | ^ es ^ Spanish | 30 661 | 6 074 | 2 859 | 26 262 | 16 249 | 223 134 | 0 | 305 240 | ^ et ^ Estonian | 79 | 0 | 0 | 14 896 | 10 899 | 54 514 | 0 | 80 388 | ^ eu ^ Basque | 0 | 0 | 0 | 0 | 0 | 3 022 | 0 | 3 022 | ^ fa ^ Persian | 0 | 0 | 0 | 0 | 0 | 33 167 | 0 | 33 167 | ^ fi ^ Finnish | 6 959 | 0 | 0 | 15 269 | 10 108 | 90 471 | 543 | 123 349 | ^ fr ^ French | 
24 361 | 5 896 | 3 046 | 26 200 | 17 179 | 181 433 | 764 | 258 879 | ^ gl ^ Galician | 0 | 0 | 0 | 0 | 0 | 623 | 0 | 623 | ^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 130 143 | 0 | 130 143 | ^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 432 | 0 | 841 | ^ hr ^ Croatian | 24 529 | 0 | 0 | 0 | 0 | 137 966 | 571 | 163 066 | ^ hs ^ Upper Sorbian | 466 | 0 | 0 | 0 | 0 | 0 | 0 | 466 | ^ hu ^ Hungarian | 6 921 | 8 | 0 | 17 852 | 12 198 | 141 691 | 0 | 178 670 | ^ hy ^ Armenian | 0 | 0 | 0 | 0 | 0 | 24 | 0 | 24 | ^ id ^ Indonesian | 0 | 0 | 0 | 0 | 0 | 38 343 | 0 | 38 343 | ^ is ^ Icelandic | 0 | 0 | 0 | 0 | 0 | 7 375 | 0 | 7 375 | ^ it ^ Italian | 18 086 | 1 389 | 2 747 | 23 771 | 15 494 | 163 622 | 684 | 225 793 | ^ ja ^ Japanese | 3 818 | 2 | 0 | 0 | 0 | 12 485 | 0 | 16 305 | ^ ka ^ Georgian | 0 | 0 | 0 | 0 | 0 | 889 | 0 | 889 | ^ kk ^ Kazakh | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 14 | ^ ko ^ Korean | 0 | 0 | 0 | 0 | 0 | 5 980 | 0 | 5 980 | ^ lt ^ Lithuanian | 696 | 0 | 0 | 17 316 | 11 213 | 5 269 | 471 | 34 964 | ^ lv ^ Latvian | 3 636 | 0 | 0 | 17 533 | 11 682 | 2 053 | 537 | 35 441 | ^ mk ^ Macedonian | 8 881 | 0 | 0 | 0 | 0 | 15 595 | 0 | 24 476 | ^ ml ^ Malayalam | 0 | 0 | 0 | 0 | 0 | 1 281 | 0 | 1 281 | ^ ms ^ Malay | 0 | 0 | 0 | 0 | 0 | 7 939 | 0 | 7 939 | ^ mt ^ Maltese | 0 | 0 | 0 | 13 935 | 0 | 0 | 0 | 13 935 | ^ nl ^ Dutch | 18 782 | 812 | 2 953 | 23 416 | 15 558 | 170 979 | 717 | 233 217 | ^ no ^ Norwegian | 8 221 | 0 | 0 | 0 | 0 | 39 807 | 724 | 48 752 | ^ pl ^ Polish | 28 597 | 0 | 2 380 | 19 604 | 12 817 | 169 498 | 583 | 233 480 | ^ pt ^ Portuguese | 7 285 | 739 | 2 782 | 24 598 | 15 193 | 229 515 | 706 | 280 818 | ^ rn ^ Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | ^ ro ^ Romanian | 4 219 | 0 | 2 738 | 8 092 | 9 446 | 212 396 | 0 | 236 890 | ^ ru ^ Russian | 12 387 | 4 302 | 0 | 0 | 0 | 104 609 | 565 | 121 864 | ^ si ^ Sinhala | 0 | 0 | 0 | 0 | 0 | 2 346 | 0 | 2 346 | ^ sk ^ Slovak | 8 586 | 0 | 0 | 18 399 | 12 727 | 34 581 | 561 | 74 854 | ^ sl ^ Slovene | 4 636 | 0 | 0 | 18 515 
| 12 241 | 83 000 | 0 | 118 392 |
^ sq ^ Albanian | 0 | 0 | 0 | 0 | 0 | 9 351 | 0 | 9 351 |
^ sr ^ Serbian | 12 706 | 0 | 0 | 0 | 0 | 152 636 | 0 | 165 342 |
^ sv ^ Swedish | 19 740 | 0 | 0 | 19 542 | 13 784 | 81 548 | 638 | 135 252 |
^ ta ^ Tamil | 0 | 0 | 0 | 0 | 0 | 104 | 0 | 104 |
^ te ^ Telugu | 0 | 0 | 0 | 0 | 0 | 96 | 0 | 96 |
^ th ^ Thai | 0 | 0 | 0 | 0 | 0 | 5 660 | 0 | 5 660 |
^ tl ^ Tagalog | 0 | 0 | 0 | 0 | 0 | 38 | 0 | 38 |
^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 149 892 | 0 | 149 892 |
^ uk ^ Ukrainian | 14 849 | 0 | 0 | 0 | 0 | 2 938 | 596 | 18 382 |
^ ur ^ Urdu | 0 | 0 | 0 | 0 | 0 | 158 | 0 | 158 |
^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 22 298 | 0 | 22 298 |
^ zh ^ Chinese | 238 | 838 | 0 | 0 | 0 | 71 331 | 0 | 72 407 |
^ **TOTAL** ^ | 511 408 | 35 503 | 26 971 | 425 670 | 276 772 | 4 001 428 | 12 069 | 5 289 821 |

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

==== Number of texts in the Core ====

^ Language ^^ Number of texts ^ including originals ^
^ ar ^ Arabic | 3 | 1 |
^ be ^ Belarusian | 108 | 14 |
^ bg ^ Bulgarian | 87 | 19 |
^ ca ^ Catalan | 92 | 1 |
^ cs ^ Czech | 1 812 | 368 |
^ da ^ Danish | 93 | 9 |
^ de ^ German | 471 | 163 |
^ en ^ English | 422 | 271 |
^ es ^ Spanish | 355 | 142 |
^ et ^ Estonian | 1 | 0 |
^ fi ^ Finnish | 112 | 36 |
^ fr ^ French | 277 | 126 |
^ hi ^ Hindi | 7 | 2 |
^ hr ^ Croatian | 324 | 37 |
^ hs ^ Upper Sorbian | 13 | 5 |
^ hu ^ Hungarian | 89 | 1 |
^ it ^ Italian | 171 | 26 |
^ ja ^ Japanese | 35 | 15 |
^ lt ^ Lithuanian | 23 | 4 |
^ lv ^ Latvian | 73 | 15 |
^ mk ^ Macedonian | 108 | 4 |
^ nl ^ Dutch | 215 | 52 |
^ no ^ Norwegian | 102 | 23 |
^ pl ^ Polish | 348 | 54 |
^ pt ^ Portuguese | 87 | 24 |
^ rn ^ Romani | 2 | 2 |
^ ro ^ Romanian | 45 | 5 |
^ ru ^ Russian | 160 | 37 |
^ sk ^ Slovak | 165 | 62 |
^ sl ^ Slovene | 73 | 25 |
^ sr ^ Serbian | 148 | 13 |
^ sv ^ Swedish | 232 | 101 |
^ uk ^ Ukrainian | 199 | 8 |
^ zh ^ Chinese | 3 | 3 |
^ **TOTAL** ^ | 6 455
| 1 668 |

===== Morphosyntactic annotation =====

Texts in the following languages have received some morphosyntactic annotation. The format, and often even the meaning, of the categories encoded in the morphosyntactic tags differ across most languages. Thus for each tagged language we provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface.

^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tags in the corpus ^ Tool ^
^ Belarusian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/be/index.html#morphology|in English]]%%****%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~WUgyKq0a2I2I|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] |
^ Bulgarian | ✔ | ✔ | [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]] | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/BTB-TR03_BulTreeBank_morphosyntactic_tag.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~deauEUMQSay2|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] |
^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~xIQI46GMkQMc|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] |
^ Chinese | ✔ | | [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]] | [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~Qy0WEKcyKCAG|list]] | [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]] |
^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|in English]] |
[[https://www.korpus.cz/kontext/wordlist/result?q=~ve6ySioUWoQo|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | ^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~dWMc6cC2mEYI|list]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | ^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~58AMOGUAOg6I|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ English | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~AoIeKE4AOIoO|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~OYogQQcMUc86|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~BwiUqc2SoaKY|list]] |[[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | ^ French | ✔ | ✔ | [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html|in English]] | | 
[[https://www.korpus.cz/kontext/wordlist/result?q=~MEY8qsoECM42|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]] %%**%%) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~gs4MCm8iuEea|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | ^ Hungarian | ✔ | | | [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~CCeWgGmqmcqi|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | ^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | [[http://nlp.cs.ru.is/pdf/Tagset.pdf|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~OSQqSoscsiiG|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | ^ Italian | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~AG82UCM6swiK|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ Japanese | ✔ | ✔ | [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~v8EQwWqiygis|list]] | [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]] | ^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~NiGIW6iec6eq|list]] | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | ^ Norwegian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://universaldependencies.org/no/index.html#morphology|in English]]%%****%%) | 
[[https://www.korpus.cz/kontext/wordlist/result?q=~I6aemQOK8yiU|list]] | [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]] | ^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~ReKM6qg4Ic8W|list]] |[[http://sgjp.pl/morfeusz/|Morfeusz]], [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] | ^ Portuguese | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~saGaiAI0uEMo|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~T2sc4y6Uw2WO|list]] |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | ^ Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]] | [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~qkQQs4cq2IyG|list]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] | ^ Slovene | ✔ | ✔ | | [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~jQMEsa8MuCQm|list]] | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] | ^ Serbian | ✔ | ✔ | [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]] | [[https://www.korpus.cz/kontext/wordlist/result?q=~3C8YOAWM0IIC|list]] | 
[[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]] |
^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~twEuIaMu4sSQ|list]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] |
^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[https://www.korpus.cz/kontext/wordlist/result?q=~hOAuiSoQMGQe|list]] | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] |
^ Ukrainian | ✔ | ✔ | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%) | [[https://www.korpus.cz/kontext/wordlist/result?q=~iQ0owcu4o2eQ|list]] | [[http://ufal.mff.cuni.cz/udpipe/2|UDPipe]] |

%%*%%) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].

%%**%%) Within a single morphological tag, a colon rather than a period is used to separate the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.

%%***%%) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as "P-". All tags, as used in the corpus, are listed in the brief description.

%%****%%) The tag is in the UD (Universal Dependencies) format; components of the tag are separated by a vertical bar (|), e.g. the form школы in genitive singular is tagged as: ''NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing''.
A query can be specified in the same way as for other languages, treating the tag as a string, i.e. ''[tag=%%"NOUN.*Case=Gen\|Gender=Fem.*"%%]'', or the tag components can be specified separately: ''[tag=%%"Case=Gen"%% & tag=%%"NOUN"%% & tag=%%"Gender=Fem"%%]'' (the order of categories is not significant). The result is identical in either case.

In some other languages, too, the tag formats given in the tagset descriptions differ from those actually used in the corpus. If you are not sure, please check the tag format before making a tag query. You can list all tags used in the corpus for a given language: see the column **Tags in the corpus** in the table above. Alternatively, on a page displaying results, open the **View/Corpus-specific settings...** menu, check the //tag// option in the **Positional attributes** box and choose the //for each token// option in the **Viewing options** box.

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as //can't// or //I'm//, which the tagger splits into two parts (//ca//+//n't// and //I//+//'m//), each with its own lemma and tag. The same applies to Polish forms such as //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors, as in //gdzie ś za Wisłą//, where //gdzieś// is not a contraction. A query intended to find the whole contracted form should be entered as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.

In morphological tags, characters with a special meaning in regular expressions, e.g. ''$'' in the English tag ''wp%%$%%'', must be preceded by a backslash in queries: ''tag=%%"wp\$"%%''.
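
The tag query conventions described above can be illustrated by a short Python sketch that only assembles query strings; it does not contact KonText. The function names (''escape_tag'', ''ud_components_query'', ''ud_string_query'') are hypothetical helpers introduced for this illustration, and the string variant assumes that the listed features are adjacent in the tag and given in the order in which they appear there.

```python
import re

# Characters with a special meaning in regular expressions.
_SPECIAL = re.compile(r'([\\.^$*+?()\[\]{}|])')


def escape_tag(value: str) -> str:
    """Put a backslash before every regex metacharacter in a literal tag value."""
    return _SPECIAL.sub(r'\\\1', value)


def ud_components_query(*components: str) -> str:
    """One token with each UD tag component tested separately (any order)."""
    conditions = " & ".join(f'tag="{escape_tag(c)}"' for c in components)
    return f"[{conditions}]"


def ud_string_query(pos: str, *features: str) -> str:
    """One token with the tag treated as a single string.

    Assumption: the features are adjacent in the tag and listed in tag order.
    """
    body = r"\|".join(escape_tag(f) for f in features)
    return f'[tag="{escape_tag(pos)}.*{body}.*"]'


if __name__ == "__main__":
    print(ud_components_query("Case=Gen", "NOUN", "Gender=Fem"))
    print(ud_string_query("NOUN", "Case=Gen", "Gender=Fem"))
    print(f'tag="{escape_tag("wp$")}"')
```

Run as a script, this prints the two query variants shown above for the UD tag and the escaped English tag ''tag=%%"wp\$"%%''. Escaping every value before it is placed in a query keeps tags such as ''wp%%$%%'' from being read as regular expressions.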
=====Structural attributes===== ^Structure^Attribute^Description^Values^ |doc|doc.id|document identifier| author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP_year / _SUBTITLES / _SYNDICATE_year / _OT / _NT | |text|text.id|text identifier|author's_last_name-shortened_title:0 / _ACQUIS:number / _EUROPARL:number / _PRESSEUROP:number / _SUBTITLES:number / _SYNDICATE_year:name / _OT:book / _NT:book | | |text.author|author|last name, first name| | |text.title|full title|text| | |text.lang|language|ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh| | |text.version|version|number| | |text.group|core/collection| Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible | | |text.publisher|publisher|text| | |text.pubplace|publication place|text| | |text.pubDateYear|publication year|number| | |text.pubDateMonth|publication month|number| | |text.origyear|original creation year|number| | |text.isbn|ISBN|number| | |text.txtype|text type|discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious | | |text.comment|comment|text| | |text.original|original version?|Yes / No| | |text.srclang|language of the original|ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh| | |text.translator|translator|last name, first name| | |text.transsex|translator's gender|F / M| | |text.authsex|author's gender|F / M| | |text.transcomment|translation comment|text| | 
|text.collectiontitle|collection title|text|
| |text.volume|volume number|number|
| |text.pages|number of pages|number|
| |text.lang_var|language variety|de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-RS |
| |text.wordcount|number of words|number|
|div|div.id|division identifier (Bible)| _NT / _OT:chapter |
| |div.type|division type|chapter|
|p|p.id|paragraph identifier|doc:text:div:par|
|s|s.id|sentence identifier|doc:text:div:par:sent|
|hi|hi.rend|typeface|italic / bold / bold italic|
|lb|lb.id|verse identifier (Bible)|book:chapter:verse|

===== Acknowledgements =====

We are grateful for the opportunity to use the following texts and software:

==== Texts: ====

  * The latest (13th corrected) edition of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš.
  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen
  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
  * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server
  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus
  * Proceedings of the European Parliament from the [[http://www.statmt.org/europarl/|EuroParl]] corpus
  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]
  * Short stories in a number of languages [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]] from [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]
  * A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of
Translation in more languages – with special thanks to Patrick Corness * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]] * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]] ==== Pre-processing ==== * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]] * Sentence splitter for Czech by Pavel Květoň * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]] ==== Taggers/lemmatizers: ==== * [[http://ufal.mff.cuni.cz/morfflex|MorfFlex]], [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] and [[https://is.cuni.cz/webapps/zzp/download/140018093/?back_id=10|LanGr]] for Czech * [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish * [[http://sgjp.pl/morfeusz/|Morfeusz]] and [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] for Polish * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík) * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (until Release 11, thanks to Tomaž Erjavec) * [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] for German * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter) * 
[[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
  * [[https://github.com/clarinsi/reldi-tagger|ReLDI tagger]] for Croatian, Serbian((Ljubešić, N., Klubička, F., Agić, Ž., and Jazbec, I.-P. (2016). New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Calzolari, N. et al., editors, //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)//, Paris, France. European Language Resources Association (ELRA).)) and Slovene((Ljubešić, N. and Erjavec, T. (2016). Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene. In Calzolari, N. et al., editors, //Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)//, Paris, France. European Language Resources Association (ELRA).)) (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]])
  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)
  * [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] for Belarusian and Ukrainian (thanks to Bohdan Moskalevskyi)
  * [[https://taku910.github.io/mecab/|MeCab]] and [[https://osdn.net/projects/unidic/|Unidic]] for Japanese (thanks to Adam Nohejl)
  * [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar]] for Chinese (thanks to Vlastimil Dobečka)

===== How to cite =====

If you publish results based on InterCorp, we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).

For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.

When citing a specific part of InterCorp, please use the reference displayed in KonText in the corpus description, e.g.:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 15 of 11 November 2022//. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/

===== See also =====

[[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze15|Version 15]] • [[en:cnk:intercorp:verze14|Version 14]] • [[en:cnk:intercorp:verze13ud|Version 13ud]] • [[en:cnk:intercorp:verze13|Version 13]] • [[en:cnk:intercorp:verze12|Version 12]] • [[en:cnk:intercorp:verze11|Version 11]] • [[en:cnk:intercorp:verze10|Version 10]] • [[en:cnk:intercorp:verze9|Version 9]] • [[en:cnk:intercorp:verze8|Version 8]] • [[en:cnk:intercorp:verze7|Version 7]] • [[en:cnk:intercorp:verze6|Version 6]] • [[en:cnk:intercorp:verze5|Version 5]] • [[en:cnk:intercorp:verze4|Version 4]] • [[en:cnk:intercorp:verze3|Version 3]] • [[en:cnk:intercorp:historie|Version history]]

See [[https://intercorp.korpus.cz/?lang=en|the original InterCorp site in English]].