====== InterCorp Release 13ud – Universal Dependencies ======

^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
^ Positions ^ Number of tokens | 141,032,521 | 116,673,043 | 394,042,551 | 1,550,071,364 |
^ ::: ^ Number of word forms | 113,838,505 | 89,819,773 | 327,968,369 | 1,223,270,610 |
^ Structural attributes ^ Number of documents | 1,657 | 30 | 3,994 | 282 |
^ ::: ^ Number of texts | 1,657 | 111,951 | 3,994 | 1,843,528 |
^ ::: ^ Number of sentences | 9,782,002 | 13,606,198 | 24,318,736 | 143,196,252 |
^ Further information ^ reference | YES ^^^^
^ ::: ^ representative | NO ^^^^
^ ::: ^ publication date | 2021 ^^^^
^ ::: ^ foreign languages | 40 ^^^^
^ ::: ^ tagged languages | 35 ^^^^
^ ::: ^ lemmatized languages | 35 ^^^^
^ ::: ^ syntactically annotated languages| 35 ^^^^

===== Access to the texts =====

After [[https://www.korpus.cz/signup|registration]], the corpus can be searched through a web interface. Registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]]; for one of the ICNC corpora there is also a tutorial [[en:kurz:uvod|in English]], and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary in English]].

After signing a non-profit licence agreement, texts from InterCorp can also be obtained as bilingual files containing shuffled pairs of sentences. Please contact [[alexandr.rosen@ff.cuni.cz|Alexandr Rosen]] if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow.
Previous versions remain available (starting with release 6).

The linguistic annotation of release 13ud is based on the [[https://universaldependencies.org|Universal Dependencies]] scheme.

===== Main differences between releases 13 and 13ud =====

  * In release 13ud, out of the total number of 41 languages (including Czech), **36 are linguistically annotated**; in addition, all such languages are **syntactically annotated**.
  * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]).
  * For a detailed description of UD as used in the annotation of InterCorp see [[en:pojmy:ud|Universal Dependencies]].
  * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, i.e. all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/|UDPipe]]. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.))

===== Texts in the corpus =====

InterCorp release 13ud contains the **same texts** as InterCorp release 13. They **differ only in linguistic annotation**. However, the token and word counts in release 13ud may differ slightly due to a different tokenization method.

The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called **collections**. The selection in the present release includes:

  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus
  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
  * Translations of the Bible

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size when compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart.
As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of the available part of InterCorp in release 13, published in November 2020, is 328 million words in the aligned foreign-language texts in the core part and 1,223 million words in the collections. The number of words in the Czech texts is 114 million in the core part and 90 million in the collections (see [[en:cnk:intercorp:historie|Version history]]).

The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}] \\
[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}] \\
[{{:cnk:intercorp:intercorp_wordcounts3_v13.png|Setup of the parallel corpus – collections}}]

===== Corpus size in thousands of words =====

^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^
^ ar ^ Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
^ be ^ Belarusian | 5,718 | 0 | 0 | 0 | 0 | 0 | 0 | 5,718 |
^ bg ^ Bulgarian | 7,068 | 0 | 0 | 13,577 | 9,083 | 0 | 0 | 29,728 |
^ ca ^ Catalan | 7,938 | 0 | 0 | 0 | 0 | 0 | 736 | 8,674 |
^ da ^ Danish | 7,136 | 0 | 0 | 20,313 | 13,916 | 14,429 | 657 | 56,451 |
^ de ^ German | 37,633 | 4,704 | 2,483 | 20,610 | 13,088 | 8,392 | 724 | 87,634 |
^ el ^ Greek | 0 | 0 | 0 | 23,853 | 15,404 | 23,709 | 0 | 62,966 |
^ en ^ English | 33,569 | 4,856 | 2,670 | 22,902 | 15,576 | 52,106 | 730 | 132,409 |
^ es ^ Spanish | 26,554 | 5,614 | 2,859 | 26,262 | 16,249 | 36,650 | 0 | 114,187 |
^ et ^ Estonian | 0 | 0 | 0 | 14,896 | 10,899 | 10,298 | 0 | 36,093 |
^ fi ^ Finnish | 5,656 | 0 | 0 | 15,269 | 10,108 | 15,047 | 543 | 46,622 |
^ fr ^ French | 19,773 | 5,600 | 3,046 | 26,200 | 17,179 | 25,986 | 764 | 98,547 |
^ he ^ Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221 | 0 | 16,221 |
^ hi ^ Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 |
^ hr ^ Croatian | 21,923 | 0 | 0 | 0 | 0 | 19,048 | 571 | 41,543 |
^ hu ^ Hungarian | 6,444 | 0 | 0 | 17,852 | 12,198 | 21,115 | 0 | 57,609 |
^ //is// ^ //Icelandic// | 0 | 0 | 0 | 0 | 0 | 1,581 | 0 | 1,581 |
^ it ^ Italian | 14,525 | 1,252 | 2,747 | 23,771 | 15,494 | 14,700 | 684 | 73,174 |
^ ja ^ Japanese | 2,189 | 0 | 0 | 0 | 0 | 477 | 0 | 2,666 |
^ lt ^ Lithuanian | 421 | 0 | 0 | 17,316 | 11,213 | 558 | 471 | 29,979 |
^ lv ^ Latvian | 2,646 | 0 | 0 | 17,522 | 11,682 | 280 | 537 | 32,667 |
^ //mk// ^ //Macedonian// | 8,881 | 0 | 0 | 0 | 0 | 1,877 | 0 | 10,758 |
^ //ms// ^ //Malay// | 0 | 0 | 0 | 0 | 0 | 3,521 | 0 | 3,521 |
^ mt ^ Maltese | 0 | 0 | 0 | 13,935 | 0 | 0 | 0 | 13,935 |
^ nl ^ Dutch | 16,216 | 813 | 2,953 | 23,416 | 15,558 | 29,373 | 717 | 89,045 |
^ no ^ Norwegian | 7,727 | 0 | 0 | 0 | 0 | 0 | 722 | 8,449 |
^ pl ^ Polish | 26,200 | 0 | 2,380 | 19,604 | 12,817 | 26,576 | 583 | 88,161 |
^ pt ^ Portuguese | 4,981 | 554 | 2,782 | 24,598 | 15,193 | 41,468 | 706 | 90,282 |
^ //rn// ^ //Romani// | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
^ ro ^ Romanian | 4,219 | 0 | 2,738 | 8,092 | 9,446 | 34,128 | 0 | 58,622 |
^ ru ^ Russian | 8,642 | 3,984 | 0 | 0 | 0 | 6,887 | 565 | 20,078 |
^ sk ^ Slovak | 8,543 | 0 | 0 | 18,399 | 12,727 | 5,133 | 561 | 45,363 |
^ sl ^ Slovene | 3,871 | 0 | 0 | 18,528 | 12,251 | 17,061 | 0 | 51,711 |
^ //sq// ^ //Albanian// | 0 | 0 | 0 | 0 | 0 | 2,003 | 0 | 2,003 |
^ sr ^ Serbian | 11,582 | 0 | 0 | 0 | 0 | 20,727 | 0 | 32,308 |
^ sv ^ Swedish | 15,790 | 0 | 0 | 19,542 | 13,784 | 14,666 | 638 | 64,419 |
^ tr ^ Turkish | 0 | 0 | 0 | 0 | 0 | 21,190 | 0 | 21,190 |
^ uk ^ Ukrainian | 11,459 | 0 | 0 | 0 | 0 | 244 | 596 | 12,299 |
^ vi ^ Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,474 | 0 | 1,474 |
^ zh ^ Chinese | 127 | 240 | 0 | 0 | 0 | 2,247 | 0 | 2,614 |
^ **Subtotal** ^| 327,887 | 27,616 | 24,658 | 406,459 | 263,864 | 489,169 | 11,504 | 1,551,157 |
^ cs ^ Czech | 113,839 | 4,351 | 2,310 | 19,085 | 12,908 | 50,604 | 562 | 203,658 |
^ **TOTAL** ^| 441,725 | 31,967 | 26,968 | 425,543 | 276,772 | 539,774 | 12,066 | 1,754,815 |

N.B. 1: Languages printed in //italics// have no linguistic annotation.

N.B. 2: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

===== Acknowledgements =====

We are grateful for the possibility to use the following texts and software:

==== Texts: ====

  * The latest (13th corrected) edition of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš.
  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen
  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
  * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server
  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus
  * Proceedings of the European Parliament from the [[http://www.statmt.org/europarl/|EuroParl]] corpus
  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]
  * Short stories in a number of languages, [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]], from the [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]
  * A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's //The Art of Translation// in several languages – with special thanks to Patrick Corness
  * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus
  * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus
  * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]]
  * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]]

==== Pre-processing ====

  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
  * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
  * Sentence splitter for Czech by Pavel Květoň
  * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]]

==== Linguistic annotation ====

  * [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

===== How to cite =====

If you publish results based on InterCorp, we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).

For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.

When citing a specific part of InterCorp, please use the reference displayed in KonText in the corpus description, e.g.:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 13ud of 22 December 2021//. Institute of the Czech National Corpus, Charles University, Prague 2021.
Available on-line: https://kontext.korpus.cz/