InterCorp Release 16ud – Universal Dependencies

Name | Czech – core | Czech – collections | other – core | other – collections
Positions
Number of tokens | 154 512 254 | 363 685 460 | 464 653 933 | 5 840 602 221
Number of word forms | 124 679 582 | 272 862 335 | 386 728 679 | 4 505 550 764
Structural attributes
Number of documents | 1 812 | 33 | 4 643 | 338
Number of texts | 1 812 | 162 612 | 4 643 | 2 662 665
Number of sentences | 10 691 339 | 50 729 559 | 28 684 678 | 790 046 584
Further information
reference | YES
representative | NO
publication date | 2024
foreign languages | 61
tagged languages | 47
lemmatized languages | 47
syntactically annotated languages | 47

Please note that the currently available release of InterCorp v16ud is meant mainly for testing and includes only the Core part (see Texts in the corpus below).

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).

Main features of release 16ud

  • For a detailed description of UD as used in the annotation of InterCorp see the Universal Dependencies entry in the glossary.
  • After 13ud, 16ud is the second release of InterCorp featuring linguistic annotation according to the Universal Dependencies scheme.
  • Release 16ud is the first CNC corpus to feature the metrics of syntactic complexity and lexical diversity.
  • In release 16ud, out of the total number of 62 languages (including Czech), 47 are linguistically annotated; in addition, all such languages are syntactically annotated.
  • Texts are annotated in the same way in all languages, according to the UD standard (Universal Dependencies).
  • Annotation was performed for all languages by UDPipe, based on the data created in the UD project.1)
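UD annotation is conventionally exchanged in the CoNLL-U format, with ten tab-separated columns per token. The following minimal sketch reads one hand-made sentence (not taken from InterCorp) and extracts the form, lemma, UPOS tag, head and dependency relation. The two measures at the end, mean dependency distance and type-token ratio, are only illustrative examples of syntactic complexity and lexical diversity; the definitions actually used in release 16ud are not given on this page, and the names `Token`, `parse_conllu_sentence`, `mean_dependency_distance` and `type_token_ratio` are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # position in the sentence (CoNLL-U column ID)
    form: str     # word form (FORM)
    lemma: str    # lemma (LEMMA)
    upos: str     # universal part-of-speech tag (UPOS)
    head: int     # index of the syntactic head, 0 for the root (HEAD)
    deprel: str   # dependency relation to the head (DEPREL)


# Hand-made sample sentence in CoNLL-U (illustrative, not from InterCorp).
SAMPLE = "\n".join([
    "# text = Dogs bark loudly.",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tloudly\tloudly\tADV\t_\t_\t2\tadvmod\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])


def parse_conllu_sentence(block: str) -> list[Token]:
    """Read the basic token lines of one CoNLL-U sentence."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            int(cols[6]), cols[7]))
    return tokens


def mean_dependency_distance(tokens: list[Token]) -> float:
    """Average linear distance between a token and its head (root excluded)."""
    dists = [abs(t.idx - t.head) for t in tokens if t.head != 0]
    return sum(dists) / len(dists) if dists else 0.0


def type_token_ratio(tokens: list[Token]) -> float:
    """Distinct lower-cased word forms divided by all word forms (no punctuation)."""
    words = [t.form.lower() for t in tokens if t.upos != "PUNCT"]
    return len(set(words)) / len(words) if words else 0.0


if __name__ == "__main__":
    for tok in parse_conllu_sentence(SAMPLE):
        print(tok.form, tok.lemma, tok.upos, tok.head, tok.deprel)
```

Because the same column layout applies to every UD-annotated language, a reader of this kind does not need language-specific branches.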

Texts in the corpus

InterCorp release 16ud contains the same texts as InterCorp release 16. They differ only in linguistic annotation. However, the token and word count data in release 16ud may differ slightly due to a different tokenization method.

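The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The collections in the present release appear as columns in the corpus size table below: Syndicate, Presseurop, Acquis, Europarl, Subtitles and Bible.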

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; for example, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16 published in October 2023 is 387 mil. words in the aligned foreign language texts in the core part and 4 506 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
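The pivot arrangement described above means that two foreign-language versions are related only through their shared Czech version. The sketch below illustrates this with a toy data structure; the segment identifiers, the `ALIGNMENTS` records and the function `align_via_pivot` are hypothetical and do not reflect the internal format of InterCorp or KonText.

```python
from collections import defaultdict

# Hypothetical alignment records: (Czech segment id, language code, foreign segment id).
ALIGNMENTS = [
    ("cs:doc1:s1", "en", "en:doc1:s1"),
    ("cs:doc1:s1", "de", "de:doc1:s1"),
    ("cs:doc1:s2", "en", "en:doc1:s2"),
    ("cs:doc1:s2", "en", "en:doc1:s3"),  # one Czech segment may map to several segments
    ("cs:doc1:s2", "de", "de:doc1:s2"),
]


def align_via_pivot(alignments, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech pivot segment."""
    by_pivot = defaultdict(lambda: defaultdict(list))
    for cs_id, lang, seg_id in alignments:
        by_pivot[cs_id][lang].append(seg_id)

    pairs = []
    for cs_id in sorted(by_pivot):
        langs = by_pivot[cs_id]
        # Keep only pivot segments that have a counterpart in both languages.
        if lang_a in langs and lang_b in langs:
            pairs.append((langs[lang_a], langs[lang_b]))
    return pairs


if __name__ == "__main__":
    for en_segs, de_segs in align_via_pivot(ALIGNMENTS, "en", "de"):
        print(en_segs, "<->", de_segs)
```

An alignment derived this way inherits any misalignment present on either side of the Czech pivot, which matters most for the automatically aligned collections.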

Setup of the parallel corpus – the core and collections


Setup of the parallel corpus – the core


Setup of the parallel corpus – collections
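The proportions shown in the charts can be recomputed from the rounded figures quoted in the text above (125 and 273 million words for Czech, 387 and 4 506 million words for the other languages). This is only a sketch of the arithmetic; the dictionary `SIZES_MIL` and the function `share_by_part` are names introduced here for illustration.

```python
# Rounded sizes in millions of words, as quoted in the text above.
SIZES_MIL = {
    ("Czech", "core"): 125,
    ("Czech", "collections"): 273,
    ("other", "core"): 387,
    ("other", "collections"): 4506,
}


def share_by_part(sizes):
    """Return the share of each part (core, collections) in the whole corpus."""
    total = sum(sizes.values())
    parts = {}
    for (_, part), n in sizes.items():
        parts[part] = parts.get(part, 0) + n
    return {part: n / total for part, n in parts.items()}


shares = share_by_part(SIZES_MIL)

if __name__ == "__main__":
    for part, share in shares.items():
        print(f"{part}: {share:.1%}")
```

With these rounded inputs the core accounts for just under a tenth of the corpus, and the collections for the rest.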

Corpus size in thousands of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Bible Total
af Afrikaans 0 0 0 0 0 136 0 136
ar Arabic 34 384 0 0 0 126 157 0 126 576
be Belarusian 7 131 0 0 0 0 0 0 7 131
bg Bulgarian 7 068 0 0 13 577 9 083 165 092 0 194 820
bn Bengali 0 0 0 0 0 1 554 0 1 554
br Breton 0 0 0 0 0 98 0 98
bs Bosnian 0 0 0 0 0 58 758 0 58 758
ca Catalan 10 112 0 0 0 0 2 735 736 13 582
cs Czech 124 680 4 717 2 312 19 214 12 917 233 139 563 397 542
da Danish 9 548 0 0 20 313 13 916 71 825 657 116 259
de German 40 679 5 067 2 483 20 610 13 089 98 566 724 181 219
el Greek 0 0 0 23 853 15 404 162 561 0 201 818
en English 42 395 5 273 2 670 22 902 15 576 280 335 730 369 882
eo Esperanto 0 0 0 0 0 226 0 226
es Spanish 30 661 6 074 2 859 26 262 16 249 223 134 0 305 240
et Estonian 79 0 0 14 896 10 899 54 514 0 80 388
eu Basque 0 0 0 0 0 3 022 0 3 022
fa Persian 0 0 0 0 0 33 167 0 33 167
fi Finnish 6 959 0 0 15 269 10 108 90 471 543 123 349
fr French 24 361 5 896 3 046 26 200 17 179 181 433 764 258 879
gl Galician 0 0 0 0 0 623 0 623
he Hebrew 0 0 0 0 0 130 143 0 130 143
hi Hindi 409 0 0 0 0 432 0 841
hr Croatian 24 529 0 0 0 0 137 966 571 163 066
hs Upper Sorbian 466 0 0 0 0 0 0 466
hu Hungarian 6 921 8 0 17 852 12 198 141 691 0 178 670
hy Armenian 0 0 0 0 0 24 0 24
id Indonesian 0 0 0 0 0 38 343 0 38 343
is Icelandic 0 0 0 0 0 7 375 0 7 375
it Italian 18 086 1 389 2 747 23 771 15 494 163 622 684 225 793
ja Japanese 3 818 2 0 0 0 12 485 0 16 305
ka Georgian 0 0 0 0 0 889 0 889
kk Kazakh 0 0 0 0 0 14 0 14
ko Korean 0 0 0 0 0 5 980 0 5 980
lt Lithuanian 696 0 0 17 316 11 213 5 269 471 34 964
lv Latvian 3 636 0 0 17 533 11 682 2 053 537 35 441
mk Macedonian 8 881 0 0 0 0 15 595 0 24 476
ml Malayalam 0 0 0 0 0 1 281 0 1 281
ms Malay 0 0 0 0 0 7 939 0 7 939
mt Maltese 0 0 0 13 935 0 0 0 13 935
nl Dutch 18 782 812 2 953 23 416 15 558 170 979 717 233 217
no Norwegian 8 221 0 0 0 0 39 807 724 48 752
pl Polish 28 597 0 2 380 19 604 12 817 169 498 583 233 480
pt Portuguese 7 285 739 2 782 24 598 15 193 229 515 706 280 818
rn Romani 14 0 0 0 0 0 0 14
ro Romanian 4 219 0 2 738 8 092 9 446 212 396 0 236 890
ru Russian 12 387 4 302 0 0 0 104 609 565 121 864
si Sinhala 0 0 0 0 0 2 346 0 2 346
sk Slovak 8 586 0 0 18 399 12 727 34 581 561 74 854
sl Slovene 4 636 0 0 18 515 12 241 83 000 0 118 392
sq Albanian 0 0 0 0 0 9 351 0 9 351
sr Serbian 12 706 0 0 0 0 152 636 0 165 342
sv Swedish 19 740 0 0 19 542 13 784 81 548 638 135 252
ta Tamil 0 0 0 0 0 104 0 104
te Telugu 0 0 0 0 0 96 0 96
th Thai 0 0 0 0 0 5 660 0 5 660
tl Tagalog 0 0 0 0 0 38 0 38
tr Turkish 0 0 0 0 0 149 892 0 149 892
uk Ukrainian 14 849 0 0 0 0 2 938 596 18 382
ur Urdu 0 0 0 0 0 158 0 158
vi Vietnamese 0 0 0 0 0 22 298 0 22 298
zh Chinese 238 838 0 0 0 71 331 0 72 407
TOTAL 511 408 35 503 26 971 425 670 276 772 4 001 428 12 069 5 289 821

N.B. 1: Languages printed in italics have no linguistic annotation.

N.B. 2: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
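Because the table uses a space as the thousands separator, a row copied as plain text can be ambiguous: in the Arabic row, "34 384" could be one number or two. One way to resolve this, sketched below, is to try every reading of the digit groups as eight numbers and keep those in which the last column (Total) equals the sum of the other seven, allowing a small tolerance for rounding to thousands. The function `read_flattened_row` and the tolerance of 4 are assumptions made for this illustration, not part of any InterCorp tooling.

```python
def read_flattened_row(numbers: str, n_cols: int = 8, tol: int = 4) -> list[tuple[int, ...]]:
    """Return every reading of space-separated digit groups as n_cols numbers
    whose last value matches the sum of the others within tol."""
    groups = numbers.split()
    results = []

    def walk(i, current):
        if len(current) > n_cols:
            return  # too many columns already, abandon this reading
        if i == len(groups):
            if len(current) == n_cols and abs(sum(current[:-1]) - current[-1]) <= tol:
                results.append(tuple(current))
            return
        g = groups[i]
        # Reading 1: the group starts a new number (no leading zeros unless it is "0").
        if g == "0" or not g.startswith("0"):
            walk(i + 1, current + [int(g)])
        # Reading 2: the group is a three-digit thousands group of the previous number.
        if current and current[-1] != 0 and len(g) == 3:
            walk(i + 1, current[:-1] + [current[-1] * 1000 + int(g)])

    walk(0, [])
    return results


if __name__ == "__main__":
    # Numeric part of the Arabic row from the table above.
    print(read_flattened_row("34 384 0 0 0 126 157 0 126 576"))
```

For the Arabic row only one reading passes the check: 34 thousand words in the core, 384 in Syndicate and 126 157 in Subtitles, which sum to the stated total up to rounding.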

Number of texts in the Core

Language Number of texts including originals
ar Arabic 3 1
be Belarusian 108 14
bg Bulgarian 87 19
ca Catalan 92 1
cs Czech 1 812 368
da Danish 93 9
de German 471 163
en English 422 271
es Spanish 355 142
et Estonian 1 0
fi Finnish 112 36
fr French 277 126
hi Hindi 7 2
hr Croatian 324 37
hs Upper Sorbian 13 5
hu Hungarian 89 1
it Italian 171 26
ja Japanese 35 15
lt Lithuanian 23 4
lv Latvian 73 15
mk Macedonian 108 4
nl Dutch 215 52
no Norwegian 102 23
pl Polish 348 54
pt Portuguese 87 24
rn Romani 2 2
ro Romanian 45 5
ru Russian 160 37
sk Slovak 165 62
sl Slovene 73 25
sr Serbian 148 13
sv Swedish 232 101
uk Ukrainian 199 8
zh Chinese 3 3
TOTAL 6 455 1 668

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Linguistic annotation

  • UDPipe (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Nádvorníková, O., Rosen, A., Šimík, B., Vavřín, M., Zasina, A. J. (2024). The InterCorp Corpus – Czech2), version 16ud of ?? June 2024. Institute of the Czech National Corpus, Charles University, Prague 2024. Available on-line: https://kontext.korpus.cz/

See also

1)
The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/UDPipe. Annotation of this release used the following models: TODO!!! arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.
2)
Insert languages actually used.