InterCorp Release 13ud – Universal Dependencies

Name                                  Czech – core   Czech – collections   other – core   other – collections
Positions
  Number of tokens                    141,032,521    116,673,043           394,042,551    1,550,071,364
  Number of word forms                113,838,505    89,819,773            327,968,369    1,223,270,610
Structural attributes
  Number of documents                 1,657          30                    3,994          282
  Number of texts                     1,657          111,951               3,994          1,843,528
  Number of sentences                 9,782,002      13,606,198            24,318,736     143,196,252
Further information
  reference                           YES
  representative                      NO
  publication date                    2021
  foreign languages                   40
  tagged languages                    35
  lemmatized languages                35
  syntactically annotated languages   35

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; for one of the ICNC corpora it is also available in English, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Martin Vavřín if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the Universal Dependencies scheme.

Main differences between releases 13 and 13ud

  • In release 13ud, 36 of the 41 languages (including Czech) are linguistically annotated; all of these are also syntactically annotated.
  • Texts are annotated in the same way in all languages, according to the UD standard (Universal Dependencies).
  • For a detailed description of UD as used in the annotation of InterCorp see Universal Dependencies.
  • Annotation was performed for all languages by UDPipe, based on the data created in the UD project.1)
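To make the UD annotation more concrete, here is a minimal sketch of reading UD-style annotation in CoNLL-U, the 10-column tab-separated format defined by the UD project and produced by UDPipe. The sample sentence and the names `Token`, `parse_conllu` and `SAMPLE` are illustrative assumptions, not taken from InterCorp, and KonText may expose the same attributes under different names.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns (hypothetical helper type)."""
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def parse_conllu(text: str) -> List[Token]:
    """Parse token lines of one CoNLL-U sentence, skipping comments and blanks."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        tokens.append(Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    return tokens


# Hand-written illustrative sentence, not real corpus data.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "Tense=Pres", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

for token in parse_conllu(SAMPLE):
    print(token.form, token.lemma, token.upos, token.deprel, token.head)
```

Because the same tag set and dependency labels are used for every annotated language, a reader like this one would not need language-specific branches.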

Texts in the corpus

InterCorp release 13ud contains the same texts as InterCorp release 13. They differ only in linguistic annotation. However, the token and word count data in release 13ud may differ slightly due to a different tokenization method.

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes:

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; in particular, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections


Setup of the parallel corpus – the core


Setup of the parallel corpus – collections
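As a rough stand-in for the charts, the share of the core can be recomputed from the word counts quoted above. This is only a sketch: the figures are the rounded values in millions of words from the text, and the names `SIZES` and `core_share` are hypothetical.

```python
# Rounded word counts in millions, as quoted in the paragraph above.
SIZES = {
    "foreign": {"core": 328, "collections": 1223},
    "czech": {"core": 114, "collections": 90},
}


def core_share(part: dict) -> float:
    """Return the core's share of core + collections, in percent."""
    total = part["core"] + part["collections"]
    return 100 * part["core"] / total


for name, part in SIZES.items():
    print(f"{name}: core = {core_share(part):.1f} % of words")
```

With these inputs the core accounts for roughly 21 % of the foreign-language words and roughly 56 % of the Czech words.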

Corpus size in thousands of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Bible Total
ar Arabic 34 0 0 0 0 0 0 34
be Belarusian 5,718 0 0 0 0 0 0 5,718
bg Bulgarian 7,068 0 0 13,577 9,083 0 0 29,728
ca Catalan 7,938 0 0 0 0 0 736 8,674
da Danish 7,136 0 0 20,313 13,916 14,429 657 56,451
de German 37,633 4,704 2,483 20,610 13,088 8,392 724 87,634
el Greek 0 0 0 23,853 15,404 23,709 0 62,966
en English 33,569 4,856 2,670 22,902 15,576 52,106 730 132,409
es Spanish 26,554 5,614 2,859 26,262 16,249 36,650 0 114,187
et Estonian 0 0 0 14,896 10,899 10,298 0 36,093
fi Finnish 5,656 0 0 15,269 10,108 15,047 543 46,622
fr French 19,773 5,600 3,046 26,200 17,179 25,986 764 98,547
he Hebrew 0 0 0 0 0 16,221 0 16,221
hi Hindi 409 0 0 0 0 0 0 409
hr Croatian 21,923 0 0 0 0 19,048 571 41,543
hu Hungarian 6,444 0 0 17,852 12,198 21,115 0 57,609
is Icelandic 0 0 0 0 0 1,581 0 1,581
it Italian 14,525 1,252 2,747 23,771 15,494 14,700 684 73,174
ja Japanese 2,189 0 0 0 0 477 0 2,666
lt Lithuanian 421 0 0 17,316 11,213 558 471 29,979
lv Latvian 2,646 0 0 17,522 11,682 280 537 32,667
mk Macedonian 8,881 0 0 0 0 1,877 0 10,758
ms Malay 0 0 0 0 0 3,521 0 3,521
mt Maltese 0 0 0 13,935 0 0 0 13,935
nl Dutch 16,216 813 2,953 23,416 15,558 29,373 717 89,045
no Norwegian 7,727 0 0 0 0 0 722 8,449
pl Polish 26,200 0 2,380 19,604 12,817 26,576 583 88,161
pt Portuguese 4,981 554 2,782 24,598 15,193 41,468 706 90,282
rn Romani 14 0 0 0 0 0 0 14
ro Romanian 4,219 0 2,738 8,092 9,446 34,128 0 58,622
ru Russian 8,642 3,984 0 0 0 6,887 565 20,078
sk Slovak 8,543 0 0 18,399 12,727 5,133 561 45,363
sl Slovene 3,871 0 0 18,528 12,251 17,061 0 51,711
sq Albanian 0 0 0 0 0 2,003 0 2,003
sr Serbian 11,582 0 0 0 0 20,727 0 32,308
sv Swedish 15,790 0 0 19,542 13,784 14,666 638 64,419
tr Turkish 0 0 0 0 0 21,190 0 21,190
uk Ukrainian 11,459 0 0 0 0 244 596 12,299
vi Vietnamese 0 0 0 0 0 1,474 0 1,474
zh Chinese 127 240 0 0 0 2,247 0 2,614
Subtotal 327,887 27,616 24,658 406,459 263,864 489,169 11,504 1,551,157
cs Czech 113,839 4,351 2,310 19,085 12,908 50,604 562 203,658
TOTAL 441,725 31,967 26,968 425,543 276,772 539,774 12,066 1,754,815

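N.B. 1: Languages printed in italics have no linguistic annotation; these are the five languages without a UDPipe model in footnote 1: Albanian, Icelandic, Macedonian, Malay and Romani.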

N.B. 2: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
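The table can be sanity-checked by comparing each row's stated total with the sum of its parts. The sketch below does this for four rows copied from the table. Some rows differ by 1, presumably because every cell is rounded to thousands independently; that explanation and the tolerance of 4 are assumptions, and `ROWS`, `row_discrepancy` and `check_rows` are hypothetical names.

```python
# (language code, [Core, Syndicate, Presseurop, Acquis, Europarl, Subtitles, Bible], Total)
# Values in thousands of words, copied from the table above.
ROWS = [
    ("de", [37633, 4704, 2483, 20610, 13088, 8392, 724], 87634),
    ("hr", [21923, 0, 0, 0, 0, 19048, 571], 41543),
    ("sr", [11582, 0, 0, 0, 0, 20727, 0], 32308),
    ("cs", [113839, 4351, 2310, 19085, 12908, 50604, 562], 203658),
]


def row_discrepancy(parts, total):
    """Sum of the parts minus the stated total (0 means an exact match)."""
    return sum(parts) - total


def check_rows(rows, tolerance=4):
    """Return {language: discrepancy} for rows that exceed the tolerance.

    The default tolerance assumes up to 0.5 rounding error in each of the
    seven parts and in the total; this is an assumption, not a documented rule.
    """
    return {
        lang: row_discrepancy(parts, total)
        for lang, parts, total in rows
        if abs(row_discrepancy(parts, total)) > tolerance
    }


for lang, parts, total in ROWS:
    print(lang, row_discrepancy(parts, total))
print("rows outside tolerance:", check_rows(ROWS))
```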

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Linguistic annotation

  • UDPipe (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). The InterCorp Corpus – Czech2), version 13ud of 22 December 2021. Institute of the Czech National Corpus, Charles University, Prague 2021. Available on-line: https://kontext.korpus.cz/

1)
The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/IUDPipe. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.
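The model names above follow a visible pattern that can be split mechanically. The sketch below reads each name as language, treebank, UD version and a six-digit stamp; that reading is an assumption based only on how the names look, and `MODEL_NAME`, `split_model_list` and `parse_model_name` are hypothetical helpers, not part of UDPipe.

```python
import re

# Assumed pattern: <language>-<treebank>-ud-<version>-<six-digit stamp>
MODEL_NAME = re.compile(
    r"(?P<language>[a-z]+)-(?P<treebank>[a-z]+)-ud-"
    r"(?P<ud_version>\d+\.\d+)-(?P<stamp>\d{6})"
)


def split_model_list(text: str) -> list:
    """Split a model list on commas and/or whitespace, dropping a final period."""
    items = (item.rstrip(".") for item in re.split(r"[,\s]+", text))
    return [item for item in items if item]


def parse_model_name(name: str) -> dict:
    """Return the named parts of one model name, or raise ValueError."""
    match = MODEL_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected model name: {name!r}")
    return match.groupdict()


listing = "czech-fictree-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830."
for model in split_model_list(listing):
    parts = parse_model_name(model)
    print(parts["language"], parts["treebank"], parts["ud_version"])
```

Grouping the parsed names by `language` gives a quick way to count the annotated languages against the figures stated for this release.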
2)
Insert languages actually used.