InterCorp Release 8

Name | Czech – core | Czech – collections | other – core | other – collections
Positions: number of tokens | 105 239 198 | 117 981 673 | 233 509 950 | 1 560 655 498
Positions: number of word forms | 84 718 325 | 89 645 545 | 194 055 340 | 1 229 043 791
Structural attributes: number of documents | 1 279 | 5 | 2 513 | 89
Structural attributes: number of div | 1 279 | 111 263 | 2 513 | 1 849 184
Structural attributes: number of sentences | 7 250 794 | 13 588 082 | 14 377 637 | 143 478 514

Further information: reference | YES
Further information: representative | NO
Further information: publication date | 2015
Further information: foreign languages | 38
Further information: tagged languages | 20
Further information: lemmatized languages | 17

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial is available in Czech and a brief summary also in English.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus may grow, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6).

References

If you publish results based on InterCorp we would appreciate a link to the project site www.korpus.cz/intercorp. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M.: Korpus InterCorp – English, German1), version 7 from 19 Dec 2014. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The present release includes the following collections: Syndicate, Presseurop, Acquis Communautaire, Europarl and Open Subtitles.

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource: texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 8 from May 2015 is 195 mil. words in the aligned foreign language texts in the core part and 1,229 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.

Setup of the parallel corpus – the core and collections
Setup of the parallel corpus – the core
Setup of the parallel corpus – collections
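
Because every text is aligned with its Czech version, two foreign-language versions can in principle be related to each other through Czech. The following is a minimal sketch of that idea, not a description of how InterCorp or KonText implements it: the segment identifiers (`en:1`, `cs:1`, ...) and the function `compose_via_pivot` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def compose_via_pivot(src_to_cs, cs_to_tgt):
    """Link source and target segments that share an aligned Czech segment.

    Both arguments map a segment id to the list of segment ids it is aligned
    with. The id format is invented for this example.
    """
    linked = defaultdict(set)
    for src_id, cs_ids in src_to_cs.items():
        for cs_id in cs_ids:
            # Follow the Czech segment to whatever it is aligned with
            # in the target language (possibly nothing).
            linked[src_id].update(cs_to_tgt.get(cs_id, ()))
    return {src_id: sorted(tgt_ids) for src_id, tgt_ids in linked.items()}


# Hypothetical alignments: the second English segment corresponds to two
# Czech segments, which in turn share a single German segment.
en_to_cs = {"en:1": ["cs:1"], "en:2": ["cs:2", "cs:3"]}
cs_to_de = {"cs:1": ["de:1"], "cs:2": ["de:2"], "cs:3": ["de:2"]}

en_to_de = compose_via_pivot(en_to_cs, cs_to_de)
print(en_to_de)
```

Any misalignment on either side of the pivot carries over into the composed result, which is one reason to treat such indirect pairs with more caution than direct alignments with Czech.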

Corpus size in thousands of words

Language | Core | Syndicate | Presseurop | Acquis | Europarl | Subtitles | Total
ar Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34
be Belarusian | 2 152 | 0 | 0 | 0 | 0 | 0 | 2 152
bg Bulgarian | 5 240 | 0 | 0 | 13 816 | 9 083 | 0 | 28 140
ca Catalan | 4 632 | 0 | 0 | 0 | 0 | 0 | 4 632
da Danish | 3 016 | 0 | 0 | 21 679 | 13 915 | 14 429 | 53 042
de German | 27 681 | 3 725 | 2 482 | 21 723 | 13 089 | 8 366 | 77 069
el Greek | 0 | 0 | 0 | 25 069 | 15 403 | 23 714 | 64 187
en English | 15 488 | 3 818 | 2 670 | 24 207 | 15 580 | 52 101 | 113 865
es Spanish | 17 475 | 4 324 | 2 816 | 27 001 | 15 885 | 36 378 | 103 882
et Estonian | 0 | 0 | 0 | 15 962 | 10 899 | 10 296 | 37 158
fi Finnish | 3 426 | 0 | 0 | 16 455 | 10 175 | 15 097 | 45 154
fr French | 9 170 | 4 393 | 2 928 | 27 351 | 17 178 | 25 961 | 86 983
he Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 16 221
hi Hindi | 408 | 0 | 0 | 0 | 0 | 0 | 408
hr Croatian | 15 479 | 0 | 0 | 0 | 0 | 19 092 | 34 572
hu Hungarian | 5 387 | 0 | 0 | 19 176 | 12 306 | 21 239 | 58 110
is Icelandic | 0 | 0 | 0 | 0 | 0 | 1 584 | 1 584
it Italian | 7 247 | 651 | 2 707 | 24 849 | 15 489 | 14 653 | 65 599
ja Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113
lt Lithuanian | 358 | 0 | 0 | 18 392 | 11 212 | 557 | 30 521
lv Latvian | 1 336 | 0 | 0 | 18 744 | 11 688 | 280 | 32 050
mk Macedonian | 3 741 | 0 | 0 | 0 | 0 | 1 877 | 5 619
ms Malay | 0 | 0 | 0 | 0 | 0 | 3 520 | 3 520
mt Maltese | 0 | 0 | 0 | 14 133 | 0 | 0 | 14 133
nl Dutch | 9 961 | 313 | 2 955 | 24 746 | 15 563 | 29 362 | 82 903
no Norwegian | 4 815 | 0 | 0 | 0 | 0 | 0 | 4 815
pl Polish | 17 516 | 0 | 2 378 | 20 627 | 12 811 | 26 572 | 79 905
pt Portuguese | 2 393 | 369 | 2 999 | 28 602 | 16 484 | 43 391 | 94 241
ro Romanian | 3 432 | 0 | 2 737 | 8 199 | 9 446 | 34 128 | 57 944
ru Russian | 3 337 | 3 174 | 0 | 0 | 0 | 6 885 | 13 397
sk Slovak | 7 401 | 0 | 0 | 19 222 | 12 734 | 5 134 | 44 493
sl Slovenian | 900 | 0 | 0 | 19 645 | 12 240 | 17 024 | 49 810
sq Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 2 003
sr Serbian | 8 823 | 0 | 0 | 0 | 0 | 20 776 | 29 600
sv Swedish | 8 138 | 0 | 0 | 20 585 | 13 840 | 14 693 | 57 258
tr Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 21 190
uk Ukrainian | 5 054 | 0 | 0 | 0 | 0 | 246 | 5 300
vi Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 473 | 1 473
Subtotal | 194 055 | 20 769 | 24 676 | 430 195 | 265 029 | 488 372 | 1 423 098
cs Czech | 84 718 | 3 416 | 2 315 | 20 303 | 12 922 | 50 688 | 174 363
TOTAL | 278 773 | 24 185 | 26 991 | 450 498 | 277 951 | 539 060 | 1 597 462

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
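
The relation between the last three rows of the table can be checked with a few lines of code. This is only a sketch: the figures are copied by hand from the Subtotal, Czech and TOTAL rows above, and the variable and function names are invented for the example. Since the table is given in thousands of words, a difference of one unit is assumed to be a rounding effect.

```python
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis", "Europarl", "Subtitles", "Total"]

# Figures in thousands of words, copied from the table above.
SUBTOTAL = [194_055, 20_769, 24_676, 430_195, 265_029, 488_372, 1_423_098]
CZECH = [84_718, 3_416, 2_315, 20_303, 12_922, 50_688, 174_363]
TOTAL = [278_773, 24_185, 26_991, 450_498, 277_951, 539_060, 1_597_462]


def column_differences(subtotal, czech, total):
    """Return TOTAL - (Subtotal + Czech) for every column."""
    return {
        name: t - (s + c)
        for name, s, c, t in zip(COLUMNS, subtotal, czech, total)
    }


diffs = column_differences(SUBTOTAL, CZECH, TOTAL)
for name, diff in diffs.items():
    print(f"{name:<11} difference: {diff}")
```

With these figures the six collection columns add up exactly, and the Total column differs by one thousand words, which is consistent with rounding.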

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Czech in Czech in English2) in English Morče
Dutch in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian in Estonian and English TreeTagger
Finnish in English3) OMorFi+HunPOS
French in English TreeTagger
German in English4) in German RFTagger
Hungarian in English HunPos
Icelandic IceStagger
Italian in English TreeTagger
Lithuanian in Czech and English in English Author: Vidas Daudaravičius
Norwegian in English in Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Portuguese Spanish TreeTagger
Russian in English in English5) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene English totale
Spanish in English TreeTagger
Swedish Stagger

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same applies to Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction but is split anyway. Only the individual parts of a contracted form are assigned a tag and a lemma, so a query intended to find the whole contracted form should be entered as a Phrase, with the split parts separated by a space.
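
As an illustration of the splitting described above, the sketch below turns a contracted form into the space-separated string that would be typed as a Phrase, and into an equivalent sequence of token conditions. The mapping `KNOWN_SPLITS` contains only the four examples from the text and is not the tagger's actual rule set; the function names, and the `[word="..."]` notation with a `word` attribute, are assumptions made for the example.

```python
# Illustrative splits taken from the examples in the text. The real
# tokenization rules of the taggers are more extensive.
KNOWN_SPLITS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def split_form(form):
    """Return the parts of a contracted form, or the form itself if unknown."""
    return KNOWN_SPLITS.get(form, [form])


def phrase_for(form):
    """Build the string to enter as a Phrase: parts separated by a space."""
    return " ".join(split_form(form))


def token_query_for(form):
    """Build one token condition per part (assumed [word="..."] notation)."""
    return " ".join(f'[word="{part}"]' for part in split_form(form))


print(phrase_for("can't"))
print(token_query_for("I'm"))
print(phrase_for("gdybyś"))
```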

Some morphological tags include characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$". In queries, such characters must be preceded by a backslash: tag="wp\$".
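
When tag conditions are generated by a script, the escaping can be left to a library function. The sketch below uses Python's `re.escape`; the helper name `tag_condition` is invented for the example, and the helper is meant for literal tags only, since it would also escape characters that were intended as regular-expression operators.

```python
import re


def tag_condition(tag):
    """Build a tag condition in which regex metacharacters are backslash-escaped.

    Intended for literal tags such as "wp$". Do not use it for patterns
    like "N.*", where the special characters are meant as operators.
    """
    return f'tag="{re.escape(tag)}"'


print(tag_condition("wp$"))
```

For the tag from the text this prints `tag="wp\$"`, the same form as the hand-written example.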

Structural attributes

Structure | Attribute | Description | Values
doc | doc.id | unique document identifier | text
doc | doc.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
doc | doc.version | version | number
doc | doc.wordcount | document size in words | number
div | div.id | text identification | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE
div | div.group | division in | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate
div | div.wordcount | number of words | number
div | div.author | author | last name, first name
div | div.title | full title | text
div | div.publisher | publisher | text
div | div.pubplace | publication place | text
div | div.pubyear | publication year | date
div | div.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles
div | div.original | is the text an original? | Yes / No
div | div.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
div | div.translator | translator | last name, first name
div | div.transsex | translator's gender | F / M
div | div.authsex | author's gender | F / M
p | p.id | unique paragraph identifier | text
s | s.id | unique sentence identifier | text
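
Structural attributes with a closed set of values lend themselves to a simple check before a query is sent. The sketch below validates a value against the lists in the table and builds a restriction clause. Note the assumptions: the `within <div attribute="value" />` notation follows the usual corpus query language convention and is not confirmed by this page, and the names `DIV_VALUES` and `within_div` as well as the `word` attribute in the usage line are invented for the example.

```python
# Closed value sets copied from the table of structural attributes.
DIV_VALUES = {
    "group": {"Core", "Acquis", "Europarl", "PressEurop", "Subtitles", "Syndicate"},
    "original": {"Yes", "No"},
    "transsex": {"F", "M"},
    "authsex": {"F", "M"},
}


def within_div(attribute, value):
    """Return a restriction clause for one div attribute after validating it.

    The clause syntax is an assumption about the query language.
    """
    allowed = DIV_VALUES.get(attribute)
    if allowed is None:
        raise KeyError(f"no closed value set recorded for div.{attribute}")
    if value not in allowed:
        raise ValueError(f"{value!r} is not a listed value of div.{attribute}")
    return f'within <div {attribute}="{value}" />'


# Usage: restrict a (hypothetical) word query to the manually aligned core.
query = '[word="time"] ' + within_div("group", "Core")
print(query)
```

Values are case-sensitive as listed, so `within_div("group", "core")` raises a `ValueError` rather than silently producing a query that matches nothing.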

Number of texts in the core of the corpus by languages of the text and languages of the original

Language of the original
↓ Language of the text ar be bg ca cs da de en es fi fr hi hr hu it lt lv mk nl no pl pt ro ru sk sl sr sv uk total other
ar 1 1 1 3
be 3 8 4 13 1 1 1 3 2 1 1 1 39
bg 19 9 1 27 4 2 1 1 2 2 68
ca 1 16 3 12 5 1 2 3 1 1 45 1
cs 1 3 19 1 267 9 134 242 127 24 95 2 26 1 20 1 7 1 30 7 49 21 39 56 3 8 58 6 1257
da 6 9 12 27
de 85 126 65 10 1 4 1 7 1 1 6 3 3 2 3 1 3 5 327
en 25 4 125 3 1 2 1 1 6 5 4 177 1
es 1 25 8 29 126 1 6 7 1 4 2 3 213 1
fi 11 1 1 12 2 25 1 1 1 2 57 1
fr 36 1 10 83 2 1 2 2 137
hi 2 1 1 2 1 7
hr 1 71 15 52 11 2 4 26 6 7 1 3 4 1 1 8 213 2
hu 16 5 23 9 1 3 14 71
it 4 4 21 9 1 3 19 3 1 3 68 1
lt 8 2 2 1 1 2 1 17
lv 22 2 1 1 7 2 1 36
mk 15 1 16 1 1 1 2 1 3 2 2 4 49
nl 24 3 33 7 3 3 30 2 2 3 3 6 119
no 11 5 21 4 1 3 6 2 1 54
pl 36 8 97 10 2 8 2 1 1 3 1 46 4 6 1 5 231 1
pt 6 8 15 29
ro 7 5 12 3 1 1 1 1 1 1 33 3
ru 9 1 22 2 1 1 22 1 3 62 1
sk 55 2 5 1 1 2 56 122 18
sl 7 1 2 1 2 2 15
sr 11 7 33 9 3 7 2 4 3 10 1 5 2 97 3
sv 11 4 23 7 2 1 1 50 99 1
uk 6 1 31 3 5 2 5 3 5 6 67
total 2 6 39 3 810 19 349 950 335 57 241 4 56 2 89 5 18 3 84 22 128 72 119 118 6 26 164 12
  • The table shows the number of texts in the core of InterCorp.
  • For each language that has texts in the core, the row gives the number of texts by language of the original (shown in the column header). E.g., in Arabic there is one Arabic, one Czech and one German original text in the core, for a total of three texts in Arabic (see the penultimate column).
  • Each column shows how many original texts in the language named in the column header have been translated into other languages. The codes of these languages are in the first column. The last column shows the number of original texts in other languages, which are not in the core of InterCorp.
  • The diagonal gives the number of original texts in a given language. E.g., in Hungarian and Romanian there is none, in Romanian not even a translated one.

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

  • parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

  • MorfFlex, Morče and LanGr for Czech
  • TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
  • Morfeusz and TaKIPI for Polish
  • HunPOS for Hungarian and other languages
  • Tagger for Slovak (thanks to Radovan Garabík)
  • Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
  • Tagger for Norwegian (thanks to Pavel Vondřička)
  • totale for Slovene (thanks to Tomaž Erjavec)
  • RFTagger for German
  • OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
  • Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)

1) Insert the languages actually used.
2) There is a helper application to assist you with queries including Czech morphological tags. Click here.
3) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
4) Within a single tag, a colon is used instead of a comma as the separator of individual morphological categories, e.g. ADJA:Pos:Nom:Sg:Fem.
5) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as "P-". All tags, as used in the corpus, are listed in the brief description.