
This is an old revision of the document!


InterCorp

Name | Czech – core | Czech – collections | other – core | other – collections
Positions
Number of tokens | 105 239 198 | 117 981 673 | 233 509 950 | 1 560 655 498
Number of word forms | 84 718 325 | 89 645 545 | 194 055 340 | 1 229 043 791
Structural attributes
Number of documents | 1 279 | 5 | 2 513 | 89
Number of div | 1 279 | 111 263 | 2 513 | 1 849 184
Number of sentences | 7 250 794 | 13 588 082 | 14 377 637 | 143 478 514
Further information
reference | YES
representative | NO
publication date | 2015
foreign languages | 38
tagged languages | 20
lemmatized languages | 17

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial in Czech is available here.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release the corpus may grow in size, and possibly also in the number of languages and in the extent and quality of annotation. Previous versions remain available (starting with release 6).

References

We would appreciate a link to the project site www.korpus.cz/intercorp in the results of your work based on InterCorp. You might also consider adding the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3):411–427 (bibtex1), electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The present release includes the collections listed as columns of the corpus size table: Syndicate, Presseurop, Acquis, Europarl and Subtitles.

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; in particular, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 8 from May 2015 is 195 mil. words in the aligned foreign language texts in the core part and 1,229 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.

Setup of the parallel corpus – the core and collections
Setup of the parallel corpus – the core
Setup of the parallel corpus – collections

Corpus size in the number of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Total
ar Arabic 34,325 0 0 0 0 0 34,325
be Belarusian 2,152,724 0 0 0 0 0 2,152,724
bg Bulgarian 5,240,831 0 0 13,816,405 9,083,403 0 28,140,639
ca Catalan 4,632,696 0 0 0 0 0 4,632,696
da Danish 3,016,838 0 0 21,679,997 13,915,841 14,429,778 53,042,454
de German 27,681,897 3,725,002 2,482,920 21,723,929 13,089,209 8,366,765 77,069,722
el Greek 0 0 0 25,069,611 15,403,662 23,714,597 64,187,870
en English 15,488,167 3,818,127 2,670,157 24,207,801 15,580,109 52,101,283 113,865,644
es Spanish 17,475,748 4,324,428 2,816,401 27,001,343 15,885,394 36,378,715 103,882,029
et Estonian 0 0 0 15,962,544 10,899,550 10,296,031 37,158,125
fi Finnish 3,426,226 0 0 16,455,144 10,175,256 15,097,653 45,154,279
fr French 9,170,042 4,393,051 2,928,227 27,351,591 17,178,444 25,961,848 86,983,203
he Hebrew 0 0 0 0 0 16,221,237 16,221,237
hi Hindi 408,616 0 0 0 0 0 408,616
hr Croatian 15,479,547 0 0 0 0 19,092,559 34,572,106
hu Hungarian 5,387,533 0 0 19,176,514 12,306,692 21,239,634 58,110,373
is Icelandic 0 0 0 0 0 1,584,758 1,584,758
it Italian 7,247,545 651,502 2,707,648 24,849,477 15,489,468 14,653,613 65,599,253
ja Japanese 0 0 0 0 0 113,320 113,320
lt Lithuanian 358,253 0 0 18,392,644 11,212,864 557,961 30,521,722
lv Latvian 1,336,888 0 0 18,744,927 11,688,597 280,117 32,050,529
mk Macedonian 3,741,900 0 0 0 0 1,877,210 5,619,110
ms Malay 0 0 0 0 0 3,520,701 3,520,701
mt Maltese 0 0 0 14,133,133 0 0 14,133,133
nl Dutch 9,961,680 313,998 2,955,637 24,746,144 15,563,231 29,362,826 82,903,516
no Norwegian 4,815,797 0 0 0 0 0 4,815,797
pl Polish 17,516,332 0 2,378,025 20,627,627 12,811,143 26,572,483 79,905,610
pt Portuguese 2,393,287 369,434 2,999,903 28,602,556 16,484,692 43,391,919 94,241,791
ro Romanian 3,432,615 0 2,737,807 8,199,565 9,446,369 34,128,511 57,944,867
ru Russian 3,337,545 3,174,152 0 0 0 6,885,753 13,397,450
sk Slovak 7,401,998 0 0 19,222,784 12,734,444 5,134,150 44,493,376
sl Slovenian 900,221 0 0 19,645,598 12,240,548 17,024,593 49,810,960
sq Albanian 0 0 0 0 0 2,003,579 2,003,579
sr Serbian 8,823,894 0 0 0 0 20,776,850 29,600,744
sv Swedish 8,138,161 0 0 20,585,800 13,840,373 14,693,861 57,258,195
tr Turkish 0 0 0 0 0 21,190,828 21,190,828
uk Ukrainian 5,054,034 0 0 0 0 246,059 5,300,093
vi Vietnamese 0 0 0 0 0 1,473,591 1,473,591
Subtotal 194,055,340 20,769,694 24,676,725 430,195,134 265,029,289 488,372,783 1,423,098,965
cs Czech 84,718,325 3,416,272 2,315,118 20,303,101 12,922,658 50,688,186 174,363,660
TOTAL 278,773,665 24,185,966 26,991,843 450,498,235 277,951,947 539,060,969 1,597,462,625

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
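
The Total column is the sum of the six preceding columns, and this can be checked mechanically. The following is a minimal Python sketch, assuming that rows are copied from the table as whitespace-separated text and that the language name is a single word. The function names parse_row and row_is_consistent are illustrative and not part of any CNC tool.

```python
def parse_row(line):
    """Split a table row into (code, language, counts, total)."""
    fields = line.split()
    code, language = fields[0], fields[1]
    # Numbers use a comma as the thousands separator.
    numbers = [int(f.replace(",", "")) for f in fields[2:]]
    return code, language, numbers[:-1], numbers[-1]


def row_is_consistent(line):
    """True if the last number equals the sum of the preceding ones."""
    _, _, counts, total = parse_row(line)
    return sum(counts) == total


# Two rows copied from the table above.
rows = [
    "en English 15,488,167 3,818,127 2,670,157 24,207,801 15,580,109 52,101,283 113,865,644",
    "ru Russian 3,337,545 3,174,152 0 0 0 6,885,753 13,397,450",
]
for row in rows:
    print(parse_row(row)[1], row_is_consistent(row))
```

The same check does not apply to the Czech row in the sense of summing over languages, because each Czech text is counted only once.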

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Czech in Czech in English2) in English Morče
Dutch in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian in Estonian and English TreeTagger
Finnish in English3) OMorFi+HunPOS
French in English TreeTagger
German in English4) in German RFTagger
Hungarian in English HunPos
Icelandic IceStagger
Italian in English TreeTagger
Lithuanian in Czech and English in English Author: Vidas Daudaravičius
Norwegian in English in Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Portuguese Spanish TreeTagger
Russian in English in English5) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene English totale
Spanish in English TreeTagger
Swedish Stagger

Queries for contracted forms in tagged or lemmatized texts may fail. Forms such as can't or I'm are split by the tagger into two parts (ca+n't and I+'m), and only the individual parts are assigned a tag and a lemma. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
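
As a rough illustration of the advice above, the following Python sketch turns a contracted form into a Phrase query string and into an equivalent token-by-token CQL expression. The SPLITS table is hypothetical and contains only the examples mentioned in the text; the actual splits depend on the tagger used for each language. The sketch also assumes that the parts contain no double quotes or regular-expression metacharacters.

```python
# Split forms taken from the examples in the text; a real table would have
# to be compiled by hand for each language (assumption of this sketch).
SPLITS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_query(form):
    """Return the form as a Phrase query: split parts separated by a space."""
    parts = SPLITS.get(form, [form])
    return " ".join(parts)


def cql_query(form):
    """Return one CQL token expression per split part."""
    parts = SPLITS.get(form, [form])
    return " ".join('[word="{}"]'.format(p) for p in parts)


print(phrase_query("can't"))
print(cql_query("I'm"))
```

Forms that are not in the table are passed through unchanged, so the helper is safe to apply to ordinary words as well.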

Morphological tags that contain characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must have those characters preceded by a backslash in queries: tag="wp\$".
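
The escaping can be automated when queries are generated by a script. The following is a minimal Python sketch; the set of characters treated as special is an assumption and may not match exactly what the query engine interprets, and the helper names escape_tag and tag_query are illustrative only.

```python
# Characters treated here as regular-expression metacharacters. The exact
# set honoured by the query engine is an assumption of this sketch.
SPECIAL = set("\\.^$*+?()[]{}|")


def escape_tag(tag):
    """Put a backslash in front of every special character in the tag."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in tag)


def tag_query(tag):
    """Build a tag condition that matches the tag literally."""
    return 'tag="{}"'.format(escape_tag(tag))


print(tag_query("wp$"))
```

For the tag wp$ this prints tag="wp\$", the form given above.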

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

  • parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

  • MorfFlex, Morče and LanGr for Czech
  • TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
  • Morfeusz and TaKIPI for Polish
  • HunPOS for Hungarian and other languages
  • Tagger for Slovak (thanks to Radovan Garabík)
  • Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
  • Tagger for Norwegian (thanks to Pavel Vondřička)
  • totale for Slovene (thanks to Tomaž Erjavec)
  • RFTagger for German
  • OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
  • Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)

Citing InterCorp

Rosen, A. – Vavřín, M.: Korpus InterCorp – English, German6), version 7 from 19 Dec 2014. Ústav Českého národního korpusu FF UK, Praha 2014. Available on-line: http://www.korpus.cz

Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.


1)
@article{cermak:rosen:10,
  Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen},
  Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus},
  Journal = {International Journal of Corpus Linguistics},
  Volume = {17},
  Number = {3},
  Pages = {411--427},
  Year = {2012},
  Issn = {1384-6655},
  Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}
}
2)
There is a helper application to assist you with queries including Czech morphological tags. Click here.
3)
The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
4)
Within a single tag, a colon is used instead of a period as the separator of individual morphological categories, e.g. ADJA:Pos:Nom:Sg:Fem.
5)
Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as "P-". All tags, as used in the corpus, are listed in the brief description.
6)
Insert the languages actually used.