InterCorp

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	105 239 198	117 981 673	233 509 950	1 560 655 498
Positions	Number of word forms	84 718 325	89 645 545	194 055 340	1 229 043 791
Structural attributes	Number of documents	1 279	5	2 513	89
	Number of div	1 279	111 263	2 513	1 849 184
	Number of sentences	7 250 794	13 588 082	14 377 637	143 478 514
Further information	reference	YES
	representative	NO
	publication date	2015
	foreign languages	38
	tagged languages	20
	lemmatized languages	17

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed in a standard web browser through KonText, the integrated search interface of the Czech National Corpus. A tutorial in Czech is available here.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release the corpus may grow in size, and possibly also in the number of languages and in the extent and quality of annotation. Previous versions remain available (starting with release 6).

References

We would appreciate a link to the project site www.korpus.cz/intercorp in the results of your work based on InterCorp. You might also consider adding the following reference to your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3):411–427 (bibtex¹⁾, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The choice in the present release includes:

Political commentaries published by Project Syndicate and Presseurop
A package of legal texts of the European Union from the Acquis Communautaire corpus
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
Film subtitles from the Open Subtitles database

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resources; among those left out are texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 8 from May 2015 is 194 mil. words in the aligned foreign-language texts in the core part and 1,229 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
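
The pivot arrangement described above (one Czech version per text, each foreign version aligned with it) can be illustrated with a small Python sketch. It assumes, as one consequence of that arrangement, that two foreign versions can be related through the Czech segments they share. The alignment tables, segment identifiers and the function name pair_via_pivot are hypothetical and do not reflect the corpus's actual data format.

```python
from typing import Dict, List, Tuple

# Hypothetical alignment table: Czech segment id -> ids of aligned foreign segments.
Alignment = Dict[str, List[str]]


def pair_via_pivot(cs_to_a: Alignment, cs_to_b: Alignment) -> List[Tuple[List[str], List[str]]]:
    """Pair segments of two foreign versions that share a Czech segment."""
    pairs = []
    for cs_id, a_ids in cs_to_a.items():
        b_ids = cs_to_b.get(cs_id)
        # Skip Czech segments that have no counterpart on either side.
        if a_ids and b_ids:
            pairs.append((a_ids, b_ids))
    return pairs


# Invented toy data, for illustration only.
cs_en = {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"], "cs:3": []}
cs_de = {"cs:1": ["de:1"], "cs:2": ["de:2"]}

print(pair_via_pivot(cs_en, cs_de))
```

In this toy example the English and German versions are never aligned with each other directly; every pair is derived from a shared Czech segment.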

Corpus size in the number of words

Language		Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Total
ar	Arabic	34,325	0	0	0	0	0	34,325
be	Belarusian	2,152,724	0	0	0	0	0	2,152,724
bg	Bulgarian	5,240,831	0	0	13,816,405	9,083,403	0	28,140,639
ca	Catalan	4,632,696	0	0	0	0	0	4,632,696
da	Danish	3,016,838	0	0	21,679,997	13,915,841	14,429,778	53,042,454
de	German	27,681,897	3,725,002	2,482,920	21,723,929	13,089,209	8,366,765	77,069,722
el	Greek	0	0	0	25,069,611	15,403,662	23,714,597	64,187,870
en	English	15,488,167	3,818,127	2,670,157	24,207,801	15,580,109	52,101,283	113,865,644
es	Spanish	17,475,748	4,324,428	2,816,401	27,001,343	15,885,394	36,378,715	103,882,029
et	Estonian	0	0	0	15,962,544	10,899,550	10,296,031	37,158,125
fi	Finnish	3,426,226	0	0	16,455,144	10,175,256	15,097,653	45,154,279
fr	French	9,170,042	4,393,051	2,928,227	27,351,591	17,178,444	25,961,848	86,983,203
he	Hebrew	0	0	0	0	0	16,221,237	16,221,237
hi	Hindi	408,616	0	0	0	0	0	408,616
hr	Croatian	15,479,547	0	0	0	0	19,092,559	34,572,106
hu	Hungarian	5,387,533	0	0	19,176,514	12,306,692	21,239,634	58,110,373
is	Icelandic	0	0	0	0	0	1,584,758	1,584,758
it	Italian	7,247,545	651,502	2,707,648	24,849,477	15,489,468	14,653,613	65,599,253
ja	Japanese	0	0	0	0	0	113,320	113,320
lt	Lithuanian	358,253	0	0	18,392,644	11,212,864	557,961	30,521,722
lv	Latvian	1,336,888	0	0	18,744,927	11,688,597	280,117	32,050,529
mk	Macedonian	3,741,900	0	0	0	0	1,877,210	5,619,110
ms	Malay	0	0	0	0	0	3,520,701	3,520,701
mt	Maltese	0	0	0	14,133,133	0	0	14,133,133
nl	Dutch	9,961,680	313,998	2,955,637	24,746,144	15,563,231	29,362,826	82,903,516
no	Norwegian	4,815,797	0	0	0	0	0	4,815,797
pl	Polish	17,516,332	0	2,378,025	20,627,627	12,811,143	26,572,483	79,905,610
pt	Portuguese	2,393,287	369,434	2,999,903	28,602,556	16,484,692	43,391,919	94,241,791
ro	Romanian	3,432,615	0	2,737,807	8,199,565	9,446,369	34,128,511	57,944,867
ru	Russian	3,337,545	3,174,152	0	0	0	6,885,753	13,397,450
sk	Slovak	7,401,998	0	0	19,222,784	12,734,444	5,134,150	44,493,376
sl	Slovenian	900,221	0	0	19,645,598	12,240,548	17,024,593	49,810,960
sq	Albanian	0	0	0	0	0	2,003,579	2,003,579
sr	Serbian	8,823,894	0	0	0	0	20,776,850	29,600,744
sv	Swedish	8,138,161	0	0	20,585,800	13,840,373	14,693,861	57,258,195
tr	Turkish	0	0	0	0	0	21,190,828	21,190,828
uk	Ukrainian	5,054,034	0	0	0	0	246,059	5,300,093
vi	Vietnamese	0	0	0	0	0	1,473,591	1,473,591
Subtotal		194,055,340	20,769,694	24,676,725	430,195,134	265,029,289	488,372,783	1,423,098,965
cs	Czech	84,718,325	3,416,272	2,315,118	20,303,101	12,922,658	50,688,186	174,363,660
TOTAL		278,773,665	24,185,966	26,991,843	450,498,235	277,951,947	539,060,969	1,597,462,625

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
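
The way the TOTAL row follows from the note above can be checked with a few lines of Python. This is only an arithmetic sketch: the figures are copied from the Subtotal and Czech rows of the table, and the variable names are chosen for illustration.

```python
# Word counts copied from the Subtotal and Czech rows of the table above.
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis", "Europarl", "Subtitles"]
foreign_subtotal = [194_055_340, 20_769_694, 24_676_725, 430_195_134, 265_029_289, 488_372_783]
czech = [84_718_325, 3_416_272, 2_315_118, 20_303_101, 12_922_658, 50_688_186]

# Each Czech text is added once per column, regardless of how many
# foreign versions it is aligned with.
column_totals = {name: f + c for name, f, c in zip(COLUMNS, foreign_subtotal, czech)}
grand_total = sum(column_totals.values())

print(column_totals["Core"])  # 278773665
print(grand_total)            # 1597462625
```

Both printed values agree with the TOTAL row of the table.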

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language	Tags	Lemmas	Brief description	Detailed description	Tool
Bulgarian	✔	in English	TreeTagger
Czech	✔	✔	in Czech in English²⁾	in English	Morče
Dutch	✔	in Dutch	TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	in Estonian and English	TreeTagger
Finnish	✔	✔	in English³⁾	OMorFi+HunPOS
French	✔	✔	in English	TreeTagger
German	✔	✔	in English⁴⁾	in German	RFTagger
Hungarian	✔	in English	HunPos
Icelandic	✔	✔	IceStagger
Italian	✔	✔	in English	TreeTagger
Lithuanian	✔	✔	in Czech and English	in English	Author: Vidas Daudaravičius
Norwegian	✔	✔	in English in Norwegian	analyzer, tagger
Polish	✔	✔	in English in Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	Spanish	TreeTagger
Russian	✔	✔	in English	in English⁵⁾	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	English	totale
Spanish	✔	✔	in English	TreeTagger
Swedish	✔	✔	Stagger

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction and should not have been split. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.
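
As an illustration of the Phrase workaround, the following Python sketch turns an English contracted form into its space-separated parts. The splitting rule is a simplified assumption modeled only on the two English examples above (can't and I'm); the tokenizer actually used for the corpus may split other forms differently, and Polish forms such as byłam are not covered. The function name phrase_for is hypothetical.

```python
import re

# Assumed, simplified rule: a stem followed by "n't" or by an apostrophe clitic.
_CONTRACTION = re.compile(
    r"^(?P<stem>.+?)(?P<clitic>n't|'(?:m|s|re|ve|ll|d))$",
    re.IGNORECASE,
)


def phrase_for(form: str) -> str:
    """Return the form as a space-separated phrase if it looks contracted."""
    match = _CONTRACTION.match(form)
    if match is None:
        return form  # not recognised as a contraction, leave unchanged
    return f"{match.group('stem')} {match.group('clitic')}"


print(phrase_for("can't"))  # ca n't
print(phrase_for("I'm"))    # I 'm
```

The resulting string is what would be typed into the Phrase field in place of the contracted form.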

Morphological tags that include characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must have those characters preceded by a backslash in queries: tag="wp\$".
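
The escaping rule can be sketched in Python as follows. The set of special characters is an assumption (the usual regular-expression metacharacters), and the helper names escape_tag and tag_condition are hypothetical; only the resulting form tag="wp\$" is taken from the text above.

```python
# Characters treated here as regex metacharacters (an assumed, conventional set).
SPECIAL = set("\\.^$*+?()[]{}|")


def escape_tag(tag: str) -> str:
    """Put a backslash in front of every special character in the tag."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in tag)


def tag_condition(tag: str) -> str:
    """Build a tag condition of the form shown in the text above."""
    return 'tag="' + escape_tag(tag) + '"'


print(tag_condition("wp$"))  # tag="wp\$"
```

Tags without special characters pass through unchanged, so the helper can be applied to any tag value.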

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
Short stories My 1989 in a number of languages from Goethe Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing:

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

MorfFlex, Morče and LanGr for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and TaKIPI for Polish
HunPOS for Hungarian and other languages
Tagger for Slovak (thanks to Radovan Garabík)
Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
Tagger for Norwegian (thanks to Pavel Vondřička)
totale for Slovene (thanks to Tomaž Erjavec)
RFTagger for German
OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)

Citing InterCorp

Rosen, A. – Vavřín, M.: Korpus InterCorp – English, German⁶⁾, version 7 from 19 Dec 2014. Ústav Českého národního korpusu FF UK, Praha 2014. Available on-line: http://www.korpus.cz

Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.