InterCorp Release 9

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	120,443,181	117,981,673	278,445,878	1,556,840,965
Positions	Number of word forms	96,956,714	89,645,545	231,501,606	1,228,896,294
Structural attributes	Number of documents	1,430	5	2,934	89
	Number of div	1,430	111,263	2,934	1,849,184
	Number of sentences	8,308,814	13,588,082	17,210,601	143,478,514
Further information	reference	YES
	representative	NO
	publication date	2016
	foreign languages	39
	tagged languages	23
	lemmatized languages	20

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed in a standard web browser via KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary is also available in English.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files containing shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release the corpus may grow in size, and possibly also in the number of languages and in the extent and quality of annotation. Previous versions remain available (starting with release 6).

References

If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. In scientific publications, please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M.: Korpus InterCorp – English, German¹⁾, version 7 of 19 Dec 2014. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes:

Political commentaries published by Project Syndicate and VoxEurop (formerly PressEurop)
A package of legal texts of the European Union from the Acquis Communautaire corpus
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
Film subtitles from the Open Subtitles database

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. Among the texts left out are those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections

Corpus size in thousands of words

Language		Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Total
ar	Arabic	34	0	0	0	0	0	34
be	Belarusian	3,025	0	0	0	0	0	3,025
bg	Bulgarian	6,007	0	0	13,816	9,083	0	28,907
ca	Catalan	4,632	0	0	0	0	0	4,632
da	Danish	3,556	0	0	21,679	13,915	14,429	53,581
de	German	31,168	3,725	2,482	21,723	13,089	8,366	80,556
el	Greek	0	0	0	25,069	15,403	23,714	64,187
en	English	21,208	3,818	2,670	24,207	15,580	52,101	119,586
es	Spanish	19,310	4,324	2,816	27,001	15,885	36,378	105,716
et	Estonian	0	0	0	15,962	10,899	10,296	37,158
fi	Finnish	3,645	0	0	16,455	10,175	15,097	45,373
fr	French	12,406	4,393	2,928	27,351	17,178	25,961	90,219
he	Hebrew	0	0	0	0	0	16,221	16,221
hi	Hindi	408	0	0	0	0	0	408
hr	Croatian	19,980	0	0	0	0	19,042	39,023
hu	Hungarian	5,818	0	0	19,176	12,306	21,239	58,541
is	Icelandic	0	0	0	0	0	1,584	1,584
it	Italian	8,694	651	2,707	24,849	15,489	14,653	67,046
ja	Japanese	0	0	0	0	0	113	113
lt	Lithuanian	358	0	0	18,392	11,212	557	30,521
lv	Latvian	1,666	0	0	24,667	13,895	381	40,609
mk	Macedonian	4,663	0	0	0	0	1,877	6,540
ms	Malay	0	0	0	0	0	3,520	3,520
mt	Maltese	0	0	0	14,133	0	0	14,133
nl	Dutch	11,444	314	2,955	24,746	15,563	29,362	84,386
no	Norwegian	4,965	0	0	0	0	0	4,965
pl	Polish	21,433	0	2,378	20,627	12,	26,572	83,822
pt	Portuguese	2,605	369	2,999	28,602	16,484	43,391	94,454
rn	Romani	5	0	0	0	0	0	5
ro	Romanian	3,432	0	2,737	8,199	9,446	34,128	57,944
ru	Russian	4,788	3,174	0	0	0	6,885	14,848
sk	Slovak	8,066	0	0	19,222	12,734	5,134	45,158
sl	Slovenian	2,057	0	0	19,645	12,240	17,024	50,968
sq	Albanian	0	0	0	0	0	2,003	2,003
sr	Serbian	9,886	0	0	0	0	20,720	30,607
sv	Swedish	8,959	0	0	20,585	13,840	14,693	58,079
tr	Turkish	0	0	0	0	0	21,190	21,190
uk	Ukrainian	7,597	0	0	0	0	246	7,843
vi	Vietnamese	0	0	0	0	0	1,473	1,473
Subtotal		231,501	20,769	24,676	430,160	265,022	488,266	1,460,397
cs	Czech	96,956	3,416	2,315	20,303	12,922	50,688	186,602
TOTAL		328,458	24,186	26,991	450,463	277,945	538,954	1,647,000

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
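
The relation between the last three rows of the table can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that copies the Subtotal, Czech and TOTAL rows from the table above and verifies that TOTAL equals Subtotal plus Czech in every column. The function name `total_discrepancies` is hypothetical, and the tolerance of one thousand words is an assumption that allows for the figures having been rounded to thousands independently.

```python
# Rows copied from the table above (thousands of words).
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis",
           "Europarl", "Subtitles", "Total"]
SUBTOTAL = [231_501, 20_769, 24_676, 430_160, 265_022, 488_266, 1_460_397]
CZECH = [96_956, 3_416, 2_315, 20_303, 12_922, 50_688, 186_602]
TOTAL = [328_458, 24_186, 26_991, 450_463, 277_945, 538_954, 1_647_000]


def total_discrepancies(subtotal, czech, total):
    """Return TOTAL - (Subtotal + Czech) for each column."""
    return [t - (s + c) for s, c, t in zip(subtotal, czech, total)]


diffs = total_discrepancies(SUBTOTAL, CZECH, TOTAL)
for name, diff in zip(COLUMNS, diffs):
    # A difference of at most 1 is treated here as a rounding effect.
    status = "ok" if abs(diff) <= 1 else "check"
    print(f"{name:<11}{diff:+d}  {status}")
```

Because the Czech texts are added only once, the TOTAL row is smaller than it would be if each Czech text were counted separately for every foreign counterpart.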

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language	Tags	Lemmas	Brief description	Detailed description	Tool
Bulgarian	✔			in English	TreeTagger
Croatian	✔	✔	in English		ReLDI Tagger
Czech	✔	✔	in Czech and in English²⁾	in English	Morče
Dutch	✔		in English	in Dutch	TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	in Estonian and English		TreeTagger
Finnish	✔	✔		in English³⁾	OMorFi+HunPOS
French	✔	✔	in English		TreeTagger
German	✔	✔	in English⁴⁾	in German	RFTagger
Hungarian	✔			in English	HunPos
Icelandic	✔	✔	in English		IceStagger
Italian	✔	✔	in English		TreeTagger
Latvian	✔	✔	in Latvian		LVTagger
Lithuanian	✔	✔	in Czech and English	in English	Author: Vidas Daudaravičius
Norwegian	✔	✔	in English and Norwegian		VISL
Polish	✔	✔	in English and Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	in Spanish		TreeTagger
Russian	✔	✔	in English	in English ⁵⁾	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	in English and Slovene	in English	ToTaLe
Serbian	✔	✔	in English		ReLDI Tagger
Spanish	✔	✔	in English		TreeTagger
Swedish	✔	✔	in Swedish and English		Stagger

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. Because only the individual parts of a contracted form are assigned a tag and a lemma, a query intended to find the whole contracted form should be entered as a Phrase, with the split parts separated by a space.
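
The splitting described above can be illustrated with a short sketch in Python. The split points are the ones given in the text; the helper names `phrase_text` and `cql_sequence` are hypothetical. The CQL form, one bracketed token position per part, is an assumption about how the same search might be written as an advanced query and is not taken from the KonText documentation.

```python
# Split points as given in the text above.
SPLITS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_text(parts):
    """Text to enter as a Phrase query: the split parts separated by a space."""
    return " ".join(parts)


def cql_sequence(parts, attr="word"):
    """Assumed CQL equivalent: one token position per split part."""
    return " ".join(f'[{attr}="{part}"]' for part in parts)


for form, parts in SPLITS.items():
    print(f"{form}: {phrase_text(parts)} | {cql_sequence(parts)}")
```

For can't, the sketch yields the Phrase text `ca n't` and the sequence `[word="ca"] [word="n't"]`.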

In morphological tags, characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must be preceded in queries by a backslash: tag="wp\$".
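
When queries are generated by a script, the backslash can be added automatically. The sketch below uses Python's standard `re.escape` for this; the helper name `tag_query` is hypothetical. Note the assumption: `re.escape` follows Python's regular expression rules, which may escape more characters than the corpus manager strictly requires, so the output should be checked for tags containing other punctuation.

```python
import re


def tag_query(tag):
    """Build a tag condition with regex-special characters backslash-escaped."""
    return f'[tag="{re.escape(tag)}"]'


print(tag_query("wp$"))  # prints [tag="wp\$"]
```

A tag without special characters passes through unchanged, so the helper can be applied to every tag uniformly.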

Structural attributes

Structure	Attribute	Description	Values
doc	doc.id	unique document identifier	text
	doc.lang	language	ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
	doc.version	version	number
	doc.wordcount	document size in words	number
div	div.id	text identification	author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE
	div.group	part of the corpus the text belongs to	Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate
	div.wordcount	number of words	number
	div.author	author	last name, first name
	div.title	full title	text
	div.publisher	publisher	text
	div.pubplace	publication place	text
	div.pubyear	publication year	date
	div.txtype	text type	discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles
	div.original	is the text an original?	Yes / No
	div.srclang	language of the original	ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
	div.translator	translator	last name, first name
	div.transsex	translator's gender	F / M
	div.authsex	author's gender	F / M
p	p.id	unique paragraph identifier	text
s	s.id	unique sentence identifier	text
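
Structural attributes can be used to restrict a search to part of the corpus. The following Python sketch appends such a restriction to an existing query. It assumes the CQL `within <structure attribute="value" />` syntax used by the corpus manager behind KonText; the helper name `restrict_to_div` is hypothetical, and the attribute names are those of the div structure in the table above, written without the `div.` prefix.

```python
# Attributes of the div structure, taken from the table above.
DIV_ATTRIBUTES = {
    "id", "group", "wordcount", "author", "title", "publisher",
    "pubplace", "pubyear", "txtype", "original", "srclang",
    "translator", "transsex", "authsex",
}


def restrict_to_div(query, attribute, value):
    """Append an assumed CQL 'within' restriction on one div attribute."""
    if attribute not in DIV_ATTRIBUTES:
        raise ValueError(f"unknown div attribute: {attribute}")
    return f'{query} within <div {attribute}="{value}" />'


# Search only the manually aligned core, then only fiction.
print(restrict_to_div('[word="dům"]', "group", "Core"))
print(restrict_to_div('[word="dům"]', "txtype", "fiction"))
```

Restricting to `group="Core"` is one way to avoid the automatically aligned collections when alignment quality matters.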

Acknowledgements

We are grateful for the opportunity to use the following texts and software:

Texts:

Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
Short stories in a number of languages from the collection My 1989 by the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing:

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

MorfFlex, Morče and LanGr for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and TaKIPI for Polish
HunPOS for Hungarian and other languages
Tagger for Slovak (thanks to Radovan Garabík)
Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
Tagger for Norwegian (thanks to Pavel Vondřička)
ToTaLe for Slovene (thanks to Tomaž Erjavec)
RFTagger for German
OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)
ReLDI Tagger for Croatian and Serbian (thanks to Nikola Ljubešić)
LVTagger for Latvian (thanks to Pēteris Paikens and Michal Škrabal)