InterCorp: Release 7

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	95 814 527	116 374 744	208 845 922	1 546 493 833
Positions	Number of word forms	77 121 760	88 303 155	173 224 560	1 216 880 655
Structural attributes	Number of documents	1 184	5	2 294	87
	Number of div	1 184	107 388	2 294	1 817 043
	Number of sentences	6 595 174	13 497 188	12 796 035	142 788 867
Further information	reference	YES
	representative	NO
	publication date	2014
	foreign languages	38
	tagged languages	20
	lemmatized languages	17

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial is available in Czech and a brief summary also in English.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
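As a rough illustration of what "shuffled pairs of sentences" means, the sketch below randomises the order of aligned sentence pairs while keeping each pair intact, so the alignment survives but the original running text cannot be read off. The function name, the sample sentences and the in-memory list format are assumptions made for this example only; the actual format of the distributed bilingual files is not described here.

```python
import random


def shuffle_pairs(pairs, seed=None):
    """Return the aligned pairs in random order; each pair stays intact."""
    rng = random.Random(seed)   # local generator, so global state is untouched
    shuffled = list(pairs)      # copy: the input order is left unchanged
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical Czech-English sentence pairs (not taken from the corpus).
aligned = [
    ("První věta.", "First sentence."),
    ("Druhá věta.", "Second sentence."),
    ("Třetí věta.", "Third sentence."),
]

shuffled = shuffle_pairs(aligned, seed=0)
for czech, english in shuffled:
    print(czech, "|", english)
```

Shuffling whole pairs rather than the two sides independently is the point of the design: every Czech sentence still sits next to its own translation.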

A new release of InterCorp is published in most years. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (in Park starting from release 5, in the other interfaces from release 6).

References

If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider citing the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3):411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references, see here or the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome; see here for details.

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. In release 7, the collections are Project Syndicate, Presseurop, Acquis Communautaire, Europarl and Open Subtitles.

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; in particular, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items that are missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign language texts in the core part and 728,508,000 in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following chart. The size is shown in millions of words.
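Because every foreign-language version is aligned with the same Czech version, two foreign languages can be related to each other through Czech. The sketch below illustrates this idea under simplifying assumptions: the segment identifiers (`cs:1`, `en:1`, ...), the dictionary layout and the function `link_via_pivot` are hypothetical and are not the corpus's actual data format or tooling.

```python
def link_via_pivot(cs_to_a, cs_to_b):
    """Pair up segments of languages A and B that share a Czech segment.

    Each argument maps a Czech segment id to the list of segments
    aligned with it in one foreign language.
    """
    linked = []
    for cs_id, a_segments in cs_to_a.items():
        b_segments = cs_to_b.get(cs_id)
        if b_segments:  # skip Czech segments with no counterpart in B
            linked.append((a_segments, b_segments))
    return linked


# Hypothetical alignments of one Czech text with English and German versions.
cs_to_en = {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"], "cs:3": ["en:4"]}
cs_to_de = {"cs:1": ["de:1"], "cs:2": ["de:2"]}

en_de = link_via_pivot(cs_to_en, cs_to_de)
for en_segments, de_segments in en_de:
    print(en_segments, "<->", de_segments)
```

In this toy example the third Czech segment has no German counterpart, so it yields no English-German link. An indirect link of this kind can only be as good as the two alignments with Czech that it is built from.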

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections

Corpus size in thousands of words

Language Code	Language	Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Total
ar	Arabic	34	0	0	0	0	0	34
be	Belarusian	1,751	0	0	0	0	0	1,751
bg	Bulgarian	4,923	0	0	13,816	9,083	0	27,823
ca	Catalan	4,498	0	0	0	0	0	4,498
da	Danish	1,311	0	0	21,680	13,916	14,430	51,336
de	German	26,315	3,050	1,715	21,724	13,089	8,367	74,260
el	Greek	0	0	0	25,070	15,404	23,715	64,188
en	English	12,641	3,083	1,863	24,208	15,580	52,101	109,476
es	Spanish	16,907	3,479	1,948	27,001	15,885	36,379	101,599
et	Estonian	0	0	0	15,963	10,900	10,296	37,158
fi	Finnish	3,054	0	0	16,455	10,175	15,098	44,782
fr	French	6,976	3,535	2,054	27,352	17,178	25,962	83,057
he	Hebrew	0	0	0	0	0	16,221	16,221
hi	Hindi	206	0	0	0	0	0	206
hr	Croatian	14,210	0	0	0	0	19,093	33,303
hu	Hungarian	4,014	0	0	19,177	12,307	21,240	56,737
is	Icelandic	0	0	0	0	0	1,585	1,585
it	Italian	6,313	247	1,893	24,849	15,489	14,654	63,446
ja	Japanese	0	0	0	0	0	113	113
lt	Lithuanian	358	0	0	18,393	11,213	558	30,522
lv	Latvian	1,337	0	0	18,745	11,689	280	32,051
mk	Macedonian	3,221	0	0	0	0	1,877	5,098
ms	Malay	0	0	0	0	0	3,521	3,521
mt	Maltese	0	0	0	14,133	0	0	14,133
nl	Dutch	9,370	0	2,082	24,746	15,563	29,363	81,125
no	Norwegian	4,103	0	0	0	0	0	4,103
pl	Polish	16,009	0	1,662	20,628	12,811	26,572	77,683
pt	Portuguese	2,393	0	2,103	28,603	16,485	43,392	92,976
ro	Romanian	3,156	0	1,917	8,200	9,446	34,129	56,847
ru	Russian	3,308	2,651	0	0	0	6,886	12,844
sk	Slovak	7,402	0	0	19,223	12,734	5,134	44,493
sl	Slovene	900	0	0	19,646	12,241	17,025	49,811
sq	Albanian	0	0	0	0	0	2,004	2,004
sr	Serbian	8,413	0	0	0	0	20,777	29,189
sv	Swedish	7,789	0	0	20,586	13,840	14,694	56,909
tr	Turkish	0	0	0	0	0	21,191	21,191
uk	Ukrainian	2,310	0	0	0	0	246	2,556
vi	Vietnamese	0	0	0	0	0	1,474	1,474
Subtotal		173,225	16,044	17,239	430,195	265,029	488,373	1,390,105
cs	Czech	77,122	2,749	1,640	20,303	12,923	50,688	165,425
Total		250,346	18,793	18,880	450,498	277,952	539,061	1,555,530

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
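The summary rows of the table can be cross-checked with simple arithmetic. The sketch below copies the Subtotal, Czech and Total rows from the table (in thousands of words) and verifies that Subtotal plus Czech gives Total in every column. A tolerance of 1 is an assumption made here to absorb rounding, since each figure is rounded to thousands separately.

```python
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis",
           "Europarl", "Subtitles", "Total"]

# Rows copied from the table above (thousands of words).
subtotal = [173_225, 16_044, 17_239, 430_195, 265_029, 488_373, 1_390_105]
czech = [77_122, 2_749, 1_640, 20_303, 12_923, 50_688, 165_425]
total = [250_346, 18_793, 18_880, 450_498, 277_952, 539_061, 1_555_530]


def check_totals(subtotal, czech, total, tolerance=1):
    """Return the columns where subtotal + Czech is off by more than tolerance."""
    return [
        name
        for name, s, c, t in zip(COLUMNS, subtotal, czech, total)
        if abs(s + c - t) > tolerance
    ]


mismatches = check_totals(subtotal, czech, total)

# Size of the collections alone: the five columns between Core and Total.
foreign_collections = sum(subtotal[1:6])
czech_collections = sum(czech[1:6])

print("columns that do not add up:", mismatches)
print("foreign collections:", foreign_collections)
print("Czech collections:", czech_collections)
```

The two collection sums, 1,216,880 and 88,303 thousand words, agree with the word-form counts for the collections given in the overview at the top of the page (1 216 880 655 and 88 303 155).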

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language	Tags	Lemmas	Brief description	Detailed description	Tool
Bulgarian	✔	in English	TreeTagger
Czech	✔	✔	in Czech in English	in English	Morče
Dutch	✔	in Dutch	TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	in Estonian and English	TreeTagger
Finnish	✔	✔	in English	OMorFi+HunPOS
French	✔	✔	in English	TreeTagger
German	✔	✔	in English	in German	RFTagger
Hungarian	✔	in English	HunPos
Icelandic	✔	✔	IceStagger
Italian	✔	✔	in English	TreeTagger
Lithuanian	✔	✔	in Czech and English	in English	Author: Vidas Daudaravičius
Norwegian	✔	✔	in English in Norwegian	analyzer, tagger
Polish	✔	✔	in English in Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	Spanish	TreeTagger
Russian	✔	✔	in English	in English	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	English	totale
Spanish	✔	✔	in English	TreeTagger
Swedish	✔	✔	Stagger

See Park Manual for advice on the use of tags in queries.

Problems, comments, suggestions

Problems, comments and suggestions concerning the content of the corpus and the search interfaces are welcome at

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
Short stories My 1989 in a number of languages from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

MorfFlex, Morče and LanGr for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and TaKIPI for Polish
HunPOS for Hungarian and other languages
Tagger for Slovak (thanks to Radovan Garabík)
Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
Tagger for Norwegian (thanks to Pavel Vondřička)
totale for Slovene (thanks to Tomaž Erjavec)
RFTagger for German
OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)

Corpus Query Engine:

Last update: 19 December 2014
