InterCorp: Release 6

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	76 861 107	46 880 365	167 141 155	890 129 077
Positions	Number of word forms	61 962 499	37 584 764	138 762 949	728 507 959
Structural attributes	Number of documents	996	4	1 939	56
	Number of div	996	96 988	1 939	1 728 492
	Number of sentences	5 254 361	2 392 808	10 283 732	44 113 753
Further information	reference	YES
	representative	NO
	publication date	2013
	foreign languages	31
	tagged languages	17
	lemmatized languages	14

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser in two ways:

From KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary is also available in English.
From Park, a purpose-built interface. A brief user manual is available here.

Both search interfaces are based on the Manatee corpus engine and access identical texts. Park can also be used to search the previous version of the corpus.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. If you are interested, please contact us at the address below.
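
As a rough illustration of what "shuffled pairs of sentences" could look like, the Python sketch below randomizes the order of aligned sentence pairs while keeping each pair intact. The function name, the seed parameter, and the sample sentences are invented for this example; the actual format and procedure used for the distributed files are not described on this page.

```python
import random

def shuffle_pairs(pairs, seed=None):
    """Return aligned sentence pairs in random order.

    Each (czech, foreign) pair stays together; only the order of the
    pairs changes. Hypothetical helper, not part of any InterCorp tool.
    """
    shuffled = list(pairs)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Invented sample data for illustration only.
pairs = [
    ("Dobrý den.", "Good day."),
    ("Jak se máte?", "How are you?"),
    ("Děkuji.", "Thank you."),
]

for czech, english in shuffle_pairs(pairs, seed=1):
    print(czech, english, sep="\t")
```

The alignment within each pair is preserved, so the result can still be used for bilingual lookup, but the original running text is no longer available in reading order.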

Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: with each new release, its size, the number of languages, and the extent and quality of annotation may grow.

References

If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider citing the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3):411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see here. Additional references to work using InterCorp are welcome. Please let us know at the e-mail address below.

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The current choice includes political commentaries published by Project Syndicate and Presseurop, a package of legal texts of the European Union from the Acquis Communautaire corpus, and proceedings of the European Parliament dated 2007–2011 from the Europarl corpus. These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. Moreover, even some core texts in the current release 6 are temporarily aligned only automatically, without manual checking. This concerns some of the texts acquired from ASPAC – Amsterdam Slavic Parallel Aligned Corpus. The alignment of these texts will be checked and corrected before the next release.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign-language texts in the core part and 728,508,000 in the collections (see Version history). The shares of the core and the collections in the corpus are shown in the following charts, with sizes given in millions of words.

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
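
The pivot arrangement described above (one Czech version per text, aligned with one or more foreign-language versions) can be pictured with a small Python sketch. This is an illustration only, not InterCorp's internal format: the class, the function, and the sample sentences are all hypothetical. It shows one consequence of using a pivot, namely that two foreign versions can be paired with each other through the Czech segments they share.

```python
from dataclasses import dataclass, field

@dataclass
class PivotText:
    """One text: a single Czech version plus aligned foreign versions."""
    title: str
    czech: list                      # Czech segments, in text order
    # language code -> list of (index of Czech segment, foreign segment)
    foreign: dict = field(default_factory=dict)

def align_via_pivot(text, lang_a, lang_b):
    """Pair segments of two foreign versions that share a Czech segment."""
    by_czech_b = {}
    for cz_idx, seg_b in text.foreign[lang_b]:
        by_czech_b.setdefault(cz_idx, []).append(seg_b)
    pairs = []
    for cz_idx, seg_a in text.foreign[lang_a]:
        for seg_b in by_czech_b.get(cz_idx, []):
            pairs.append((seg_a, seg_b))
    return pairs

# Invented sample data for illustration only.
sample = PivotText(
    title="Sample",
    czech=["Dobrý den.", "Jak se máte?"],
    foreign={
        "en": [(0, "Good day."), (1, "How are you?")],
        "de": [(0, "Guten Tag."), (1, "Wie geht es Ihnen?")],
    },
)

for en_seg, de_seg in align_via_pivot(sample, "en", "de"):
    print(en_seg, de_seg, sep="\t")
```

Because every foreign version points to the same Czech segments, no direct English-German alignment has to be stored in this sketch; it is derived on demand.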

Corpus size in thousands of words

Language Code	Language	Core	Syndicate	Presseurop	Acquis	Europarl	Total
ar	Arabic	29	0	0	0	0	29
be	Belarusian	1 308	0	0	0	0	1 308
bg	Bulgarian	3 979	0	0	13 816	9 083	26 879
ca	Catalan	1 758	0	0	0	0	1 758
da	Danish	190	0	0	21 680	13 916	35 785
de	German	17 256	3 050	1 715	21 724	13 089	56 835
el	Greek	210	0	0	25 070	15 404	40 683
en	English	10 019	3 083	1 863	24 208	15 580	54 753
es	Spanish	14 552	3 479	1 948	27 001	15 885	62 865
et	Estonian	0	0	0	15 963	10 900	26 862
fi	Finnish	2 131	0	0	16 667	10 241	29 040
fr	French	3 816	3 535	2 054	27 352	17 178	53 936
hi	Hindi	155	0	0	0	0	155
hr	Croatian	12 625	0	0	0	0	12 625
hu	Hungarian	2 511	0	0	19 168	12 307	33 985
it	Italian	4 081	247	1 893	24 850	15 489	46 560
lt	Lithuanian	358	0	0	18 433	11 020	29 811
lv	Latvian	1 337	0	0	18 745	11 689	31 770
mk	Macedonian	2 664	0	0	0	0	2 664
mt	Maltese	0	0	0	14 133	0	14 133
nl	Dutch	9 426	0	2 082	24 746	15 563	51 817
no	Norwegian	2 301	0	0	0	0	2 301
pl	Polish	12 710	0	1 660	20 464	12 805	47 640
pt	Portuguese	2 318	0	2 103	28 599	16 481	49 502
ro	Romanian	2 433	0	1 917	8 200	9 446	21 995
ru	Russian	4 937	2 651	0	0	0	7 588
sk	Slovak	8 152	0	0	19 222	12 734	40 108
sl	Slovene	1 855	0	0	19 646	12 241	33 741
sr	Serbian	6 972	0	0	0	0	6 972
sv	Swedish	7 205	0	0	20 615	13 874	41 694
uk	Ukrainian	1 493	0	0	0	0	1 493
Total		138 779	16 044	17 237	430 300	264 926	867 287
cs	Czech	61 962	2 741	1 639	20 285	12 920	99 547

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
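
When reusing figures from the table, it may help to check a row total against its components. The Python sketch below does this for a few rows copied from the table (in thousands of words). The tolerance of one unit is an assumption: it presumes that each figure was rounded to thousands independently, which would explain why some listed totals differ by 1 from the sum of the listed parts. Note also that the Czech row is reported separately and is not part of the "Total" row.

```python
# Rows copied from the table above, in thousands of words:
# (Core, Syndicate, Presseurop, Acquis, Europarl), listed Total
ROWS = {
    "de": ((17256, 3050, 1715, 21724, 13089), 56835),
    "en": ((10019, 3083, 1863, 24208, 15580), 54753),
    "Total": ((138779, 16044, 17237, 430300, 264926), 867287),
    "cs": ((61962, 2741, 1639, 20285, 12920), 99547),
}

def total_matches(parts, listed, tolerance=1):
    """True if the listed total equals the sum of parts within the tolerance.

    The tolerance is an assumption made for this example (independent
    rounding of each figure to thousands).
    """
    return abs(sum(parts) - listed) <= tolerance

for code, (parts, listed) in ROWS.items():
    print(code, sum(parts), listed, total_matches(parts, listed))
```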

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language	Tags	Lemmas	Brief description	Detailed description	Tool
Bulgarian	✔		in English		TreeTagger
Czech	✔	✔	in Czech, in English	in English	Morče
Dutch	✔		in Dutch		TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	in Estonian and English		TreeTagger
French	✔	✔	in English		TreeTagger
German	✔	✔	in German		TreeTagger
Hungarian	✔		in English		HunPos
Italian	✔	✔	in English		TreeTagger
Lithuanian	✔	✔	in Czech and English	in English	Author: Vidas Daudaravičius
Norwegian	✔	✔	in English, in Norwegian		analyzer, tagger
Polish	✔	✔	in English, in Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	in Spanish		TreeTagger
Russian	✔	✔	in English	in English	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	in English		totale
Spanish	✔	✔	in English		TreeTagger

See Park Manual for advice on the use of tags in queries.

Problems, comments, suggestions

… on the content of the corpus and on the search interfaces are welcome at

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
The My 1989 short stories in a number of languages, from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus (in prep.)
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober

Pre-processing:

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

Morče for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, German, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and TaKIPI for Polish
HunPos for Hungarian
Tagger for Slovak (thanks to Radovan Garabík)
Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
Tagger for Norwegian (thanks to Pavel Vondřička)
totale for Slovene (thanks to Tomaž Erjavec)

Corpus Query Engine:

Manatee

Last update: 2 February 2014