InterCorp: Release 5

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC).

After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary in English.
InterCorp is unique in yet another aspect. Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: its size and the number of languages keep growing.

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, called collections. The current selection includes political commentaries published by Project Syndicate and Presseurop, and a package of legal texts, the Acquis Communautaire. Because these texts have been aligned automatically, search results may include a higher number of misaligned segments.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 5 (see Version history) is 91,529,000 words in the aligned foreign-language texts of the core part and 451,112,000 words in the collections. The shares of the core and the collections in the corpus are shown in the following charts, with sizes given in millions of words.

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
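
Because every text is aligned with its single Czech version, two foreign-language versions can be related to each other through the Czech segments they share. The following is a minimal sketch of that idea only. The segment identifiers, the dictionary layout and the function name align_via_pivot are hypothetical and do not reflect the actual InterCorp data format.

```python
# Hypothetical toy data: each Czech segment id maps to the segments aligned
# with it in each foreign-language version. Not real InterCorp identifiers.
cs_to_foreign = {
    "cs:1": {"en": ["en:1"], "de": ["de:1", "de:2"]},
    "cs:2": {"en": ["en:2"], "de": ["de:3"]},
    "cs:3": {"en": ["en:3"]},  # this segment has no German counterpart
}


def align_via_pivot(pivot_links, lang_a, lang_b):
    """Pair segments of lang_a and lang_b that share a Czech pivot segment."""
    pairs = []
    for links in pivot_links.values():
        # Only Czech segments aligned with both languages yield a pair.
        if lang_a in links and lang_b in links:
            pairs.append((links[lang_a], links[lang_b]))
    return pairs


pairs = align_via_pivot(cs_to_foreign, "en", "de")
for en_segments, de_segments in pairs:
    print(en_segments, "<->", de_segments)
```

In this sketch a pair of foreign languages can only be compared where both are aligned with the same Czech segment, so the third toy segment produces no English-German pair.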

Corpus size in thousands of words

Language	Core	Syndicate	Presseurop	Acquis	Total
be	67	0	0	0	67
bg	1 415	0	0	13 816	15 231
da	189	0	0	21 680	21 869
de	12 004	2 567	752	21 723	37 048
el	0	0	0	25 069	25 069
en	5 914	2 567	799	24 207	33 489
es	11 811	2 896	860	27 001	42 570
et	0	0	0	15 962	15 962
fi	2 081	0	0	16 667	18 748
fr	3 217	2 969	874	27 351	34 414
hr	8 103	0	0	0	8 103
hu	1 122	0	0	19 167	20 290
it	3 484	80	793	24 849	29 207
lt	352	0	0	18 432	18 785
lv	1 085	0	0	18 744	19 830
mk	32	0	0	0	32
mt	0	0	0	14 133	14 133
nl	6 486	0	899	24 746	32 132
no	2 301	0	0	0	2 301
pl	8 396	0	710	20 464	29 571
pt	2 127	0	913	28 599	31 640
ro	1 370	0	819	8 199	10 389
ru	1 664	2 304	0	0	3 969
sk	7 257	0	0	19 221	26 479
sl	991	0	0	19 645	20 636
sr	4 295	0	0	0	4 295
sv	5 754	0	0	20 615	26 369
Total	91 528	13 385	7 425	430 300	542 640
cs *)	52 651	2 285	704	20 285	75 926

*) Each Czech text is counted only once, even though it may have more than one foreign counterpart.
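
The figures in the table use a space as the thousands separator, which has to be removed before any arithmetic. As a rough illustration, the sketch below parses three rows copied from the table and compares each printed total with the sum of its four parts. The helper names parse_row and total_minus_parts are made up for this example. For some rows the printed total differs slightly from the sum of the parts; the source does not state why.

```python
# Three rows copied from the table above (thousands of words).
# Columns: language, Core, Syndicate, Presseurop, Acquis, Total.
ROWS = [
    ("de", "12 004", "2 567", "752", "21 723", "37 048"),
    ("en", "5 914", "2 567", "799", "24 207", "33 489"),
    ("hr", "8 103", "0", "0", "0", "8 103"),
]


def parse_row(row):
    """Convert the space-grouped figures of one row to integers."""
    lang, *cells = row
    values = [int(cell.replace(" ", "")) for cell in cells]
    *parts, total = values
    return lang, parts, total


def total_minus_parts(rows):
    """Return, per language, the printed total minus the sum of the parts."""
    diffs = {}
    for row in rows:
        lang, parts, total = parse_row(row)
        diffs[lang] = total - sum(parts)
    return diffs


diffs = total_minus_parts(ROWS)
print(diffs)
```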

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

language	tags	lemmas	brief description	detailed description	tool
Bulgarian	✔	in English	TreeTagger
Czech	✔	✔	in Czech in English¹⁾	in English	Morče
Dutch	✔	TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	Estonian and English	TreeTagger
French	✔	✔	in English	TreeTagger
German	✔	✔	in German	in German	TreeTagger
Hungarian	✔	in English	HunPos
Italian	✔	✔	in English	TreeTagger
Lithuanian	✔	✔	in Czech and English	Vidas Daudaravičius
Norwegian	✔	✔	in English in Norwegian	analyzer, tagger
Polish	✔	✔	in English in Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	Spanish	TreeTagger
Russian	✔	✔	in English	in English²⁾	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	English	totale
Spanish	✔	✔	in English	TreeTagger

See Park Manual for advice on the use of tags in queries.

Please note: work in progress

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing

Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit
Aligner Hunalign

Taggers/lemmatizers:

Morče for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, German, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and TaKIPI for Polish
HunPos for Hungarian
Tagger for Slovak (thanks to Radovan Garabík)
Tagger for Lithuanian (thanks to Hanka Skoumalová)
Analyzer and tagger for Norwegian (thanks to Pavel Vondřička)
totale for Slovene (thanks to Tomaž Erjavec)

Corpus Query Engine:

Manatee

Data:

Newspaper articles in a number of languages from the site Project Syndicate
Slovak-Czech concordances from the Slovak National Corpus
Short stories in a number of languages ("My 1989") from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus from Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus (in prep.)
Newspaper texts from the Presseurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober

Last update: 20 July 2012