InterCorp: Release 3

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial is available in Czech and a brief summary also in English.
InterCorp is unique in yet another aspect. Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: both its size and the number of languages keep growing.

Texts in the corpus

The bulk of InterCorp consists of fiction in Czech and other languages, semi-automatically aligned, and a selection of political commentaries published by Project Syndicate. The currently available Czech, English, French, German, Russian and Spanish issues, dated 2000-2008, will be followed by more recent texts in future releases of the corpus. The Project Syndicate texts have been aligned fully automatically, so search results from them may include more misaligned segments than results from the fiction texts.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The size of InterCorp for all languages is given in the following table. The counts include the Project Syndicate data, which amount to approximately 1.5 to 2 million words for a given language (as of February 2011, in InterCorp v. 3, see Version history).

language	number of words (in thousands)	number of texts
Bulgarian	1,135	15
Croatian	6,735	96
Danish	190	5
Dutch	3,914	58
English	5,695	Syndicate + 49
Finnish	1,247	19
French	3,141	Syndicate + 21
German	8,846	Syndicate + 100
Hungarian	1,123	17
Italian	2,817	28
Latvian	1,085	33
Lithuanian	353	17
Norwegian	2,158	21
Polish	4,716	80
Portuguese	1,312	18
Romanian	671	5
Russian	2,951	Syndicate + 25
Serbian	1,724	27
Slovak	6,899	138
Slovene	992	16
Spanish	10,905	Syndicate + 108
Swedish	3,673	47
TOTAL	72,280	943

Each Czech text is counted only once, even though it may have more than one foreign counterpart.

language	number of words (in thousands)	number of texts
Czech	41,340	Syndicate + 652
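
The pivot arrangement described above can be illustrated with a minimal sketch. It assumes a toy representation in which each foreign version is stored as a mapping from Czech segment ids to the ids of the aligned foreign segments; the segment ids, the dictionaries and the function name align_via_pivot are hypothetical and do not reflect the actual InterCorp data format. The sketch shows two consequences of the design: two foreign versions can be related to each other through their shared Czech segments, and a Czech text with several foreign counterparts is still counted once.

```python
# Hypothetical toy data: each mapping goes from a Czech (pivot) segment id
# to the ids of the aligned segments in one foreign-language version.
cs_to_en = {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"], "cs:3": ["en:4"]}
cs_to_de = {"cs:1": ["de:1", "de:2"], "cs:2": ["de:3"], "cs:3": []}


def align_via_pivot(pivot_to_a, pivot_to_b):
    """Pair segments of two foreign versions that share a Czech segment."""
    pairs = []
    for cs_id, a_segs in pivot_to_a.items():
        b_segs = pivot_to_b.get(cs_id, [])
        # Keep only Czech segments that have a counterpart in both versions.
        if a_segs and b_segs:
            pairs.append((cs_id, a_segs, b_segs))
    return pairs


en_de = align_via_pivot(cs_to_en, cs_to_de)

# Counting texts: one entry per (Czech text, foreign language) version.
versions = [("Text A", "en"), ("Text A", "de"), ("Text B", "en")]
foreign_texts = len(versions)                        # every version counts
czech_texts = len({title for title, _ in versions})  # each Czech text once

print(en_de)
print(foreign_texts, czech_texts)
```

In this toy example the Czech segment cs:3 has no German counterpart, so it yields no English-German pair, and the three foreign versions correspond to only two Czech texts.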

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

language	tags	lemmas	brief description	detailed description	tool
English	✔	✔	in English	in English + additions	TreeTagger
Bulgarian	✔	in English	TreeTagger
Czech	✔	✔	in Czech, in English¹⁾	in English	Morče
French	✔	✔	in English	TreeTagger
Italian	✔	✔	in English	TreeTagger
Lithuanian	✔	✔	in Czech and English	Vidas Daudaravičius
Hungarian	✔	in English	HunPos
German	✔	✔	in German	in German	TreeTagger
Dutch	✔	TreeTagger
Norwegian	✔	✔	analyzer, tagger
Polish	✔	✔	in English, in Polish	in English	Morfeusz, TaKIPI
Russian	✔	✔	in English	in English²⁾	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Spanish	✔	✔	in English	TreeTagger

See Park Manual for advice on the use of tags in queries.
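
As a rough illustration of how tags and lemmas might be combined in a query, the sketch below builds token expressions in the CQL style used by the Manatee engine. It is only a sketch under assumptions: the attribute names tag and lemma, the helper names token and sequence, and the Penn-style tag pattern JJ.* for English adjectives are illustrative, and the syntax actually accepted by the search interface should be checked in the manual.

```python
def token(**attrs):
    """Build one CQL-style token expression, e.g. [lemma="go" & tag="V.*"].

    Attribute names are assumptions; values are inserted as-is, so any
    double quotes inside a value would need escaping by the caller.
    """
    if not attrs:
        return "[]"  # matches any single token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# An adjective (assumed tag pattern) directly followed by the lemma "house".
query = sequence(token(tag="JJ.*"), token(lemma="house"))
print(query)  # [tag="JJ.*"] [lemma="house"]
```

Since each language is tagged with a different tool, a tag pattern written for one language will generally not carry over to another.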

Please note: work in progress

The search interface Park is under continuous development. You are therefore likely to run into inconveniences of various sorts, or to miss features available in monolingual concordancers. Error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing:

Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit
Aligner Hunalign
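
To show how the output of a sentence aligner can be turned into aligned segments, here is a minimal sketch. It assumes the aligner writes a "ladder" file, as Hunalign does by default to the best of our knowledge: one rung per line with a source sentence index, a target sentence index and a score, separated by tabs, where the sentences between two consecutive rungs form one aligned pair. The function names, the sample sentences and the choice to attach the score of the lower rung to each pair are assumptions made for illustration; this is not the pipeline actually used for InterCorp.

```python
def parse_ladder(text):
    """Parse ladder lines 'src_index<TAB>tgt_index<TAB>score' into tuples."""
    rungs = []
    for line in text.strip().splitlines():
        src, tgt, score = line.split("\t")
        rungs.append((int(src), int(tgt), float(score)))
    return rungs


def beads(rungs, src_sents, tgt_sents):
    """Turn consecutive rungs into (source sentences, target sentences, score)."""
    result = []
    for (s0, t0, score), (s1, t1, _) in zip(rungs, rungs[1:]):
        # Sentences from one rung up to (not including) the next form a pair.
        result.append((src_sents[s0:s1], tgt_sents[t0:t1], score))
    return result


# Hypothetical sample: three Czech sentences aligned with four English ones.
ladder = "0\t0\t0.9\n1\t1\t0.5\n2\t3\t0.7\n3\t4\t0.0\n"
cs = ["A.", "B.", "C."]
en = ["a.", "b1.", "b2.", "c."]

aligned = beads(parse_ladder(ladder), cs, en)
for pair in aligned:
    print(pair)
```

In the sample, the second Czech sentence is paired with two English sentences, which is the kind of one-to-many link an aligner has to produce when a translator splits a sentence.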

Taggers/lemmatizers:

Morče for Czech
TreeTagger for English, German, French, Italian, Dutch, Spanish, Bulgarian and Russian
Morfeusz and TaKIPI for Polish
HunPos for Hungarian
Tagger for Slovak
Tagger for Lithuanian
Analyzer and tagger for Norwegian

Corpus Query Engine:

Manatee

Data:

Newspaper articles in a number of languages from the site Project Syndicate
Slovak-Czech concordances from the Slovak National Corpus
Short stories "My 1989" in a number of languages, from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus from Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus (in prep.)
Texts in a number of languages from the ParaSol corpus (in prep.)
Newspaper texts from the PressEurope server (in prep.)
Legal texts in EU languages from the JRC-ACQUIS corpus (in prep.)
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober