InterCorp

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC. It serves as a source of data for theoretical studies, lexicography, student research, (foreign) language learning, computer applications, translators and also for the general public.

All texts in InterCorp and all features of the search interface are available after free registration and login. Registration is the same for all public ICNC corpora. No special registration for InterCorp is required if you already have a user login and password for the Czech part of InterCorp.

InterCorp is a part of the Czech National Corpus, a project funded by the Ministry of Education of the Czech Republic within the programme Large Research, Development and Innovation Infrastructures (LM2015044; 2016-2019). In 2012-2015 and 2005-2011 the project was supported from the same source (projects no. LM2011023 and 0021620823, respectively). The entire project is academic and non-commercial.

Description

Starting with Release 6, InterCorp can be seen as referential: all its previous releases stay available in their originally published form. The volume of texts, the number of languages and the extent of annotation (lemmatization and tagging) may grow with each new release and the introduction of new tools.

For more details about the individual releases of InterCorp see the overview below:

Release	Publication year	Number of words in millions¹⁾	Number of foreign languages	Tagged / lemmatized	List of changes
InterCorp Release 10	2017	1,483.8	39	23 / 22	Release 10
InterCorp Release 9	2016	1,460.0	39	23 / 20	Release 9
InterCorp Release 8	2015	1,423.0	38	20 / 17	Release 8
InterCorp Release 7	2014	1,390.0	38	20 / 17	Release 7
InterCorp Release 6	2013	867.3	31	17 / 14	Release 6
InterCorp Release 5	2012	542.6	27	17 / 14	Release 5
InterCorp Release 4	2011	92.3	22	13 / 10	Release 4
InterCorp Release 3	2011	72.3	22	13 / 10	Release 3
InterCorp Release 2	2009	49.3	21	10 / 7	Release 2
InterCorp Release 1	2009	34.5	20	10 / 7	Release 1
InterCorp Release 0	2008	25.0	19	0 / 0	Release 0

The corpus consists of two parts: the core and the collections. The core consists mostly of fiction with manually checked alignments. The collections are texts acquired in multiple languages and processed and aligned automatically, so concordances from them may include more misaligned segments. Moreover, the collections do not always include all texts from the original source; texts without a Czech counterpart, for example, are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.
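The pivot arrangement can be pictured with a small data-structure sketch. It is only an illustration under simplifying assumptions, not the format InterCorp actually uses: the class name ParallelText, its methods, the language codes and the sample sentences are all invented here, and segments are assumed to align one-to-one with the Czech version, which real alignments need not do.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ParallelText:
    """Hypothetical model of one text: a single Czech pivot plus aligned versions."""

    title: str
    czech_segments: List[str]
    # Language code -> segments. Simplifying assumption: segment i of every
    # foreign version corresponds to segment i of the Czech version.
    versions: Dict[str, List[str]] = field(default_factory=dict)

    def add_version(self, lang: str, segments: List[str]) -> None:
        """Attach a foreign-language version aligned with the Czech pivot."""
        if len(segments) != len(self.czech_segments):
            raise ValueError("version must have one segment per Czech segment")
        self.versions[lang] = segments

    def pairs(self, lang_a: str, lang_b: str) -> List[Tuple[str, str]]:
        """Pair two foreign versions indirectly, through the shared Czech pivot."""
        return list(zip(self.versions[lang_a], self.versions[lang_b]))


# Invented sample data, for illustration only.
text = ParallelText("demo", ["Miluji tě.", "Děkuji."])
text.add_version("en", ["I love you.", "Thank you."])
text.add_version("de", ["Ich liebe dich.", "Danke."])
print(text.pairs("en", "de"))
```

Because every foreign version is aligned only with the Czech text, any two foreign languages can be compared without a direct alignment between them, at the cost of depending on the quality of both alignments with Czech.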

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus (previously also via NoSketch Engine and Park). A tutorial on KonText is available in Czech.

Specifying a parallel query

Result of a query for substrings lieb and lov

Contacts

Project coordination, technical support and web pages administration: martin.vavrin(at mark)ff.cuni.cz

Project administration: alexandr.rosen(at mark)ff.cuni.cz, lucie.novakova(at mark)ff.cuni.cz

Discussion group: intercorp(at mark)ff.cuni.cz (group address; please use only in justified cases)

Participants

Project administration

Ing. Alexandr Rosen, Ph.D.
Institute of Theoretical and Computational Linguistics

Ing. Lucie Nováková
Institute of the Czech National Corpus

Software and technical support

Bc. Martin Vavřín
Institute of the Czech National Corpus

Mgr. Bc. Adrian Zasina
Institute of the Czech National Corpus

Coordinators for specific languages

	Arabic Doc. PhDr. Petr Zemánek CSc. Institute of Comparative Linguistics Mgr. Jiří Milička Institute of Comparative Linguistics
	Belarusian PhDr. Veranika Bialkovich
	Bulgarian Prof. PhDr. Hana Gladkova, CSc. Department of South Slavonic and Balkan Studies Mgr. Natalie Kalajdžievová Ph.D. Department of South Slavonic and Balkan Studies
	Catalan Mgr. Andreu Bauçà i Sastre, PhD. Centre Carlemany de Llengua Catalana, Department of Romance Studies Mgr. Joan Ramon Marina Amat Ensenyament Superior, Recerca i Ajuts a l’Estudi, Govern d'Andorra
	Chinese Mgr. Vlastimil Dobečka Department of Asian Studies, Faculty of Arts, Palacký University, Olomouc
	Croatian Mgr. Karel Jirásek, Ph.D. Department of South Slavonic and Balkan Studies
	Danish Mgr. Jana Pavlisová Mgr. Kateřina Haušildová Department of Germanic Studies
	Dutch Mgr. Eliška Boková PhDr. Zdenka Hrnčířová Department of Germanic Studies
	English Prof. PhDr. Aleš Klégr Department of English Language and ELT Methodology PhDr. Markéta Malá, Ph.D. Department of English Language and ELT Methodology PhDr. Pavlína Šaldová, Ph.D. Department of English Language and ELT Methodology Mgr. Leona Rohrauer Department of English Language and ELT Methodology Mgr. Michal Kubánek Department of English and American Studies, Faculty of Arts, Palacký University Olomouc
	Finnish Mgr. Lenka Fárová, Ph.D. Department of Germanic Studies
	French PhDr. Olga Nádvorníková Ph.D. Department of Romance Studies
	German PhDr. Vít Dovalil, Ph.D. Department of Germanic Studies Mgr. Štěpán Zbytovský, Ph.D. Department of Germanic Studies Mgr. Tomáš Káňa, Ph.D. Department of German Language and Literature, Faculty of Education, Masaryk University, Brno PhDr. Hana Peloušková, Ph.D. Department of German Language and Literature, Faculty of Education, Masaryk University, Brno
	Hindi Mgr. Nora Melnikova, Ph.D. Institute of South and Central Asia
	Hungarian Mgr. Simona Kolmanová, Ph.D. Department of Central European Studies
	Italian doc. Pavel Štichauer, Ph.D. Department of Romance Studies
	Japanese Mgr. Petra Kanasugi, Ph.D. Institute of East Asian Studies
	Latvian Mgr. Michal Škrabal, Ph.D. Institute of the Czech National Corpus
	Lithuanian RNDr. Hana Skoumalová, Ph.D. Institute of Theoretical and Computational Linguistics
	Macedonian PhDr. Michala Adamová Institute of the Czech National Corpus Mgr. Vojkan Milenković Department of East European Studies
	Norwegian Mgr. Pavel Vondřička, Ph.D. Institute of the Czech National Corpus
	Polish Mgr. Łucja Bańczyk Dr. Renata Dybalska Department of Central European Studies
	Portuguese PhDr. Jaroslava Jindrová Ph.D. Department of Romance Studies
	Romani Ruben Pellar, Master of Arts, Ph.D.
	Romanian Ing. Alexandr Krestovský Charles University in Prague, CERGE
	Russian PhDr. Natálie Rajnochová, Ph.D. Department of East European Studies Mgr. Naděžda Runštuková
	Serbian PhDr. Ana Adamovičová Institute of Czech Studies
	Slovak doc. PhDr. Mira Nábělková CSc. Department of East European Studies
	Slovenian Mgr. Leoš Soustružník Mgr. David Blažek, Ph.D. Institute of Slavonic Studies, Czech Academy of Sciences
	Spanish Doc. PhDr. Petr Čermák, Ph.D. Department of Romance Studies
	Swedish Mgr. Silvie Cinková, Ph.D. Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics
	Ukrainian Dr. Natalia Kotsyba

Citing InterCorp

Specific language combination: Author, 1., Author, 2., Author, 3.²⁾: InterCorp – English, German ³⁾, Release 8 of 4 June 2015. Institute of the Czech National Corpus, Charles University, Prague 2015. Available from: http://www.korpus.cz

Whole corpus: Rosen, A., Vavřín, M.: InterCorp, Release 8 of 4 June 2015. Institute of the Czech National Corpus, Charles University, Prague 2015. Available from: http://www.korpus.cz

Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.
