InterCorp

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC. It serves as a source of data for theoretical studies, lexicography, student research, (foreign) language learning, computer applications, translators and also for the general public.

All texts in InterCorp and all features of the search interface are available after free registration and login via the KonText or Treq interface. Registration is the same for all public ICNC corpora. No separate registration for InterCorp is required if you already have a user login and password for the Czech part of InterCorp.

InterCorp is a part of the Czech National Corpus, a project funded by the Ministry of Education of the Czech Republic within the programme Large Research, Development and Innovation Infrastructures (LM2018137; 2020–22). In 2016–2019, 2012–2015 and 2005–2011 the project was supported from the same source (projects no. LM2015044, LM2011023 and 0021620823, respectively). The entire project is academic and non-commercial.

Description

Starting with Release 6, InterCorp can be seen as referential: all its previous releases stay available in their originally published form. The volume of texts, the number of languages and the extent of annotation (lemmatization and tagging) may grow with each new release and the introduction of new tools.

For more details about the individual releases of InterCorp see the overview below:

Release	Publication year	Number of words in millions¹⁾	Number of foreign languages	Tagged / lemmatized	List of changes
InterCorp Release 16ud	2024	4,859.2	61	47 / 47	Release 16ud
InterCorp Release 16	2023	4,893.0	61	27 / 25	Release 16
InterCorp Release 15	2022	1,588.2	41	27 / 25	Release 15
InterCorp Release 14	2022	1,572.0	41	27 / 25	Release 14
InterCorp Release 13ud	2021	1,551.2	40	35 / 35	Release 13ud
InterCorp Release 13	2020	1,551.2	40	27 / 25	Release 13
InterCorp Release 12	2019	1,533.7	40	27 / 25	Release 12
InterCorp Release 11	2018	1,508.4	39	26 / 25	Release 11
InterCorp Release 10	2017	1,483.8	39	23 / 22	Release 10
InterCorp Release 9	2016	1,460.0	39	23 / 20	Release 9
InterCorp Release 8	2015	1,423.0	38	20 / 17	Release 8
InterCorp Release 7	2014	1,390.0	38	20 / 17	Release 7
InterCorp Release 6	2013	867.3	31	17 / 14	Release 6
InterCorp Release 5	2012	542.6	27	17 / 14	Release 5
InterCorp Release 4	2011	92.3	22	13 / 10	Release 4
InterCorp Release 3	2011	72.3	22	13 / 10	Release 3
InterCorp Release 2	2009	49.3	21	10 / 7	Release 2
InterCorp Release 1	2009	34.5	20	10 / 7	Release 1
InterCorp Release 0	2008	25.0	19	0 / 0	Release 0

The corpus consists of two parts: the core and the collections. The core of InterCorp consists mostly of fiction with manually checked alignments. Collections are texts acquired in multiple languages and processed and aligned automatically, so concordances from them may include more misaligned segments. Moreover, collections do not always include all texts from the original source; texts without a Czech counterpart, for example, are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.
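
The pivot arrangement can be illustrated with a minimal Python sketch. It is not the format or code used by InterCorp: the segment IDs, the record layout and the function names are hypothetical, and the toy data is invented. The sketch only assumes, as an illustration of the description above, that every foreign-language segment is linked to a Czech segment, so that two foreign languages can be paired by going through Czech.

```python
from collections import defaultdict

# Toy alignment records, invented for illustration only:
# (Czech segment ID, foreign language code, foreign segment ID).
ALIGNMENTS = [
    ("cs:s1", "de", "de:s1"),
    ("cs:s1", "en", "en:s1"),
    ("cs:s2", "de", "de:s2"),
    ("cs:s2", "en", "en:s2a"),
    ("cs:s2", "en", "en:s2b"),  # assumption: one Czech segment may map to several
    ("cs:s3", "en", "en:s3"),   # no German version of this segment
]


def build_pivot_index(records):
    """Group foreign segments by their Czech (pivot) segment and language."""
    index = defaultdict(lambda: defaultdict(list))
    for cs_id, lang, seg_id in records:
        index[cs_id][lang].append(seg_id)
    return index


def align_via_pivot(index, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech segment."""
    pairs = []
    for cs_id in sorted(index):
        by_lang = index[cs_id]
        if lang_a in by_lang and lang_b in by_lang:
            pairs.append((by_lang[lang_a], by_lang[lang_b]))
    return pairs


pivot_index = build_pivot_index(ALIGNMENTS)
de_en = align_via_pivot(pivot_index, "de", "en")
for de_segments, en_segments in de_en:
    print(de_segments, "<->", en_segments)
```

In this sketch a German–English pair exists only where both languages are linked to the same Czech segment, which is why the third toy segment, lacking a German version, produces no pair.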

InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus (previously also via NoSketch Engine and Park). A tutorial on KonText is available in Czech.

Specifying a parallel query

Result of a query for substrings lieb and lov

Contacts

Project coordination, technical support and web pages administration

Alexandr Rosen
Institute of the Czech National Corpus
email: alexandr.rosen(at mark)ff.cuni.cz

Discussion group

intercorp(at mark)ff.cuni.cz - group address, please use when appropriate

Participants

Coordinators for specific languages

	Arabic Doc. PhDr. Petr Zemánek CSc. Institute of Comparative Linguistics PhDr. Jiří Milička, Ph.D. Institute of the Czech National Corpus
	Belarusian PhDr. Veranika Bialkovich
	Bulgarian Prof. PhDr. Hana Gladkova, CSc. Department of South Slavonic and Balkan Studies Mgr. Natalie Kalajdžievová Ph.D. Department of South Slavonic and Balkan Studies
	Catalan Mgr. Andreu Bauçà i Sastre, Ph.D. Centre Carlemany de Llengua Catalana, Department of Romance Studies, Ensenyament Superior, Recerca i Ajuts a l’Estudi, Govern d'Andorra
	Chinese Mgr. Vlastimil Dobečka Department of Asian Studies, Faculty of Arts, Palacký University, Olomouc
	Croatian Mgr. Karel Jirásek, Ph.D. Department of South Slavonic and Balkan Studies
	Danish Mgr. Jana Pavlisová Mgr. Kateřina Haušildová Department of Germanic Studies
	Dutch Mgr. Eliška Boková PhDr. Zdenka Hrnčířová Department of Germanic Studies
	English Mgr. Denisa Šebestová Department of English Language and ELT Methodology doc. PhDr. Markéta Malá, Ph.D. Department of Linguistics Mgr. Michal Kubánek Department of English and American Studies, Faculty of Arts, Palacký University Olomouc
	Finnish Mgr. Lenka Fárová, Ph.D. Department of Germanic Studies
	French PhDr. Olga Nádvorníková Ph.D. Department of Romance Studies
	German Mgr. Štěpán Zbytovský, Ph.D. Department of Germanic Studies Mgr. Tomáš Káňa, Ph.D. Department of German Language and Literature, Faculty of Education, Masaryk University, Brno PhDr. Hana Peloušková, Ph.D. Department of German Language and Literature, Faculty of Education, Masaryk University, Brno PhDr. Vít Dovalil, Ph.D. Department of Germanic Studies
	Hindi Mgr. Nora Melnikova, Ph.D. Institute of South and Central Asia Bc. Vojtěch Diatka Department of Linguistics
	Hungarian Mgr. Simona Kolmanová, Ph.D. Department of Central European Studies
	Italian doc. Pavel Štichauer, Ph.D. Department of Romance Studies
	Japanese Mgr. Petra Kanasugi, Ph.D. Institute of East Asian Studies
	Latvian Mgr. Michal Škrabal, Ph.D. Institute of the Czech National Corpus Mgr. Marija Lazar
	Lithuanian Mgr. Věra Kociánová RNDr. Hana Skoumalová, Ph.D.
	Macedonian PhDr. Michala Adamová Institute of the Czech National Corpus Mgr. Vojkan Milenković
	Norwegian Mgr. Pavel Vondřička Ph.D. Institute of the Czech National Corpus
	Polish Mgr. Łucja Bańczyk Dr. Renata Dybalska Department of Central European Studies
	Portuguese PhDr. Jaroslava Jindrová Ph.D. Department of Romance Studies
	Romani Ruben Pellar, Master of Arts, Ph.D.
	Romanian Ing. Alexandr Krestovský Charles University in Prague, CERGE
	Russian PhDr. Natálie Rajnochová, Ph.D. Department of East European Studies Mgr. Naděžda Runštuková
	Serbian PhDr. Ana Adamovičová Institute of Czech Studies
	Slovak doc. PhDr. Mira Nábělková CSc. Department of East European Studies
	Slovenian Mgr. Leoš Soustružník Mgr. David Blažek, Ph.D. Institute of Slavonic Studies, Czech Academy of Sciences
	Spanish Doc. PhDr. Petr Čermák, Ph.D. Department of Romance Studies
	Swedish Lenka John Embassy of Sweden Mgr. Silvie Cinková, Ph.D.
	Ukrainian Dr. Natalia Kotsyba

Citing InterCorp

Specific language combination: Author 1, Author 2 & Author 3²⁾ (2022): InterCorp – English, German³⁾, Release 15 of 11 November 2022. Institute of the Czech National Corpus, Charles University, Prague. Available from: http://www.korpus.cz

Whole corpus: Rosen, A., Vavřín, M. & Zasina, A. J. (2022): InterCorp, Release 15 of 11 November 2022. Institute of the Czech National Corpus, Charles University. Available from: http://www.korpus.cz

Čermák, F. & Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427. Electronic version at IngentaConnect; preprint version.