


InterCorp

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC). It serves as a source of data for theoretical studies, lexicography, student research, (foreign) language learning and computer applications, and is also intended for translators and the general public.

All texts in InterCorp and all features of the search interface are available after free registration and login via the KonText or Treq interface. The registration is identical for all public ICNC corpora. No special registration for InterCorp is required if you already have a user login and password for the Czech part of InterCorp.

InterCorp is a part of the Czech National Corpus, a project funded by the Ministry of Education of the Czech Republic within the programme Large Research, Development and Innovation Infrastructures (LM2015044; 2016-2019). In 2012-2015 and 2005-2011 the project was supported from the same source (projects no. LM2011023 and 0021620823, respectively). The entire project is academic and non-commercial.

Description

Starting with Release 6, InterCorp can be seen as referential: all its previous releases stay available in their originally published form. The volume of texts, the number of languages and the extent of annotation (lemmatization and tagging) may grow with each new release and the introduction of new tools.

For more details about the individual releases of InterCorp see the overview below:

| Release | Publication year | Number of words in millions 1) | Number of foreign languages | Tagged / lemmatized | List of changes |
|---|---|---|---|---|---|
| InterCorp Release 12 | 2019 | 1,533.7 | 40 | 26 / 25 | Release 12 |
| InterCorp Release 11 | 2018 | 1,508.4 | 39 | 26 / 25 | Release 11 |
| InterCorp Release 10 | 2017 | 1,483.8 | 39 | 23 / 22 | Release 10 |
| InterCorp Release 9 | 2016 | 1,460.0 | 39 | 23 / 20 | Release 9 |
| InterCorp Release 8 | 2015 | 1,423.0 | 38 | 20 / 17 | Release 8 |
| InterCorp Release 7 | 2014 | 1,390.0 | 38 | 20 / 17 | Release 7 |
| InterCorp Release 6 | 2013 | 867.3 | 31 | 17 / 14 | Release 6 |
| InterCorp Release 5 | 2012 | 542.6 | 27 | 17 / 14 | Release 5 |
| InterCorp Release 4 | 2011 | 92.3 | 22 | 13 / 10 | Release 4 |
| InterCorp Release 3 | 2011 | 72.3 | 22 | 13 / 10 | Release 3 |
| InterCorp Release 2 | 2009 | 49.3 | 21 | 10 / 7 | Release 2 |
| InterCorp Release 1 | 2009 | 34.5 | 20 | 10 / 7 | Release 1 |
| InterCorp Release 0 | 2008 | 25.0 | 19 | 0 / 0 | Release 0 |

The corpus consists of two parts: core and collections. The core of InterCorp consists mostly of fiction with manually checked alignments. Collections are texts acquired in multiple languages, processed and aligned automatically, so concordances may include more misaligned segments. Moreover, collections do not always include all texts from the original source; for example, texts without a Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, with one additional reduction: only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.
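
One consequence of the pivot design is that two foreign-language versions of a text can be related to each other through the Czech segments they share. The following is a minimal sketch of that idea on toy data. The segment identifiers, the dictionary layout and the function name `align_via_pivot` are assumptions made for illustration only; they do not reflect the actual data format or tooling of InterCorp.

```python
# Hypothetical toy data: Czech segments keyed by an invented segment id.
czech_segments = {
    "s1": "Miluji tě.",
    "s2": "Dobrý den.",
}

# Hypothetical alignments: for each foreign language, the segments aligned
# with a given Czech segment id. Not every language covers every segment.
alignments = {
    "en": {"s1": "I love you.", "s2": "Good day."},
    "de": {"s1": "Ich liebe dich."},  # no German counterpart for s2 here
}


def align_via_pivot(czech, aligned, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech segment."""
    a = aligned[lang_a]
    b = aligned[lang_b]
    # Keep only the Czech segment ids that both languages are aligned with.
    shared = [seg_id for seg_id in czech if seg_id in a and seg_id in b]
    return [(czech[seg_id], a[seg_id], b[seg_id]) for seg_id in shared]


pairs = align_via_pivot(czech_segments, alignments, "en", "de")
for czech_text, english_text, german_text in pairs:
    print(czech_text, "|", english_text, "|", german_text)
```

In this sketch only segment `s1` yields an English-German pair, because `s2` has no German counterpart in the toy data.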

InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus (previously also via NoSketch Engine and Park). A tutorial on KonText is available in Czech.

[Figure: Specifying a parallel query]
[Figure: Result of a query for substrings lieb and lov]

Contacts

Project coordination, technical support and website administration: martin.vavrin(at mark)ff.cuni.cz

Project administration: alexandr.rosen(at mark)ff.cuni.cz, lucie.novakova(at mark)ff.cuni.cz

Discussion group: intercorp(at mark)ff.cuni.cz (group address; please use only in justified cases)

Participants

Project administration

Software and technical support

Bc. Martin Vavřín
Institute of the Czech National Corpus

Mgr. Bc. Adrian Zasina, Ph.D.
Institute of the Czech National Corpus

Coordinators for specific languages

Arabic
Doc. PhDr. Petr Zemánek, CSc.
Institute of Comparative Linguistics
PhDr. Jiří Milička, Ph.D.
Institute of the Czech National Corpus
Belarusian
PhDr. Veranika Bialkovich
Bulgarian
Prof. PhDr. Hana Gladkova, CSc.
Department of South Slavonic and Balkan Studies
Mgr. Natalie Kalajdžievová, Ph.D.
Department of South Slavonic and Balkan Studies
Catalan
Mgr. Andreu Bauçà i Sastre, Ph.D.
Centre Carlemany de Llengua Catalana, Department of Romance Studies,
Ensenyament Superior, Recerca i Ajuts a l’Estudi, Govern d'Andorra
Chinese
Mgr. Vlastimil Dobečka
Department of Asian Studies, Faculty of Arts, Palacký University, Olomouc
Croatian
Mgr. Karel Jirásek, Ph.D.
Department of South Slavonic and Balkan Studies
Danish
Mgr. Jana Pavlisová
Mgr. Kateřina Haušildová
Department of Germanic Studies
Dutch
Mgr. Eliška Boková
PhDr. Zdenka Hrnčířová
Department of Germanic Studies
English
Mgr. Denisa Šebestová
Department of English Language and ELT Methodology
doc. PhDr. Markéta Malá, Ph.D.
Department of Linguistics
Mgr. Michal Kubánek
Department of English and American Studies, Faculty of Arts, Palacký University Olomouc
Finnish
Mgr. Lenka Fárová, Ph.D.
Department of Germanic Studies
French
PhDr. Olga Nádvorníková, Ph.D.
Department of Romance Studies
German
Mgr. Štěpán Zbytovský, Ph.D.
Department of Germanic Studies
Mgr. Tomáš Káňa, Ph.D.
Department of German Language and Literature, Faculty of Education, Masaryk University, Brno
PhDr. Hana Peloušková, Ph.D.
Department of German Language and Literature, Faculty of Education, Masaryk University, Brno
PhDr. Vít Dovalil, Ph.D.
Department of Germanic Studies
Hindi
Mgr. Nora Melnikova, Ph.D.
Institute of South and Central Asia
Bc. Vojtěch Diatka
Department of Linguistics
Hungarian
Mgr. Simona Kolmanová, Ph.D.
Department of Central European Studies
Italian
doc. Pavel Štichauer, Ph.D.
Department of Romance Studies
Japanese
Mgr. Petra Kanasugi, Ph.D.
Institute of East Asian Studies
Latvian
Mgr. Michal Škrabal, Ph.D.
Institute of the Czech National Corpus
Mgr. Marija Lazar
Lithuanian
Mgr. Věra Kociánová
RNDr. Hana Skoumalová, Ph.D.
Macedonian
PhDr. Michala Adamová
Institute of the Czech National Corpus
Mgr. Vojkan Milenković
Norwegian
Mgr. Pavel Vondřička, Ph.D.
Institute of the Czech National Corpus
Polish
Mgr. Łucja Bańczyk
Dr. Renata Dybalska
Department of Central European Studies
Portuguese
PhDr. Jaroslava Jindrová, Ph.D.
Department of Romance Studies
Romani
Ruben Pellar, Master of Arts, Ph.D.
Romanian
Ing. Alexandr Krestovský
CERGE, Charles University in Prague
Russian
PhDr. Natálie Rajnochová, Ph.D.
Department of East European Studies
Mgr. Naděžda Runštuková
Serbian
PhDr. Ana Adamovičová
Institute of Czech Studies
Slovak
doc. PhDr. Mira Nábělková, CSc.
Department of East European Studies
Slovenian
Mgr. Leoš Soustružník
Mgr. David Blažek, Ph.D.
Institute of Slavonic Studies, Czech Academy of Sciences
Spanish
Doc. PhDr. Petr Čermák, Ph.D.
Department of Romance Studies
Swedish
Lenka John
Embassy of Sweden
Mgr. Silvie Cinková, Ph.D.
Ukrainian
Dr. Natalia Kotsyba

Citing InterCorp

Specific language combination: Author, 1., Author, 2., Author, 3. 2): InterCorp – English, German 3), Release 10 of 1 December 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available from: http://www.korpus.cz

Whole corpus: Rosen, A. – Vavřín, M. – Zasina, A. J.: InterCorp, Release 10 of 1 December 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available from: http://www.korpus.cz

Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427. An electronic version is available at IngentaConnect; a preprint version is also available.
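
For users who cite several language combinations, the language-specific citation format above can be filled in programmatically. The sketch below is only an illustration: the function name `format_citation` and its parameters are hypothetical, and the author names are the placeholders from the template, which must be replaced by the actual authors listed for each language in KonText.

```python
def format_citation(authors, languages, release, release_date, year):
    """Fill the language-combination citation template with the given values."""
    author_part = ", ".join(authors)
    language_part = ", ".join(languages)
    return (
        f"{author_part}: InterCorp – {language_part}, "
        f"Release {release} of {release_date}. "
        "Institute of the Czech National Corpus, Charles University, "
        f"Prague {year}. Available from: http://www.korpus.cz"
    )


# Placeholder authors, as in the template; replace with the real author list.
citation = format_citation(
    authors=["Author, 1.", "Author, 2.", "Author, 3."],
    languages=["English", "German"],
    release=10,
    release_date="1 December 2017",
    year=2017,
)
print(citation)
```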

1) Total number of words in foreign texts.
2) The list of authors for each language can be found in KonText in the general information about a corpus, which is displayed after clicking the name of the corpus under the KonText logo.
3) Fill in the languages you use.