


InterCorp

InterCorp is a large parallel synchronic corpus covering a number of languages. It is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC). It serves as a source of data for theoretical studies, lexicography, student research, (foreign) language learning, computer applications and translators, as well as for the general public.

All texts in InterCorp and all features of the search interface are available after free registration and login. The same registration covers all public ICNC corpora, so no separate registration for InterCorp is needed if you already have a user login and password for the Czech part of InterCorp.

InterCorp is a part of the Czech National Corpus, a project funded by the Ministry of Education of the Czech Republic within the programme Large Research, Development and Innovation Infrastructures (LM2015044; 2016-2019). In 2012-2015 and 2005-2011 the project was supported from the same source (projects no. LM2011023 and 0021620823, respectively). The entire project is academic and non-commercial.

Description

Starting with Release 6, InterCorp can be treated as a reference corpus: all earlier releases remain available in their originally published form. The volume of texts, the number of languages and the extent of annotation (lemmatization and tagging) may grow with each new release and with the introduction of new tools.

For more details about the individual releases of InterCorp see the overview below:

Release | Publication year | Number of words in millions 1) | Number of foreign languages | Tagged / lemmatized | List of changes
InterCorp Release 9 | 2016 | 1,460.0 | 39 | 23 / 20 | Release 9
InterCorp Release 8 | 2015 | 1,423.0 | 38 | 20 / 17 | Release 8
InterCorp Release 7 | 2014 | 1,390.0 | 38 | 20 / 17 | Release 7
InterCorp Release 6 | 2013 | 867.3 | 31 | 17 / 14 | Release 6
InterCorp Release 5 | 2012 | 542.6 | 27 | 17 / 14 | Release 5
InterCorp Release 4 | 2011 | 92.3 | 22 | 13 / 10 | Release 4
InterCorp Release 3 | 2011 | 72.3 | 22 | 13 / 10 | Release 3
InterCorp Release 2 | 2009 | 49.3 | 21 | 10 / 7 | Release 2
InterCorp Release 1 | 2009 | 34.5 | 20 | 10 / 7 | Release 1
InterCorp Release 0 | 2008 | 25.0 | 19 | 0 / 0 | Release 0

The corpus consists of two parts: the core and the collections. The core consists mostly of fiction with manually checked alignments. The collections are texts acquired in multiple languages and processed and aligned automatically, so concordances from them may include more misaligned segments. Moreover, collections do not always include all texts from the original source; texts without a Czech counterpart, for example, are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.
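The pivot arrangement can be illustrated with a minimal Python sketch. It assumes a made-up representation in which each Czech segment ID maps to the segment IDs aligned with it in one foreign-language version; the IDs, the dictionaries and the function name `align_via_pivot` are hypothetical and do not reflect InterCorp's actual data format or tools. The sketch only shows why a shared Czech version is enough to relate two foreign-language versions to each other.

```python
# Hypothetical alignment data (not InterCorp's real format):
# Czech segment id -> list of aligned segment ids in a foreign-language version.
cs_to_en = {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"], "cs:3": []}
cs_to_de = {"cs:1": ["de:1"], "cs:2": ["de:2"], "cs:3": ["de:3"]}


def align_via_pivot(pivot_to_a, pivot_to_b):
    """Pair segments of two foreign versions that share a Czech pivot segment.

    A Czech segment with no counterpart on either side is skipped.
    """
    pairs = []
    for pivot_id, a_segments in pivot_to_a.items():
        b_segments = pivot_to_b.get(pivot_id, [])
        if a_segments and b_segments:
            pairs.append((pivot_id, a_segments, b_segments))
    return pairs


# English and German segments linked through their common Czech segment.
en_de = align_via_pivot(cs_to_en, cs_to_de)
for pivot_id, en_segments, de_segments in en_de:
    print(pivot_id, en_segments, de_segments)
```

In this toy data the third Czech segment has no English counterpart, so it yields no English-German pair; any such gap or misalignment on either side carries over to the derived pairing.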

InterCorp can be accessed via a standard web browser through KonText, the integrated search interface of the Czech National Corpus (previously also via NoSketch Engine and Park). A Czech tutorial on KonText is available.

Specifying a parallel query
Result of a query for substrings lieb and lov

Contacts

Project coordination, technical support and web pages administration: martin.vavrin(at mark)ff.cuni.cz

Project administration: alexandr.rosen(at mark)ff.cuni.cz, lucie.novakova(at mark)ff.cuni.cz

Discussion group: intercorp(at mark)ff.cuni.cz - group address, please use only in justified cases

Participants

Project administration

Software and technical support

Coordinators for specific languages

Arabic
Doc. PhDr. Petr Zemánek CSc.
Ústav srovnávací jazykovědy
Mgr. Jiří Milička
Ústav srovnávací jazykovědy
Belarusian
PhDr. Veranika Bialkovich
Bulgarian
Prof. PhDr. Hana Gladkova, CSc.
Ústav slavistických a východoevropských studií
Mgr. Natalie Kalajdžievová Ph.D.
Katedra jihoslovanských a balkanistických studií
Catalan
Mgr. Andreu Bauçà i Sastre, PhD.
Lektorát katalánského jazyka, Ústav románských studií
Mgr. Joan Ramon Marina Amat
Ústav vysokoškolského vzdělávání a výzkumu, Ministerstvo školství a mládeže, Andorra
Croatian
Mgr. Karel Jirásek, Ph.D.
Katedra jihoslovanských a balkanistických studií
Danish
Mgr. Jana Pavlisová
Mgr. Kateřina Haušildová
Ústav germánských studií
Dutch
Mgr. Eliška Boková
PhDr. Zdenka Hrnčířová
Ústav germánských studií
English
Prof. PhDr. Aleš Klégr
Ústav anglického jazyka a didaktiky
PhDr. Markéta Malá, Ph.D.
Ústav anglického jazyka a didaktiky
PhDr. Pavlína Šaldová, Ph.D.
Ústav anglického jazyka a didaktiky
Mgr. Leona Rohrauer
Ústav anglického jazyka a didaktiky
Mgr. Michal Kubánek
Katedra anglistiky a amerikanistiky UP
Finnish
Mgr. Lenka Fárová, Ph.D.
Ústav lingvistiky a ugrofinistiky
French
PhDr. Olga Nádvorníková Ph.D.
Ústav románských studií
German
PhDr. Vít Dovalil, Ph.D.
Ústav germánských studií
Mgr. Štěpán Zbytovský, Ph.D.
Ústav germánských studií
Mgr. Tomáš Káňa, Ph.D.
Katedra německého jazyka a literatury PeF MU v Brně
PhDr. Hana Peloušková, Ph.D.
Katedra německého jazyka a literatury PeF MU v Brně
Hindi
Bc. Vojtěch Diatka
Ústav obecné lingvistiky
Hungarian
Mgr. Simona Kolmanová, Ph.D.
Katedra středoevropských studií
Italian
doc. Pavel Štichauer, Ph.D.
Ústav románských studií
Latvian
Mgr. Michal Škrabal, Ph.D.
Ústav slavistických a východoevropských studií
Lithuanian
RNDr. Hana Skoumalová, Ph.D.
Ústav teoretické a komputační lingvistiky
Macedonian
PhDr. Michala Adamová
Ústav Českého národního korpusu
Mgr. Vojkan Milenkovik
Ústav slavistických a východoevropských studií
Norwegian
Mgr. Pavel Vondřička Ph.D.
Ústav Českého národního korpusu
Polish
Mgr. Łucja Bańczyk
Dr. Renata Dybalska
Ústav slavistických a východoevropských studií
Portuguese
PhDr. Jaroslava Jindrová Ph.D.
Ústav románských studií
Romani
Ruben Pellar, Master of Arts, Ph.D.
Romanian
Ing. Alexandr Krestovský
Univerzita Karlova v Praze CERGE
Russian
PhDr. Natálie Rajnochová, Ph.D.
Ústav slavistických a východoevropských studií
Mgr. Naděžda Runštuková
Serbian
PhDr. Ana Adamovičová
Ústav bohemistických studií
Slovak
doc. PhDr. Mira Nábělková CSc.
Ústav slavistických a východoevropských studií
Slovenian
Mgr. Leoš Soustružník
Mgr. David Blažek, Ph.D.
Slovanský ústav AV ČR
Spanish
Doc. PhDr. Petr Čermák, Ph.D.
Ústav románských studií
Swedish
Mgr. Silvie Cinková, Ph.D.
Ústav formální a aplikované lingvistiky MFF UK
Ukrainian
Dr. Natalia Kotsyba

Citing InterCorp

Specific language combination: Author, 1., Author, 2., Author, 3. 2): InterCorp – English, German 3), Release 8 of 4 June 2015. Institute of the Czech National Corpus, Charles University, Prague 2015. Available from: http://www.korpus.cz

Whole corpus: Rosen, A., Vavřín, M.: InterCorp, Release 8 of 4 June 2015. Institute of the Czech National Corpus, Charles University, Prague 2015. Available from: http://www.korpus.cz

Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.

1) Total number of words in foreign texts.
2) The list of authors for each language is given in KonText in the general information about a corpus, which is displayed when you click the name of the corpus under the KonText logo.
3) Fill in the languages you use.