InterCorp
InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC). It serves as a source of data for theoretical studies, lexicography, student research, (foreign) language learning, computer applications and translation, and is also open to the general public.
All texts in InterCorp and all features of the search interface are available after free registration and login via the KonText or Treq interface. The registration is identical for all public ICNC corpora. No special registration for InterCorp is required if you already have a user login and password for the Czech part of InterCorp.
InterCorp is a part of the Czech National Corpus, a project funded by the Ministry of Education of the Czech Republic within the programme Large Research, Development and Innovation Infrastructures (LM2015044; 2016-2019). In 2012-2015 and 2005-2011 the project was supported from the same source (projects no. LM2011023 and 0021620823, respectively). The entire project is academic and non-commercial.
Description
Starting with Release 6, InterCorp can be seen as referential: all its previous releases stay available in their originally published form. The volume of texts, the number of languages and the extent of annotation (lemmatization and tagging) may grow with each new release and the introduction of new tools.
For more details about the individual releases of InterCorp see the overview below:
Release | Publication year | Number of words (millions) | Number of foreign languages | Tagged / lemmatized | List of changes |
---|---|---|---|---|---|
Intercorp Release 12 | 2019 | 1,533.7 | 40 | 26 / 25 | Release 12 |
Intercorp Release 11 | 2018 | 1,508.4 | 39 | 26 / 25 | Release 11 |
Intercorp Release 10 | 2017 | 1,483.8 | 39 | 23 / 22 | Release 10 |
Intercorp Release 9 | 2016 | 1,460.0 | 39 | 23 / 20 | Release 9 |
Intercorp Release 8 | 2015 | 1,423.0 | 38 | 20 / 17 | Release 8 |
Intercorp Release 7 | 2014 | 1,390.0 | 38 | 20 / 17 | Release 7 |
Intercorp Release 6 | 2013 | 867.3 | 31 | 17 / 14 | Release 6 |
Intercorp Release 5 | 2012 | 542.6 | 27 | 17 / 14 | Release 5 |
Intercorp Release 4 | 2011 | 92.3 | 22 | 13 / 10 | Release 4 |
Intercorp Release 3 | 2011 | 72.3 | 22 | 13 / 10 | Release 3 |
Intercorp Release 2 | 2009 | 49.3 | 21 | 10 / 7 | Release 2 |
Intercorp Release 1 | 2009 | 34.5 | 20 | 10 / 7 | Release 1 |
Intercorp Release 0 | 2008 | 25.0 | 19 | 0 / 0 | Release 0 |
The corpus consists of two parts: the core and the collections. The core of InterCorp consists mostly of fiction with manually checked alignments. The collections are texts acquired in multiple languages and processed and aligned automatically, so concordances from them may include more misaligned segments. Moreover, the collections do not always include all texts from the original source; texts without a Czech counterpart, for example, are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items that are missing in the original resource but can be detected from context or other sources have been added.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.
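The pivot arrangement can be illustrated with a minimal sketch. It assumes, purely for illustration, that each foreign-language version is stored as a list of (Czech segment, foreign segment) pairs; the segment identifiers, the `alignments` dictionary and the `align_via_pivot` function are hypothetical and do not describe InterCorp's internal format or any ICNC tool. Under that assumption, two foreign-language versions of the same text can be paired through the Czech segments they share:

```python
from collections import defaultdict

# Hypothetical alignment data for one text: for each foreign language,
# a list of (czech_segment_id, foreign_segment_id) pairs.
alignments = {
    "en": [("cs:1", "en:1"), ("cs:2", "en:2"), ("cs:2", "en:3")],
    "de": [("cs:1", "de:1"), ("cs:2", "de:2")],
}


def align_via_pivot(alignments, lang_a, lang_b):
    """Pair segments of lang_a and lang_b that share a Czech segment."""
    # Index the second language by its Czech (pivot) segment.
    by_pivot_b = defaultdict(list)
    for cs_id, seg_b in alignments[lang_b]:
        by_pivot_b[cs_id].append(seg_b)

    # Walk the first language and look up partners through the pivot.
    pairs = []
    for cs_id, seg_a in alignments[lang_a]:
        for seg_b in by_pivot_b.get(cs_id, []):
            pairs.append((seg_a, seg_b))
    return pairs


if __name__ == "__main__":
    for pair in align_via_pivot(alignments, "en", "de"):
        print(pair)
```

Because every pairing goes through Czech, a foreign segment with no Czech counterpart simply has no partner in the result, which mirrors the requirement that each text has a Czech version.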
InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus (previously also via NoSketch Engine and Park). A tutorial on KonText is available in Czech.
Contacts
Project coordination, technical support and web pages administration: martin.vavrin(at mark)ff.cuni.cz
Project administration: alexandr.rosen(at mark)ff.cuni.cz, lucie.novakova(at mark)ff.cuni.cz
Discussion group: intercorp(at mark)ff.cuni.cz - group address, please use only in justified cases
Participants
Project administration
Software and technical support
Bc. Martin Vavřín
Institute of the Czech National Corpus
Mgr. Bc. Adrian Zasina, Ph.D.
Institute of the Czech National Corpus
Coordinators for specific languages
Citing InterCorp
Specific language combination: Author, 1., Author, 2., Author, 3.: InterCorp – English, German, Release 10 of 1 December 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available from: http://www.korpus.cz
Whole corpus: Rosen, A. – Vavřín, M. – Zasina, A. J.: InterCorp, Release 10 of 1 December 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available from: http://www.korpus.cz
Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427. Electronic version available at IngentaConnect; a preprint version is also available.