InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.
After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.
There are several aspects that make InterCorp special among the corpora published by ICNC:
The bulk of InterCorp consists of fiction in Czech and other languages, semi-automatically aligned, and a selection of political commentaries published by Project Syndicate. The Project Syndicate issues currently available in Czech, English, French, German, Russian and Spanish date from 2000-2008; more recent texts will follow in future releases of the corpus. Because the Project Syndicate texts have been aligned automatically, search results from them may include more misaligned segments than results from the fiction texts.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The following table gives the size of InterCorp for each language. The word counts include the Project Syndicate data, which amount to approximately 1.5 to 2 million words per language (as of February 2011, in InterCorp v. 3; see Version history).
language | number of words (in thousands) | number of texts |
Bulgarian | 1,135 | 15 |
Croatian | 6,735 | 96 |
Danish | 190 | 5 |
Dutch | 3,914 | 58 |
English | 5,695 | Syndicate + 49 |
Finnish | 1,247 | 19 |
French | 3,141 | Syndicate + 21 |
German | 8,846 | Syndicate + 100 |
Hungarian | 1,123 | 17 |
Italian | 2,817 | 28 |
Latvian | 1,085 | 33 |
Lithuanian | 353 | 17 |
Norwegian | 2,158 | 21 |
Polish | 4,716 | 80 |
Portuguese | 1,312 | 18 |
Romanian | 671 | 5 |
Russian | 2,951 | Syndicate + 25 |
Serbian | 1,724 | 27 |
Slovak | 6,899 | 138 |
Slovene | 992 | 16 |
Spanish | 10,905 | Syndicate + 108 |
Swedish | 3,673 | 47 |
TOTAL | 72,280 | 943 |
Each Czech text is counted only once, even though it may have more than one foreign counterpart.
language | number of words (in thousands) | number of texts |
Czech | 41,340 | Syndicate + 652 |
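The pivot arrangement described above means that two foreign-language versions of the same text can be related to each other through their shared Czech version. The following is a minimal sketch of that idea, assuming alignments are stored as mappings from Czech segment identifiers to lists of foreign segment identifiers. The identifiers, the sample data and the function name `align_via_pivot` are hypothetical and do not reflect InterCorp's actual data format or tools.

```python
# Hypothetical alignment data: each mapping takes a Czech (pivot) segment id
# to the ids of the foreign-language segments aligned with it.
cs_to_en = {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"], "cs:3": ["en:4"]}
cs_to_de = {"cs:1": ["de:1", "de:2"], "cs:2": ["de:3"], "cs:4": ["de:4"]}


def align_via_pivot(pivot_to_a, pivot_to_b):
    """Pair segments of two foreign versions that share a Czech segment.

    Czech segments missing from either mapping are skipped, because no
    indirect link between the two foreign versions exists for them.
    """
    shared = sorted(pivot_to_a.keys() & pivot_to_b.keys())
    return [(pivot_to_a[pid], pivot_to_b[pid]) for pid in shared]


# English-German pairs derived only from the two Czech-based alignments.
en_de = align_via_pivot(cs_to_en, cs_to_de)
for en_segments, de_segments in en_de:
    print(en_segments, "<->", de_segments)
```

In this sketch, any misalignment in either Czech-based alignment carries over to the derived pair, so indirectly aligned segments would need more careful checking than directly aligned ones.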
Texts in the following languages have received some morphosyntactic annotation.
language | tags | lemmas | brief description | detailed description | tool |
English | ✔ | ✔ | in English | in English + additions | TreeTagger |
Bulgarian | ✔ | | in English | | TreeTagger |
Czech | ✔ | ✔ | in Czech, in English 1) | in English | Morče |
French | ✔ | ✔ | in English | | TreeTagger |
Italian | ✔ | ✔ | in English | | TreeTagger |
Lithuanian | ✔ | ✔ | in Czech and English | | Vidas Daudaravičius |
Hungarian | ✔ | | in English | | HunPos |
German | ✔ | ✔ | in German | in German | TreeTagger |
Dutch | ✔ | | | | TreeTagger |
Norwegian | ✔ | ✔ | | | analyzer, tagger |
Polish | ✔ | ✔ | in English, in Polish | in English | Morfeusz, TaKIPI |
Russian | ✔ | ✔ | in English | in English 2) | TreeTagger |
Slovak | ✔ | ✔ | in Slovak | in Slovak | Radovan Garabík, Morče |
Spanish | ✔ | ✔ | in English | | TreeTagger |
See Park Manual for advice on the use of tags in queries.
The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:
We are grateful for the possibility to use the following software and data:
Last update: 24 February 2011