InterCorp: Release 5
InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.
After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.
There are several aspects that make InterCorp special among the corpora published by ICNC:
- Non-parallel versions of all InterCorp bitexts are available through a web-based version of the Bonito interface, so its search and processing features (filter, sort, collocations, frequency distribution, random sampling, etc.) can also be applied to texts from the parallel corpus. Czech texts included in the parallel corpus can be used this way as well.
- Unlike most other ICNC corpora, which are static (unchanging over time), InterCorp is incremental: its size and the number of languages keep growing.
Texts in the corpus
The core of InterCorp consists mostly of manually aligned fiction. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The current selection includes political commentaries published by Project Syndicate and Presseurop, and a package of legal texts, Acquis Communautaire. Because these texts have been aligned automatically, search results may include a higher number of misaligned segments.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 5 (see Version history) is 91,529,000 words in the aligned foreign-language texts of the core part and 451,112,000 words in the collections. The share of the core and the collections in the corpus can be seen in the following chart. The size is shown in millions of words.
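The proportions can also be worked out directly from the two figures quoted above. The following is a minimal Python sketch of that arithmetic; the constant and function names are illustrative only and are not part of any InterCorp tool.

```python
# Word counts quoted in the text for release 5 (aligned foreign-language texts).
CORE_WORDS = 91_529_000
COLLECTION_WORDS = 451_112_000


def shares(core, collections):
    """Return each part's percentage of the combined size, rounded to one decimal."""
    total = core + collections
    return {
        "core": round(100 * core / total, 1),
        "collections": round(100 * collections / total, 1),
    }


# Combined size in millions of words, the unit used for the chart.
total_millions = (CORE_WORDS + COLLECTION_WORDS) / 1_000_000
result = shares(CORE_WORDS, COLLECTION_WORDS)

print(f"total: {total_millions:.1f} million words")
print(f"core: {result['core']} %, collections: {result['collections']} %")
```

On these figures the collections account for roughly five sixths of the foreign-language material, which is worth keeping in mind because they are the automatically aligned part.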
Corpus size in thousands of words
Language | Core | Syndicate | Presseurop | Acquis | Total |
---|---|---|---|---|---|
be | 67 | 0 | 0 | 0 | 67 |
bg | 1 415 | 0 | 0 | 13 816 | 15 231 |
da | 189 | 0 | 0 | 21 680 | 21 869 |
de | 12 004 | 2 567 | 752 | 21 723 | 37 048 |
el | 0 | 0 | 0 | 25 069 | 25 069 |
en | 5 914 | 2 567 | 799 | 24 207 | 33 489 |
es | 11 811 | 2 896 | 860 | 27 001 | 42 570 |
et | 0 | 0 | 0 | 15 962 | 15 962 |
fi | 2 081 | 0 | 0 | 16 667 | 18 748 |
fr | 3 217 | 2 969 | 874 | 27 351 | 34 414 |
hr | 8 103 | 0 | 0 | 0 | 8 103 |
hu | 1 122 | 0 | 0 | 19 167 | 20 290 |
it | 3 484 | 80 | 793 | 24 849 | 29 207 |
lt | 352 | 0 | 0 | 18 432 | 18 785 |
lv | 1 085 | 0 | 0 | 18 744 | 19 830 |
mk | 32 | 0 | 0 | 0 | 32 |
mt | 0 | 0 | 0 | 14 133 | 14 133 |
nl | 6 486 | 0 | 899 | 24 746 | 32 132 |
no | 2 301 | 0 | 0 | 0 | 2 301 |
pl | 8 396 | 0 | 710 | 20 464 | 29 571 |
pt | 2 127 | 0 | 913 | 28 599 | 31 640 |
ro | 1 370 | 0 | 819 | 8 199 | 10 389 |
ru | 1 664 | 2 304 | 0 | 0 | 3 969 |
sk | 7 257 | 0 | 0 | 19 221 | 26 479 |
sl | 991 | 0 | 0 | 19 645 | 20 636 |
sr | 4 295 | 0 | 0 | 0 | 4 295 |
sv | 5 754 | 0 | 0 | 20 615 | 26 369 |
Total | 91 528 | 13 385 | 7 425 | 430 300 | 542 640 |
cs *) | 52 651 | 2 285 | 704 | 20 285 | 75 926 |
*) Each Czech text is counted only once, even though it may have more than one foreign counterpart.
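The counting rule in the note above follows from the pivot arrangement: one Czech version per text, with any number of foreign-language versions attached to it. The sketch below models this in Python under stated assumptions; the `AlignedText` class, the helper functions and the word counts are all invented for illustration and do not reflect the actual InterCorp data format.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedText:
    """One Czech pivot text and its foreign-language versions (hypothetical model)."""
    title: str
    czech_words: int
    # Maps a language code to the word count of that foreign version.
    counterparts: dict = field(default_factory=dict)


def czech_total(texts):
    """Count every Czech text once, however many counterparts it has."""
    return sum(t.czech_words for t in texts)


def czech_total_per_pair(texts):
    """Naive alternative: count the Czech text once per aligned language pair."""
    return sum(t.czech_words * len(t.counterparts) for t in texts)


def foreign_totals(texts):
    """Sum word counts per foreign language across all texts."""
    totals = {}
    for t in texts:
        for lang, words in t.counterparts.items():
            totals[lang] = totals.get(lang, 0) + words
    return totals


# Invented toy figures, not taken from the corpus.
texts = [
    AlignedText("Text A", 1000, {"en": 1100, "de": 1050}),
    AlignedText("Text B", 2000, {"en": 2200}),
]

print(czech_total(texts))           # Czech counted once per text
print(czech_total_per_pair(texts))  # inflated figure if counted per pair
print(foreign_totals(texts))
```

Counting per pair would inflate the Czech figure whenever a text has more than one foreign counterpart, which is why the cs row is reported separately from the foreign-language totals.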
Morphosyntactic annotation
Texts in the following languages have received some morphosyntactic annotation.
language | tags | lemmas | brief description | detailed description | tool |
---|---|---|---|---|---|
Bulgarian | ✔ | | in English | | TreeTagger |
Czech | ✔ | ✔ | in Czech, in English | in English | Morče |
Dutch | ✔ | | | | TreeTagger |
English | ✔ | ✔ | in English | in English + additions | TreeTagger |
Estonian | ✔ | ✔ | Estonian and English | | TreeTagger |
French | ✔ | ✔ | in English | | TreeTagger |
German | ✔ | ✔ | in German | in German | TreeTagger |
Hungarian | ✔ | | in English | | HunPos |
Italian | ✔ | ✔ | in English | | TreeTagger |
Lithuanian | ✔ | ✔ | in Czech and English | | Vidas Daudaravičius |
Norwegian | ✔ | ✔ | in English, in Norwegian | | analyzer, tagger |
Polish | ✔ | ✔ | in English, in Polish | in English | Morfeusz, TaKIPI |
Portuguese | ✔ | ✔ | Spanish | | TreeTagger |
Russian | ✔ | ✔ | in English | in English | TreeTagger |
Slovak | ✔ | ✔ | in Slovak | in Slovak | Radovan Garabík, Morče |
Slovene | ✔ | ✔ | English | | totale |
Spanish | ✔ | ✔ | in English | | TreeTagger |
See Park Manual for advice on the use of tags in queries.
Please note: work in progress
The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:
Acknowledgements
We are grateful for the possibility to use the following software and data:
Pre-processing:
- Sentence splitter for Czech by Pavel Květoň
- Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
- Sentence splitter Punkt for all other languages from Natural Language Toolkit
- Aligner Hunalign
Taggers/lemmatizers:
- Morče for Czech
- TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, German, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
- HunPos for Hungarian
- Tagger for Slovak (thanks to Radovan Garabík)
- Tagger for Lithuanian (thanks to Hanka Skoumalová)
- totale for Slovene (thanks to Tomaž Erjavec)
Corpus Query Engine:
Data:
- Newspaper articles in a number of languages from the site Project Syndicate
- Slovak-Czech concordances from the Slovak National Corpus
- The short stories My 1989, in a number of languages, from the Goethe-Institut
- A number of texts in the Czech-Lithuanian section of the corpus from Patrick Corness
- George Orwell's novel 1984 in a number of languages from the Multext-East corpus
- Ukrainian and Polish texts from the PolUkr corpus (in prep.)
- Newspaper texts from the Presseurop server
- Legal texts in EU languages from the JRC-ACQUIS corpus
Last update: 20 July 2012