InterCorp Release 8
Group | Measure | Czech – core | Czech – collections | other – core | other – collections |
---|---|---|---|---|---|
Positions | Number of tokens | 105 239 198 | 117 981 673 | 233 509 950 | 1 560 655 498 |
Positions | Number of word forms | 84 718 325 | 89 645 545 | 194 055 340 | 1 229 043 791 |
Structural attributes | Number of documents | 1 279 | 5 | 2 513 | 89 |
Structural attributes | Number of div | 1 279 | 111 263 | 2 513 | 1 849 184 |
Structural attributes | Number of sentences | 7 250 794 | 13 588 082 | 14 377 637 | 143 478 514 |
Further information | reference | YES | | | |
Further information | representative | NO | | | |
Further information | publication date | 2015 | | | |
Further information | foreign languages | 38 | | | |
Further information | tagged languages | 20 | | | |
Further information | lemmatized languages | 17 | | | |
Access to the texts
After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
InterCorp can be accessed with a standard web browser via KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary is also available in English.
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
A new release of InterCorp is usually published once a year. With each new release the corpus grows in size, and the number of languages and the extent and quality of annotation may grow as well. Previous versions remain available (starting with release 6).
References
If you publish results based on InterCorp we would appreciate a link to the project site www.korpus.cz/intercorp. In your scientific publications please cite the following paper:
Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).
For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.
When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
Rosen, A., Vavřín, M.: Korpus InterCorp – English, German, version 7 from 19 Dec 2014. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz
Texts in the corpus
The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The present release includes:
- Political commentaries published by Project Syndicate and Presseurop
- A package of legal texts of the European Union from the Acquis Communautaire corpus
- Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
- Film subtitles from the Open Subtitles database
These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
Each text has a Czech counterpart, so Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. In release 8 from May 2015, the aligned foreign-language texts in the available part of InterCorp amount to 194 million words in the core and 1,229 million words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the sizes in millions of words.
Corpus size in thousands of words
Code | Language | Core | Syndicate | Presseurop | Acquis | Europarl | Subtitles | Total |
---|---|---|---|---|---|---|---|---|
ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34 |
be | Belarusian | 2 152 | 0 | 0 | 0 | 0 | 0 | 2 152 |
bg | Bulgarian | 5 240 | 0 | 0 | 13 816 | 9 083 | 0 | 28 140 |
ca | Catalan | 4 632 | 0 | 0 | 0 | 0 | 0 | 4 632 |
da | Danish | 3 016 | 0 | 0 | 21 679 | 13 915 | 14 429 | 53 042 |
de | German | 27 681 | 3 725 | 2 482 | 21 723 | 13 089 | 8 366 | 77 069 |
el | Greek | 0 | 0 | 0 | 25 069 | 15 403 | 23 714 | 64 187 |
en | English | 15 488 | 3 818 | 2 670 | 24 207 | 15 580 | 52 101 | 113 865 |
es | Spanish | 17 475 | 4 324 | 2 816 | 27 001 | 15 885 | 36 378 | 103 882 |
et | Estonian | 0 | 0 | 0 | 15 962 | 10 899 | 10 296 | 37 158 |
fi | Finnish | 3 426 | 0 | 0 | 16 455 | 10 175 | 15 097 | 45 154 |
fr | French | 9 170 | 4 393 | 2 928 | 27 351 | 17 178 | 25 961 | 86 983 |
he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 16 221 |
hi | Hindi | 408 | 0 | 0 | 0 | 0 | 0 | 408 |
hr | Croatian | 15 479 | 0 | 0 | 0 | 0 | 19 092 | 34 572 |
hu | Hungarian | 5 387 | 0 | 0 | 19 176 | 12 306 | 21 239 | 58 110 |
is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1 584 | 1 584 |
it | Italian | 7 247 | 651 | 2 707 | 24 849 | 15 489 | 14 653 | 65 599 |
ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113 |
lt | Lithuanian | 358 | 0 | 0 | 18 392 | 11 212 | 557 | 30 521 |
lv | Latvian | 1 336 | 0 | 0 | 18 744 | 11 688 | 280 | 32 050 |
mk | Macedonian | 3 741 | 0 | 0 | 0 | 0 | 1 877 | 5 619 |
ms | Malay | 0 | 0 | 0 | 0 | 0 | 3 520 | 3 520 |
mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 0 | 14 133 |
nl | Dutch | 9 961 | 313 | 2 955 | 24 746 | 15 563 | 29 362 | 82 903 |
no | Norwegian | 4 815 | 0 | 0 | 0 | 0 | 0 | 4 815 |
pl | Polish | 17 516 | 0 | 2 378 | 20 627 | 12 811 | 26 572 | 79 905 |
pt | Portuguese | 2 393 | 369 | 2 999 | 28 602 | 16 484 | 43 391 | 94 241 |
ro | Romanian | 3 432 | 0 | 2 737 | 8 199 | 9 446 | 34 128 | 57 944 |
ru | Russian | 3 337 | 3 174 | 0 | 0 | 0 | 6 885 | 13 397 |
sk | Slovak | 7 401 | 0 | 0 | 19 222 | 12 734 | 5 134 | 44 493 |
sl | Slovenian | 900 | 0 | 0 | 19 645 | 12 240 | 17 024 | 49 810 |
sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 2 003 |
sr | Serbian | 8 823 | 0 | 0 | 0 | 0 | 20 776 | 29 600 |
sv | Swedish | 8 138 | 0 | 0 | 20 585 | 13 840 | 14 693 | 57 258 |
tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 21 190 |
uk | Ukrainian | 5 054 | 0 | 0 | 0 | 0 | 246 | 5 300 |
vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 473 | 1 473 |
Subtotal | | 194 055 | 20 769 | 24 676 | 430 195 | 265 029 | 488 372 | 1 423 098 |
cs | Czech | 84 718 | 3 416 | 2 315 | 20 303 | 12 922 | 50 688 | 174 363 |
TOTAL | | 278 773 | 24 185 | 26 991 | 450 498 | 277 951 | 539 060 | 1 597 462 |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
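The relation between the last three rows of the table can be checked with a few lines of Python. This is only an illustrative sketch: the figures are copied from the Subtotal, Czech and TOTAL rows, and the variable and function names are made up for the example.

```python
# Figures in thousands of words, copied from the Subtotal, cs and TOTAL rows.
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis", "Europarl", "Subtitles"]
subtotal = [194_055, 20_769, 24_676, 430_195, 265_029, 488_372]  # foreign languages
czech = [84_718, 3_416, 2_315, 20_303, 12_922, 50_688]           # pivot, counted once
total = [278_773, 24_185, 26_991, 450_498, 277_951, 539_060]


def column_totals(foreign, pivot):
    """Add the Czech figures (each text counted once) to the foreign subtotals."""
    return [f + p for f, p in zip(foreign, pivot)]


computed = column_totals(subtotal, czech)
for name, got, listed in zip(COLUMNS, computed, total):
    # Print computed value, listed value and their difference per column.
    print(f"{name:<11}{got:>9}{listed:>9}{got - listed:>4}")
```

Because the table is rounded to thousands, sums across a row (such as the Total column) may differ from the listed value by one or two units.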
Morphosyntactic annotation
Texts in the following languages have received some morphosyntactic annotation.
Language | Tags | Lemmas | Tagset description | Tool |
---|---|---|---|---|
Bulgarian | ✔ | | in English | TreeTagger |
Czech | ✔ | ✔ | brief: in Czech, in English; detailed: in English | Morče |
Dutch | ✔ | | in Dutch | TreeTagger |
English | ✔ | ✔ | brief: in English; detailed: in English + additions | TreeTagger |
Estonian | ✔ | ✔ | in Estonian and English | TreeTagger |
Finnish | ✔ | ✔ | in English | OMorFi+HunPOS |
French | ✔ | ✔ | in English | TreeTagger |
German | ✔ | ✔ | brief: in English; detailed: in German | RFTagger |
Hungarian | ✔ | | in English | HunPos |
Icelandic | ✔ | ✔ | | IceStagger |
Italian | ✔ | ✔ | in English | TreeTagger |
Lithuanian | ✔ | ✔ | brief: in Czech and English; detailed: in English | Author: Vidas Daudaravičius |
Norwegian | ✔ | ✔ | in English, in Norwegian | analyzer, tagger |
Polish | ✔ | ✔ | brief: in English, in Polish; detailed: in English | Morfeusz, TaKIPI |
Portuguese | ✔ | ✔ | in Spanish | TreeTagger |
Russian | ✔ | ✔ | brief: in English; detailed: in English | TreeTagger |
Slovak | ✔ | ✔ | brief: in Slovak; detailed: in Slovak | Radovan Garabík, Morče |
Slovene | ✔ | ✔ | in English | totale |
Spanish | ✔ | ✔ | in English | TreeTagger |
Swedish | ✔ | ✔ | | Stagger |
Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m); only the individual parts are assigned a tag and a lemma. The same applies to Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction but has been split anyway. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
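As a rough illustration of the advice on contracted forms, the sketch below turns a contracted form into the text to type into a Phrase query, and into an equivalent sequence of token positions. The split table only repeats the examples given in the text, the function names are hypothetical, and the `[word="…"]` notation assumes the usual CQL syntax; the actual split is decided by the tagger for each language.

```python
# Splits as described in the text; other forms are not covered by this sketch.
SPLIT_FORMS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_query(form):
    """Return the split parts separated by a space, as a Phrase query expects."""
    parts = SPLIT_FORMS.get(form, [form])  # unknown forms are left unsplit
    return " ".join(parts)


def cql_query(form, attr="word"):
    """Return one token position per split part (assumed CQL notation)."""
    parts = SPLIT_FORMS.get(form, [form])
    return " ".join(f'[{attr}="{part}"]' for part in parts)


for form in SPLIT_FORMS:
    print(form, "->", phrase_query(form), "|", cql_query(form))
```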
Morphological tags containing characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must have those characters preceded by a backslash in queries: tag="wp\$".
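The escaping rule can be sketched as a small helper. The set of special characters below is an assumption (the common regular-expression metacharacters), and both function names are hypothetical; the helper is written out by hand rather than relying on a library routine so that it is clear exactly which characters get a backslash.

```python
# Characters assumed to have a special meaning in regular expressions.
REGEX_SPECIAL = set(".^$*+?()[]{}|\\")


def escape_tag(tag):
    """Put a backslash in front of every regex metacharacter in a tag."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in tag)


def tag_condition(tag):
    """Build a tag="..." condition with the tag value escaped."""
    return f'tag="{escape_tag(tag)}"'


print(tag_condition("wp$"))
```

Tags without special characters pass through unchanged, so the helper can be applied to every tag value before it is placed in a query.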
Structural attributes
Structure | Attribute | Description | Values |
---|---|---|---|
doc | doc.id | unique document identifier | text |
doc | doc.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh |
doc | doc.version | version | number |
doc | doc.wordcount | document size in words | number |
div | div.id | text identification | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE |
div | div.group | corpus part (core or one of the collections) | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate |
div | div.wordcount | number of words | number |
div | div.author | author | last name, first name |
div | div.title | full title | text |
div | div.publisher | publisher | text |
div | div.pubplace | publication place | text |
div | div.pubyear | publication year | date |
div | div.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles |
div | div.original | is the text an original? | Yes / No |
div | div.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh |
div | div.translator | translator | last name, first name |
div | div.transsex | translator's gender | F / M |
div | div.authsex | author's gender | F / M |
p | p.id | unique paragraph identifier | text |
s | s.id | unique sentence identifier | text |
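The structural attributes can be used to restrict a search, for instance to the core or to one text type. The sketch below builds such a restriction as a string. It assumes the common CQL form `within <structure attribute="value" />`, which is not described on this page, so the exact syntax should be checked against the KonText documentation; the function name is hypothetical.

```python
def restrict_to(query, attribute, value):
    """Append a `within` clause for a dotted attribute such as "div.group"."""
    structure, _, name = attribute.partition(".")
    if not structure or not name:
        raise ValueError(f"expected 'structure.attribute', got {attribute!r}")
    # Assumed CQL form: QUERY within <structure attribute="value" />
    return f'{query} within <{structure} {name}="{value}" />'


print(restrict_to('[lemma="house"]', "div.group", "Core"))
print(restrict_to('[lemma="house"]', "div.txtype", "fiction"))
```

Attribute names and values are taken from the table above; only one attribute is used per restriction to keep the assumed syntax minimal.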
Number of texts in the core of the corpus by languages of the text and languages of the original
Language of the original | |||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
↓ Language of the text | ar | be | bg | ca | cs | da | de | en | es | fi | fr | hi | hr | hu | it | lt | lv | mk | nl | no | pl | pt | ro | ru | sk | sl | sr | sv | uk | total | other |
ar | 1 | 1 | 1 | 3 | |||||||||||||||||||||||||||
be | 3 | 8 | 4 | 13 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 39 | ||||||||||||||||||
bg | 19 | 9 | 1 | 27 | 4 | 2 | 1 | 1 | 2 | 2 | 68 | ||||||||||||||||||||
ca | 1 | 16 | 3 | 12 | 5 | 1 | 2 | 3 | 1 | 1 | 45 | 1 | |||||||||||||||||||
cs | 1 | 3 | 19 | 1 | 267 | 9 | 134 | 242 | 127 | 24 | 95 | 2 | 26 | 1 | 20 | 1 | 7 | 1 | 30 | 7 | 49 | 21 | 39 | 56 | 3 | 8 | 58 | 6 | 1257 | ||
da | 6 | 9 | 12 | 27 | |||||||||||||||||||||||||||
de | 85 | 126 | 65 | 10 | 1 | 4 | 1 | 7 | 1 | 1 | 6 | 3 | 3 | 2 | 3 | 1 | 3 | 5 | 327 | ||||||||||||
en | 25 | 4 | 125 | 3 | 1 | 2 | 1 | 1 | 6 | 5 | 4 | 177 | 1 | ||||||||||||||||||
es | 1 | 25 | 8 | 29 | 126 | 1 | 6 | 7 | 1 | 4 | 2 | 3 | 213 | 1 | |||||||||||||||||
fi | 11 | 1 | 1 | 12 | 2 | 25 | 1 | 1 | 1 | 2 | 57 | 1 | |||||||||||||||||||
fr | 36 | 1 | 10 | 83 | 2 | 1 | 2 | 2 | 137 | ||||||||||||||||||||||
hi | 2 | 1 | 1 | 2 | 1 | 7 | |||||||||||||||||||||||||
hr | 1 | 71 | 15 | 52 | 11 | 2 | 4 | 26 | 6 | 7 | 1 | 3 | 4 | 1 | 1 | 8 | 213 | 2 | |||||||||||||
hu | 16 | 5 | 23 | 9 | 1 | 3 | 14 | 71 | |||||||||||||||||||||||
it | 4 | 4 | 21 | 9 | 1 | 3 | 19 | 3 | 1 | 3 | 68 | 1 | |||||||||||||||||||
lt | 8 | 2 | 2 | 1 | 1 | 2 | 1 | 17 | |||||||||||||||||||||||
lv | 22 | 2 | 1 | 1 | 7 | 2 | 1 | 36 | |||||||||||||||||||||||
mk | 15 | 1 | 16 | 1 | 1 | 1 | 2 | 1 | 3 | 2 | 2 | 4 | 49 | ||||||||||||||||||
nl | 24 | 3 | 33 | 7 | 3 | 3 | 30 | 2 | 2 | 3 | 3 | 6 | 119 | ||||||||||||||||||
no | 11 | 5 | 21 | 4 | 1 | 3 | 6 | 2 | 1 | 54 | |||||||||||||||||||||
pl | 36 | 8 | 97 | 10 | 2 | 8 | 2 | 1 | 1 | 3 | 1 | 46 | 4 | 6 | 1 | 5 | 231 | 1 | |||||||||||||
pt | 6 | 8 | 15 | 29 | |||||||||||||||||||||||||||
ro | 7 | 5 | 12 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 33 | 3 | |||||||||||||||||||
ru | 9 | 1 | 22 | 2 | 1 | 1 | 22 | 1 | 3 | 62 | 1 | ||||||||||||||||||||
sk | 55 | 2 | 5 | 1 | 1 | 2 | 56 | 122 | 18 | ||||||||||||||||||||||
sl | 7 | 1 | 2 | 1 | 2 | 2 | 15 | ||||||||||||||||||||||||
sr | 11 | 7 | 33 | 9 | 3 | 7 | 2 | 4 | 3 | 10 | 1 | 5 | 2 | 97 | 3 | ||||||||||||||||
sv | 11 | 4 | 23 | 7 | 2 | 1 | 1 | 50 | 99 | 1 | |||||||||||||||||||||
uk | 6 | 1 | 31 | 3 | 5 | 2 | 5 | 3 | 5 | 6 | 67 | ||||||||||||||||||||
total | 2 | 6 | 39 | 3 | 810 | 19 | 349 | 950 | 335 | 57 | 241 | 4 | 56 | 2 | 89 | 5 | 18 | 3 | 84 | 22 | 128 | 72 | 119 | 118 | 6 | 26 | 164 | 12 |
- The table shows the number of texts in the core of InterCorp.
- Each row gives, for a language that has texts in the core, the number of texts by language of the original (shown in the column headings). E.g. in Arabic there is one Arabic, one Czech and one German original text in the core, that is three Arabic texts in total (see the penultimate column).
- The columns show how many original texts in the language given in the heading have been translated into other languages. The codes of these languages are in the first column. The last column shows the number of texts whose original language is not represented in the core of InterCorp.
- The diagonal gives the number of original texts in a given language. E.g. there are no original texts in Hungarian or Romanian, and for Romanian the core does not even contain a translation of a Romanian original.
Acknowledgements
We are grateful for the possibility to use the following texts and software:
Texts:
- Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
- Political commentaries in a number of languages from the site Project Syndicate
- Newspaper texts in a number of languages from the Presseurop/VoxEurop server
- Legal texts in EU languages from the JRC-ACQUIS corpus
- Proceedings of the European Parliament from the EuroParl corpus
- Slovak-Czech concordances from the Slovak National Corpus
- The short stories My 1989 in a number of languages, from the Goethe-Institut
- A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in several languages – with special thanks to Patrick Corness
- George Orwell's novel 1984 in a number of languages from the Multext-East corpus
- Ukrainian and Polish texts from the PolUkr corpus
- Film subtitles from the database Open Subtitles
Pre-processing:
- Parallel text editor InterText by Pavel Vondřička
- Aligner Hunalign
- Sentence splitter for Czech by Pavel Květoň
- Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
- Sentence splitter Punkt for all other languages from Natural Language Toolkit
Taggers/lemmatizers:
- TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
- HunPOS for Hungarian and other languages
- Tagger for Slovak (thanks to Radovan Garabík)
- Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
- Tagger for Norwegian (thanks to Pavel Vondřička)
- totale for Slovene (thanks to Tomaž Erjavec)
- RFTagger for German
- OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
- Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)