InterCorp: Release 6
Category | Attribute | Czech – core | Czech – collections | other – core | other – collections |
---|---|---|---|---|---|
Positions | Number of tokens | 76 861 107 | 46 880 365 | 167 141 155 | 890 129 077 |
Positions | Number of word forms | 61 962 499 | 37 584 764 | 138 762 949 | 728 507 959 |
Structural attributes | Number of documents | 996 | 4 | 1 939 | 56 |
Structural attributes | Number of div | 996 | 96 988 | 1 939 | 1 728 492 |
Structural attributes | Number of sentences | 5 254 361 | 2 392 808 | 10 283 732 | 44 113 753 |
Further information | reference | YES | | | |
Further information | representative | NO | | | |
Further information | publication date | 2013 | | | |
Further information | foreign languages | 31 | | | |
Further information | tagged languages | 17 | | | |
Further information | lemmatized languages | 14 | | | |
InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.
Access to the texts
After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
InterCorp can be accessed via a standard web browser in two ways:
- From the Park search interface.
- From KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech and a brief summary also in English.
Both search interfaces are based on the Manatee corpus engine and access identical texts. Park can also be used to search the previous version of the corpus.
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. If you are interested, please contact us at the address below.
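As a rough illustration of what "shuffled pairs of sentences" could mean, the sketch below keeps each aligned sentence pair intact and randomizes only the order of the pairs. The function name `shuffle_pairs` and the placeholder data are hypothetical; the actual format of the distributed files is not described here.

```python
import random
from typing import Optional, Sequence

Pair = tuple[str, str]  # (Czech sentence, foreign-language sentence)

def shuffle_pairs(pairs: Sequence[Pair], seed: Optional[int] = None) -> list[Pair]:
    """Return the aligned pairs in random order without altering any pair."""
    shuffled = list(pairs)                  # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # a seed makes the order reproducible
    return shuffled

if __name__ == "__main__":
    # Placeholder strings, not corpus data.
    demo = [("cs-1", "en-1"), ("cs-2", "en-2"), ("cs-3", "en-3")]
    for cs, en in shuffle_pairs(demo, seed=1):
        print(cs, "|", en)
```

Each pair is still usable as a translation example, but the running text of a document is no longer available in its original order.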
Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental. With each new release, its size, the number of languages, and the extent and quality of annotation may grow.
References
If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider adding the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427 (BibTeX entry given under See also, electronic edition at ingentaConnect, preprint version).
For more references see here. Additional references to work using InterCorp are welcome. Please let us know at the e-mail address below.
Texts in the corpus
The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The current choice includes political commentaries published by Project Syndicate and Presseurop, a package of legal texts of the European Union from the Acquis Communautaire corpus, and proceedings of the European Parliament dated 2007–2011 from the Europarl corpus. These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source.
Moreover, even some core texts in the current release no. 6 are temporarily aligned only automatically, without manual checking. This concerns a part of the texts acquired from ASPAC – Amsterdam Slavic Parallel Aligned Corpus. The alignment of these texts will be checked and corrected before the next release.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign language texts in the core part and 728,508,000 in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following chart. The size is shown in millions of words.
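The share of the two parts can also be worked out directly from the word counts quoted above. The following is a minimal sketch; the constant and function names are illustrative and not part of any InterCorp tooling.

```python
# Word counts of the aligned foreign-language texts in release 6, as quoted above.
CORE_WORDS = 138_779_000
COLLECTION_WORDS = 728_508_000

def shares(core: int, collections: int) -> dict[str, float]:
    """Return each part's percentage of the combined size."""
    total = core + collections
    return {
        "core": 100 * core / total,
        "collections": 100 * collections / total,
    }

if __name__ == "__main__":
    total_millions = (CORE_WORDS + COLLECTION_WORDS) / 1_000_000
    print(f"total: {total_millions:.1f} million words")
    for part, pct in shares(CORE_WORDS, COLLECTION_WORDS).items():
        print(f"{part}: {pct:.1f} %")
```

With these figures the core accounts for roughly 16 % of the foreign-language words and the collections for roughly 84 %.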
Corpus size in thousands of words
Language Code | Language | Core | Syndicate | Presseurop | Acquis | Europarl | Total |
---|---|---|---|---|---|---|---|
ar | Arabic | 29 | 0 | 0 | 0 | 0 | 29 |
be | Belarusian | 1 308 | 0 | 0 | 0 | 0 | 1 308 |
bg | Bulgarian | 3 979 | 0 | 0 | 13 816 | 9 083 | 26 879 |
ca | Catalan | 1 758 | 0 | 0 | 0 | 0 | 1 758 |
da | Danish | 190 | 0 | 0 | 21 680 | 13 916 | 35 785 |
de | German | 17 256 | 3 050 | 1 715 | 21 724 | 13 089 | 56 835 |
el | Greek | 210 | 0 | 0 | 25 070 | 15 404 | 40 683 |
en | English | 10 019 | 3 083 | 1 863 | 24 208 | 15 580 | 54 753 |
es | Spanish | 14 552 | 3 479 | 1 948 | 27 001 | 15 885 | 62 865 |
et | Estonian | 0 | 0 | 0 | 15 963 | 10 900 | 26 862 |
fi | Finnish | 2 131 | 0 | 0 | 16 667 | 10 241 | 29 040 |
fr | French | 3 816 | 3 535 | 2 054 | 27 352 | 17 178 | 53 936 |
hi | Hindi | 155 | 0 | 0 | 0 | 0 | 155 |
hr | Croatian | 12 625 | 0 | 0 | 0 | 0 | 12 625 |
hu | Hungarian | 2 511 | 0 | 0 | 19 168 | 12 307 | 33 985 |
it | Italian | 4 081 | 247 | 1 893 | 24 850 | 15 489 | 46 560 |
lt | Lithuanian | 358 | 0 | 0 | 18 433 | 11 020 | 29 811 |
lv | Latvian | 1 337 | 0 | 0 | 18 745 | 11 689 | 31 770 |
mk | Macedonian | 2 664 | 0 | 0 | 0 | 0 | 2 664 |
mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 14 133 |
nl | Dutch | 9 426 | 0 | 2 082 | 24 746 | 15 563 | 51 817 |
no | Norwegian | 2 301 | 0 | 0 | 0 | 0 | 2 301 |
pl | Polish | 12 710 | 0 | 1 660 | 20 464 | 12 805 | 47 640 |
pt | Portuguese | 2 318 | 0 | 2 103 | 28 599 | 16 481 | 49 502 |
ro | Romanian | 2 433 | 0 | 1 917 | 8 200 | 9 446 | 21 995 |
ru | Russian | 4 937 | 2 651 | 0 | 0 | 0 | 7 588 |
sk | Slovak | 8 152 | 0 | 0 | 19 222 | 12 734 | 40 108 |
sl | Slovene | 1 855 | 0 | 0 | 19 646 | 12 241 | 33 741 |
sr | Serbian | 6 972 | 0 | 0 | 0 | 0 | 6 972 |
sv | Swedish | 7 205 | 0 | 0 | 20 615 | 13 874 | 41 694 |
uk | Ukrainian | 1 493 | 0 | 0 | 0 | 0 | 1 493 |
Total | | 138 779 | 16 044 | 17 237 | 430 300 | 264 926 | 867 287 |
cs | Czech | 61 962 | 2 741 | 1 639 | 20 285 | 12 920 | 99 547 |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
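The pivot arrangement behind this note can be sketched as a small data structure: every foreign version is aligned to the Czech version only, so two foreign languages are linked through the Czech segments they share. The segment identifiers and the function name below are hypothetical and do not reflect the internal storage format of InterCorp.

```python
# Hypothetical alignments: Czech segment id -> aligned segment ids in one language.
# A Czech segment may correspond to more than one foreign segment.
Alignment = dict[str, list[str]]

def align_via_pivot(cs_to_a: Alignment, cs_to_b: Alignment) -> list[tuple[list[str], list[str]]]:
    """Pair segments of two foreign languages that share a Czech segment."""
    return [(cs_to_a[cs_id], cs_to_b[cs_id]) for cs_id in cs_to_a if cs_id in cs_to_b]

def czech_segment_count(*alignments: Alignment) -> int:
    """Count each Czech segment once, however many counterparts it has."""
    ids: set[str] = set()
    for alignment in alignments:
        ids.update(alignment)
    return len(ids)

if __name__ == "__main__":
    cs_en = {"s1": ["en1"], "s2": ["en2", "en3"]}
    cs_de = {"s1": ["de1"], "s2": ["de2"], "s3": ["de3"]}
    print(align_via_pivot(cs_en, cs_de))
    print(czech_segment_count(cs_en, cs_de))
```

In this toy example only `s1` and `s2` have both an English and a German counterpart, so only they yield English–German pairs, while the Czech side counts three segments in total.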
Morphosyntactic annotation
Texts in the following languages have received some morphosyntactic annotation.
Language | Tags | Lemmas | Brief description | Detailed description | Tool |
---|---|---|---|---|---|
Bulgarian | ✔ | | in English | | TreeTagger |
Czech | ✔ | ✔ | in Czech, in English | in English | Morče |
Dutch | ✔ | | in Dutch | | TreeTagger |
English | ✔ | ✔ | in English | in English + additions | TreeTagger |
Estonian | ✔ | ✔ | in Estonian and English | | TreeTagger |
French | ✔ | ✔ | in English | | TreeTagger |
German | ✔ | ✔ | in German | | TreeTagger |
Hungarian | ✔ | | in English | | HunPos |
Italian | ✔ | ✔ | in English | | TreeTagger |
Lithuanian | ✔ | ✔ | in Czech and English | in English | Author: Vidas Daudaravičius |
Norwegian | ✔ | ✔ | in English, in Norwegian | | analyzer, tagger |
Polish | ✔ | ✔ | in English, in Polish | in English | Morfeusz, TaKIPI |
Portuguese | ✔ | ✔ | in Spanish | | TreeTagger |
Russian | ✔ | ✔ | in English | in English | TreeTagger |
Slovak | ✔ | ✔ | in Slovak | in Slovak | Radovan Garabík, Morče |
Slovene | ✔ | ✔ | in English | | totale |
Spanish | ✔ | ✔ | in English | | TreeTagger |
See Park Manual for advice on the use of tags in queries.
Problems, comments, suggestions
… on the content of the corpus and on the search interfaces are welcome at
Acknowledgements
We are grateful for the possibility to use the following texts and software:
Texts:
- Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
- Political commentaries in a number of languages from the site Project Syndicate
- Newspaper texts in a number of languages from the Presseurop server
- Legal texts in EU languages from the JRC-ACQUIS corpus
- Proceedings of the European Parliament from the EuroParl corpus
- Slovak-Czech concordances from the Slovak National Corpus
- The short story collection My 1989 in a number of languages, from the Goethe-Institut
- A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
- George Orwell's novel 1984 in a number of languages from the Multext-East corpus
- Ukrainian and Polish texts from the PolUkr corpus (in prep.)
Pre-processing:
- Parallel text editor InterText by Pavel Vondřička
- Aligner Hunalign
- Sentence splitter for Czech by Pavel Květoň
- Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
- Sentence splitter Punkt for all other languages from Natural Language Toolkit
Taggers/lemmatizers:
- Morče for Czech
- TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, German, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
- HunPos for Hungarian
- Tagger for Slovak (thanks to Radovan Garabík)
- Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
- Tagger for Norwegian (thanks to Pavel Vondřička)
- totale for Slovene (thanks to Tomaž Erjavec)
Corpus Query Engine:
- Manatee
Last update: 2 February 2014
See also
@article{cermak:rosen:10,
  Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen},
  Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus},
  Journal = {International Journal of Corpus Linguistics},
  Volume = {13},
  Number = {3},
  Pages = {411--427},
  Year = {2012},
  Issn = {1384-6655},
  Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}
}