InterCorp: Release 6
Name | | Czech – core | Czech – collections | other – core | other – collections |
Positions | Number of tokens | 76 861 107 | 46 880 365 | 167 141 155 | 890 129 077 |
| Number of word forms | 61 962 499 | 37 584 764 | 138 762 949 | 728 507 959 |
Structural attributes | Number of documents | 996 | 4 | 1 939 | 56 |
| Number of div | 996 | 96 988 | 1 939 | 1 728 492 |
| Number of sentences | 5 254 361 | 2 392 808 | 10 283 732 | 44 113 753 |
Further information | reference | YES |
| representative | NO |
| publication date | 2013 |
| foreign languages | 31 |
| tagged languages | 17 |
| lemmatized languages | 14 |
InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.
Access to the texts
After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
InterCorp can be accessed via a standard web browser in two ways:
-
- From Park, a purpose-built interface. A brief user manual is available here.
Both search interfaces are based on the Manatee corpus engine and access identical texts. Park can also be used to search the previous version of the corpus.
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files containing shuffled pairs of sentences. If you are interested, please contact us at the address below.
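To illustrate what "shuffled pairs of sentences" means in practice, here is a minimal sketch in Python. The file format actually delivered by the project is not described on this page, so the data layout (a list of Czech/foreign sentence tuples) and the function name `shuffle_pairs` are assumptions made purely for illustration. The point is that each aligned pair stays intact while the order of pairs is randomized, so the running text cannot be reconstructed.

```python
import random


def shuffle_pairs(pairs, seed=None):
    """Return the aligned pairs in random order.

    Each (Czech, foreign) pair is kept together; only the order
    of the pairs within the file changes.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # copy, so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical aligned sentence pairs (Czech, English).
pairs = [
    ("První věta.", "First sentence."),
    ("Druhá věta.", "Second sentence."),
    ("Třetí věta.", "Third sentence."),
]

shuffled = shuffle_pairs(pairs, seed=0)
for czech, english in shuffled:
    print(czech, "|", english)
```

Such a file still supports sentence-level lookup of translation equivalents, but not the study of phenomena that span sentence boundaries.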
Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: with each new release its size may grow, as may the number of languages and the extent and quality of annotation.
References
If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider citing the following reference in your scientific publications:
Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427 (BibTeX, electronic edition at IngentaConnect, preprint version).
For more references see here. Additional references to work using InterCorp are welcome. Please let us know at the e-mail address below.
Texts in the corpus
The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The current selection includes political commentaries published by Project Syndicate and Presseurop, a package of legal texts of the European Union from the Acquis Communautaire corpus, and proceedings of the European Parliament dated 2007–2011 from the Europarl corpus. These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. Moreover, even some core texts in the current release 6 are temporarily aligned only automatically, without manual checking. This concerns some of the texts acquired from ASPAC (Amsterdam Slavic Parallel Aligned Corpus). The alignment of these texts will be checked and corrected before the next release.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign language texts in the core part and 728,508,000 in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following chart. The size is shown in millions of words.
Setup of the parallel corpus – the core
Setup of the parallel corpus – collections
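Because every text is aligned with its single Czech version, two foreign-language versions of the same text can be related to each other through the Czech pivot. The following Python sketch shows the idea under loudly stated assumptions: the segment identifiers, the dictionary layout and the function `align_via_pivot` are hypothetical and do not reflect the corpus's internal storage format or the way the search interfaces implement this.

```python
# Hypothetical alignments of one text: each Czech segment id maps to the
# ids of the corresponding segments in a foreign-language version.
cs_to_de = {"cs:1": ["de:1"], "cs:2": ["de:2", "de:3"], "cs:3": []}
cs_to_en = {"cs:1": ["en:1"], "cs:2": ["en:2"], "cs:3": ["en:3"]}


def align_via_pivot(pivot_to_a, pivot_to_b):
    """Pair segments of two languages that share the same pivot segment.

    Pivot segments with no counterpart in either language are skipped.
    """
    result = []
    for pivot_id, a_ids in pivot_to_a.items():
        b_ids = pivot_to_b.get(pivot_id, [])
        if a_ids and b_ids:
            result.append((pivot_id, a_ids, b_ids))
    return result


de_en = align_via_pivot(cs_to_de, cs_to_en)
for pivot_id, de_ids, en_ids in de_en:
    print(pivot_id, de_ids, en_ids)
```

One consequence of this design is that the quality of an indirect pairing (here German–English) depends on both alignments with Czech, so errors in either of them carry over.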
Corpus size in thousands of words
Language Code | Language | Core | Syndicate | Presseurop | Acquis | Europarl | Total |
ar | Arabic | 29 | 0 | 0 | 0 | 0 | 29 |
be | Belarusian | 1 308 | 0 | 0 | 0 | 0 | 1 308 |
bg | Bulgarian | 3 979 | 0 | 0 | 13 816 | 9 083 | 26 879 |
ca | Catalan | 1 758 | 0 | 0 | 0 | 0 | 1 758 |
da | Danish | 190 | 0 | 0 | 21 680 | 13 916 | 35 785 |
de | German | 17 256 | 3 050 | 1 715 | 21 724 | 13 089 | 56 835 |
el | Greek | 210 | 0 | 0 | 25 070 | 15 404 | 40 683 |
en | English | 10 019 | 3 083 | 1 863 | 24 208 | 15 580 | 54 753 |
es | Spanish | 14 552 | 3 479 | 1 948 | 27 001 | 15 885 | 62 865 |
et | Estonian | 0 | 0 | 0 | 15 963 | 10 900 | 26 862 |
fi | Finnish | 2 131 | 0 | 0 | 16 667 | 10 241 | 29 040 |
fr | French | 3 816 | 3 535 | 2 054 | 27 352 | 17 178 | 53 936 |
hi | Hindi | 155 | 0 | 0 | 0 | 0 | 155 |
hr | Croatian | 12 625 | 0 | 0 | 0 | 0 | 12 625 |
hu | Hungarian | 2 511 | 0 | 0 | 19 168 | 12 307 | 33 985 |
it | Italian | 4 081 | 247 | 1 893 | 24 850 | 15 489 | 46 560 |
lt | Lithuanian | 358 | 0 | 0 | 18 433 | 11 020 | 29 811 |
lv | Latvian | 1 337 | 0 | 0 | 18 745 | 11 689 | 31 770 |
mk | Macedonian | 2 664 | 0 | 0 | 0 | 0 | 2 664 |
mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 14 133 |
nl | Dutch | 9 426 | 0 | 2 082 | 24 746 | 15 563 | 51 817 |
no | Norwegian | 2 301 | 0 | 0 | 0 | 0 | 2 301 |
pl | Polish | 12 710 | 0 | 1 660 | 20 464 | 12 805 | 47 640 |
pt | Portuguese | 2 318 | 0 | 2 103 | 28 599 | 16 481 | 49 502 |
ro | Romanian | 2 433 | 0 | 1 917 | 8 200 | 9 446 | 21 995 |
ru | Russian | 4 937 | 2 651 | 0 | 0 | 0 | 7 588 |
sk | Slovak | 8 152 | 0 | 0 | 19 222 | 12 734 | 40 108 |
sl | Slovene | 1 855 | 0 | 0 | 19 646 | 12 241 | 33 741 |
sr | Serbian | 6 972 | 0 | 0 | 0 | 0 | 6 972 |
sv | Swedish | 7 205 | 0 | 0 | 20 615 | 13 874 | 41 694 |
uk | Ukrainian | 1 493 | 0 | 0 | 0 | 0 | 1 493 |
Total | | 138 779 | 16 044 | 17 237 | 430 300 | 264 926 | 867 287 |
cs | Czech | 61 962 | 2 741 | 1 639 | 20 285 | 12 920 | 99 547 |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
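When reusing the figures from the table, note that a row total does not always equal the sum of its components exactly. The short Python check below uses four rows copied from the table; the explanation that the small differences come from each cell being rounded to thousands independently is an assumption, not something stated by the source.

```python
# Rows copied from the table above (thousands of words):
# (Core, Syndicate, Presseurop, Acquis, Europarl), Total
rows = {
    "de": ([17256, 3050, 1715, 21724, 13089], 56835),
    "en": ([10019, 3083, 1863, 24208, 15580], 54753),
    "ru": ([4937, 2651, 0, 0, 0], 7588),
    "cs": ([61962, 2741, 1639, 20285, 12920], 99547),
}


def total_discrepancy(parts, total):
    """Return the stated total minus the sum of the component figures."""
    return total - sum(parts)


discrepancies = {
    code: total_discrepancy(parts, total)
    for code, (parts, total) in rows.items()
}

for code, diff in discrepancies.items():
    print(code, diff)
```

For these rows the check reports a difference of 1 for German and 0 for the others, so a tolerance of a few thousand words is advisable when validating the table automatically.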
Morphosyntactic annotation
Texts in the following languages have received some morphosyntactic annotation.
See Park Manual for advice on the use of tags in queries.
… on the content of the corpus and on the search interfaces are welcome at
martin.vavrin@ff.cuni.cz
Acknowledgements
We are grateful for the possibility to use the following texts and software:
Texts:
Project Syndicate (http://www.project-syndicate.org/)
Pre-processing
Taggers/lemmatizers:
Corpus Query Engine:
Last update: 2 February 2014
See also