InterCorp: Release 6

| Name | | Czech – core | Czech – collections | other – core | other – collections |
| Positions | Number of tokens | 76 861 107 | 46 880 365 | 167 141 155 | 890 129 077 |
| | Number of word forms | 61 962 499 | 37 584 764 | 138 762 949 | 728 507 959 |
| Structural attributes | Number of documents | 996 | 4 | 1 939 | 56 |
| | Number of div | 996 | 96 988 | 1 939 | 1 728 492 |
| | Number of sentences | 5 254 361 | 2 392 808 | 10 283 732 | 44 113 753 |
| Further information | reference | YES |
| | representative | NO |
| | publication date | 2013 |
| | foreign languages | 31 |
| | tagged languages | 17 |
| | lemmatized languages | 14 |

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser in three ways:

  • From KonText, the integrated search interface of the Czech National Corpus. This interface provides options similar to those of its older version, NoSketch Engine (see below).
  • From NoSketch Engine, the older version of the integrated search interface of the Czech National Corpus. Basic instructions on how to use the interface to search the parallel corpus are available here.
  • From Park, a purpose-built interface. A brief user manual is available here.

All three search interfaces are based on the Manatee corpus engine and access identical texts. Park can also be used to search the previous version of the corpus.

After signing a non-profit licence agreement, texts from InterCorp can also be obtained as bilingual files containing shuffled pairs of sentences. If you are interested, please contact us at the address below.

Unlike most other ICNC corpora, which are static (they do not change over time), InterCorp is incremental. With each new release, its size may grow, as may the number of languages and the extent and quality of annotation.

References

If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider citing the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427 (bibtex1), electronic edition at ingentaConnect, preprint version).

For more references see here. Additional references to work using InterCorp are welcome. Please let us know at the e-mail address below.

Texts in the corpus

The core of InterCorp consists mostly of fiction, aligned manually. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The current selection includes political commentaries published by Project Syndicate and Presseurop, a package of legal texts of the European Union from the Acquis Communautaire corpus, and proceedings of the European Parliament dated 2007–2011 from the Europarl corpus. Because these texts have been aligned automatically, search results may include a higher number of misaligned segments. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. Moreover, even some core texts in the current release 6 are temporarily aligned only automatically, without manual checking. This concerns part of the texts acquired from ASPAC (Amsterdam Slavic Parallel Aligned Corpus). The alignment of these texts will be checked and corrected before the next release.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign language texts in the core part and 728,508,000 in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following chart. The size is shown in millions of words.

Setup of the parallel corpus – the core
Setup of the parallel corpus – collections
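
One practical consequence of using Czech as the pivot is that two foreign-language versions of a text can be paired through the Czech segments they share. The sketch below illustrates the idea in Python. The record layout, the segment ids and the function names are hypothetical and chosen only for illustration; they do not describe InterCorp's actual data format or any of its tools.

```python
from collections import defaultdict

# Hypothetical alignment records: (czech_segment_id, language_code, foreign_segment_id).
# The ids and the layout are illustrative only, not InterCorp's file format.
ALIGNMENTS = [
    ("cs:s1", "en", "en:s1"),
    ("cs:s1", "de", "de:s1"),
    ("cs:s2", "en", "en:s2"),
    ("cs:s2", "de", "de:s2"),
    ("cs:s2", "de", "de:s3"),  # one Czech segment may map to several foreign ones
    ("cs:s3", "en", "en:s3"),  # this segment has no German counterpart
]


def index_by_pivot(alignments):
    """Group foreign segments by Czech (pivot) segment and by language."""
    index = defaultdict(lambda: defaultdict(list))
    for czech_id, lang, foreign_id in alignments:
        index[czech_id][lang].append(foreign_id)
    return index


def pair_via_pivot(index, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech segment."""
    pairs = []
    for czech_id in sorted(index):
        by_lang = index[czech_id]
        if lang_a in by_lang and lang_b in by_lang:
            pairs.append((czech_id, by_lang[lang_a], by_lang[lang_b]))
    return pairs


if __name__ == "__main__":
    for row in pair_via_pivot(index_by_pivot(ALIGNMENTS), "en", "de"):
        print(row)
```

Segments that lack a counterpart in either language are simply skipped, so the English–German pairing here covers only the Czech segments present in both versions.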

Corpus size in thousands of words

| Code | Language | Core | Syndicate | Presseurop | Acquis | Europarl | Total |
| ar | Arabic | 29 | 0 | 0 | 0 | 0 | 29 |
| be | Belarusian | 1 308 | 0 | 0 | 0 | 0 | 1 308 |
| bg | Bulgarian | 3 979 | 0 | 0 | 13 816 | 9 083 | 26 879 |
| ca | Catalan | 1 758 | 0 | 0 | 0 | 0 | 1 758 |
| da | Danish | 190 | 0 | 0 | 21 680 | 13 916 | 35 785 |
| de | German | 17 256 | 3 050 | 1 715 | 21 724 | 13 089 | 56 835 |
| el | Greek | 210 | 0 | 0 | 25 070 | 15 404 | 40 683 |
| en | English | 10 019 | 3 083 | 1 863 | 24 208 | 15 580 | 54 753 |
| es | Spanish | 14 552 | 3 479 | 1 948 | 27 001 | 15 885 | 62 865 |
| et | Estonian | 0 | 0 | 0 | 15 963 | 10 900 | 26 862 |
| fi | Finnish | 2 131 | 0 | 0 | 16 667 | 10 241 | 29 040 |
| fr | French | 3 816 | 3 535 | 2 054 | 27 352 | 17 178 | 53 936 |
| hi | Hindi | 155 | 0 | 0 | 0 | 0 | 155 |
| hr | Croatian | 12 625 | 0 | 0 | 0 | 0 | 12 625 |
| hu | Hungarian | 2 511 | 0 | 0 | 19 168 | 12 307 | 33 985 |
| it | Italian | 4 081 | 247 | 1 893 | 24 850 | 15 489 | 46 560 |
| lt | Lithuanian | 358 | 0 | 0 | 18 433 | 11 020 | 29 811 |
| lv | Latvian | 1 337 | 0 | 0 | 18 745 | 11 689 | 31 770 |
| mk | Macedonian | 2 664 | 0 | 0 | 0 | 0 | 2 664 |
| mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 14 133 |
| nl | Dutch | 9 426 | 0 | 2 082 | 24 746 | 15 563 | 51 817 |
| no | Norwegian | 2 301 | 0 | 0 | 0 | 0 | 2 301 |
| pl | Polish | 12 710 | 0 | 1 660 | 20 464 | 12 805 | 47 640 |
| pt | Portuguese | 2 318 | 0 | 2 103 | 28 599 | 16 481 | 49 502 |
| ro | Romanian | 2 433 | 0 | 1 917 | 8 200 | 9 446 | 21 995 |
| ru | Russian | 4 937 | 2 651 | 0 | 0 | 0 | 7 588 |
| sk | Slovak | 8 152 | 0 | 0 | 19 222 | 12 734 | 40 108 |
| sl | Slovene | 1 855 | 0 | 0 | 19 646 | 12 241 | 33 741 |
| sr | Serbian | 6 972 | 0 | 0 | 0 | 0 | 6 972 |
| sv | Swedish | 7 205 | 0 | 0 | 20 615 | 13 874 | 41 694 |
| uk | Ukrainian | 1 493 | 0 | 0 | 0 | 0 | 1 493 |
| | Total | 138 779 | 16 044 | 17 237 | 430 300 | 264 926 | 867 287 |
| cs | Czech | 61 962 | 2 741 | 1 639 | 20 285 | 12 920 | 99 547 |

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
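
The figures in the table can be cross-checked with simple arithmetic: each row total should equal the sum of its core and collection columns. The Python sketch below does this for a few rows copied from the table. It assumes that any small discrepancy comes from rounding each cell to thousands of words, so a tolerance of one thousand is allowed; the function name and the tolerance are illustrative choices, not part of the corpus documentation.

```python
# Rows copied from the table above, in thousands of words:
# (code, core, Syndicate, Presseurop, Acquis, Europarl, total)
ROWS = [
    ("de", 17256, 3050, 1715, 21724, 13089, 56835),
    ("en", 10019, 3083, 1863, 24208, 15580, 54753),
    ("pl", 12710, 0, 1660, 20464, 12805, 47640),
    ("ru", 4937, 2651, 0, 0, 0, 7588),
    ("Total", 138779, 16044, 17237, 430300, 264926, 867287),
    ("cs", 61962, 2741, 1639, 20285, 12920, 99547),
]


def check_row(row, tolerance=1):
    """Compare the stated total with the sum of the component columns.

    The tolerance (in thousands of words) is an assumption that covers
    differences caused by rounding each cell separately.
    """
    code, *parts, total = row
    computed = sum(parts)
    diff = total - computed
    return code, computed, diff, abs(diff) <= tolerance


if __name__ == "__main__":
    for code, computed, diff, ok in map(check_row, ROWS):
        status = "ok" if ok else "MISMATCH"
        print(f"{code:>5}: computed {computed:>7}, difference {diff:+d} ({status})")
```

The Czech row is checked separately from the foreign-language total because, as noted above, each Czech text is counted only once regardless of how many foreign counterparts it has.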

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Czech in Czech in English2) in English Morče
Dutch in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian Estonian and English TreeTagger
French in English TreeTagger
German in German TreeTagger
Hungarian in English HunPos
Italian in English TreeTagger
Lithuanian in Czech and English in English Author: Vidas Daudaravičius
Norwegian in English in Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Portuguese Spanish TreeTagger
Russian in English in English3) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene English totale
Spanish in English TreeTagger

See Park Manual for advice on the use of tags in queries.

Problems, comments, suggestions

… on the content of the corpus and on the search interfaces are welcome at

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

  • Political commentaries from Project Syndicate (http://www.project-syndicate.org/): “The highest quality commentaries and analysis from distinguished voices across the world.”
  • Newspaper texts in a number of languages from the Presseurop server
  • Legal texts in EU languages from the JRC-ACQUIS corpus
  • Proceedings of the European Parliament from the EuroParl corpus
  • Slovak-Czech concordances from the Slovak National Corpus
  • The short stories My 1989 in a number of languages from Goethe Institut
  • A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
  • George Orwell's novel 1984 in a number of languages from the Multext-East corpus
  • Ukrainian and Polish texts from the PolUkr corpus (in prep.)
  • Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober

Pre-processing:

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

  • Morče for Czech
  • TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, German, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
  • Morfeusz and TaKIPI for Polish
  • HunPos for Hungarian
  • Tagger for Slovak (thanks to Radovan Garabík)
  • Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
  • Tagger for Norwegian (thanks to Pavel Vondřička)
  • totale for Slovene (thanks to Tomaž Erjavec)

Corpus Query Engine:

Last update: 2 February 2014

1)
@article{cermak:rosen:10,
  Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen},
  Issn = {1384-6655},
  Journal = {International Journal of Corpus Linguistics},
  Number = {3},
  Pages = {411--427},
  Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus},
  Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf},
  Volume = {13},
  Year = {2012}
}
2)
There is a helper application to assist you with queries including Czech morphological tags. Click here.
3)
Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.