

InterCorp: Release 7

Name | Czech – core | Czech – collections | other – core | other – collections
Positions
Number of tokens | 95 814 527 | 116 374 744 | 208 845 922 | 1 546 493 833
Number of word forms | 77 121 760 | 88 303 155 | 173 224 560 | 1 216 880 655
Structural attributes
Number of documents | 1 184 | 5 | 2 294 | 87
Number of div | 1 184 | 107 388 | 2 294 | 1 817 043
Number of sentences | 6 595 174 | 13 497 188 | 12 796 035 | 142 788 867
Further information
reference | YES
representative | NO
publication date | 2014
foreign languages | 38
tagged languages | 20
lemmatized languages | 17

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser through three search interfaces: KonText, Park and NoSketch Engine.

All three search interfaces are based on the Manatee corpus engine and access identical texts. Because of the considerable overhead of operating all three interfaces, the Czech National Corpus has for some time been aiming at a single, universal interface, KonText. Park and NoSketch Engine will most likely be discontinued as early as the end of March 2015. We would like to use this opportunity to invite all users of InterCorp to migrate to KonText. We believe that this step is well justified: in addition to the new functions that are already implemented, many improvements are planned, in part based on users' feedback. We realize that many users may find this step quite daunting. In addition to giving advance notice, we address this concern by offering consultations, training and seminars adapted to the needs of specific users. Please ask any of the CNC staff to arrange the details.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is published roughly once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (in Park starting from release 5, in the other interfaces from release 6).

References

In any results of your work based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. You might also consider citing the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427 (bibtex 1), electronic edition at ingentaConnect, preprint version).

For more references see here or the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

Texts in the corpus

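The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. In the present release 7 these are the collections listed in the table of corpus sizes: Syndicate, Presseurop, Acquis (Acquis Communautaire), Europarl and Subtitles (Open Subtitles).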

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resources; in particular, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items that are missing in the original resource but can be detected from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. In release 6 from April 2013, the total size of the available part of InterCorp was 138,779,000 words in the aligned foreign-language texts in the core part and 728,508,000 in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The size is shown in millions of words.

Setup of the parallel corpus – the core
Setup of the parallel corpus – collections
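
The pivot arrangement described above can be pictured as a simple data structure. The following Python sketch is a hypothetical illustration only, not the actual storage format of the corpus: the segment identifiers (such as `cs:1`) and the function `link_via_pivot` are invented for the example. It assumes that every foreign-language version is aligned only with the Czech version, so that a link between two foreign languages is derived through the Czech segments they share.

```python
# Hypothetical alignments for one text: Czech segment id -> foreign segment ids.
# Every foreign version is aligned with the single Czech version only.
alignments = {
    "en": {"cs:1": ["en:1"], "cs:2": ["en:2", "en:3"]},
    "de": {"cs:1": ["de:1"], "cs:2": ["de:2"]},
}


def link_via_pivot(alignments, lang_a, lang_b):
    """Pair segments of two foreign versions that share a Czech pivot segment."""
    linked = {}
    for cs_id, segs_a in alignments[lang_a].items():
        segs_b = alignments[lang_b].get(cs_id)
        if segs_b:  # skip Czech segments with no counterpart in lang_b
            linked[cs_id] = (segs_a, segs_b)
    return linked


pairs = link_via_pivot(alignments, "en", "de")
for cs_id, (en_segs, de_segs) in pairs.items():
    print(cs_id, en_segs, de_segs)
```

A consequence of this design is that any error in either Czech alignment carries over to the derived pair, which is one reason to treat pairs of two foreign languages with some caution.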

Corpus size in thousands of words

Code Language Core Syndicate Presseurop Acquis Europarl Subtitles Total
ar Arabic 34 0 0 0 0 0 34
be Belarusian 1,751 0 0 0 0 0 1,751
bg Bulgarian 4,923 0 0 13,816 9,083 0 27,823
ca Catalan 4,498 0 0 0 0 0 4,498
da Danish 1,311 0 0 21,680 13,916 14,430 51,336
de German 26,315 3,050 1,715 21,724 13,089 8,367 74,260
el Greek 0 0 0 25,070 15,404 23,715 64,188
en English 12,641 3,083 1,863 24,208 15,580 52,101 109,476
es Spanish 16,907 3,479 1,948 27,001 15,885 36,379 101,599
et Estonian 0 0 0 15,963 10,900 10,296 37,158
fi Finnish 3,054 0 0 16,455 10,175 15,098 44,782
fr French 6,976 3,535 2,054 27,352 17,178 25,962 83,057
he Hebrew 0 0 0 0 0 16,221 16,221
hi Hindi 206 0 0 0 0 0 206
hr Croatian 14,210 0 0 0 0 19,093 33,303
hu Hungarian 4,014 0 0 19,177 12,307 21,240 56,737
is Icelandic 0 0 0 0 0 1,585 1,585
it Italian 6,313 247 1,893 24,849 15,489 14,654 63,446
ja Japanese 0 0 0 0 0 113 113
lt Lithuanian 358 0 0 18,393 11,213 558 30,522
lv Latvian 1,337 0 0 18,745 11,689 280 32,051
mk Macedonian 3,221 0 0 0 0 1,877 5,098
ms Malay 0 0 0 0 0 3,521 3,521
mt Maltese 0 0 0 14,133 0 0 14,133
nl Dutch 9,370 0 2,082 24,746 15,563 29,363 81,125
no Norwegian 4,103 0 0 0 0 0 4,103
pl Polish 16,009 0 1,662 20,628 12,811 26,572 77,683
pt Portuguese 2,393 0 2,103 28,603 16,485 43,392 92,976
ro Romanian 3,156 0 1,917 8,200 9,446 34,129 56,847
ru Russian 3,308 2,651 0 0 0 6,886 12,844
sk Slovak 7,402 0 0 19,223 12,734 5,134 44,493
sl Slovene 900 0 0 19,646 12,241 17,025 49,811
sq Albanian 0 0 0 0 0 2,004 2,004
sr Serbian 8,413 0 0 0 0 20,777 29,189
sv Swedish 7,789 0 0 20,586 13,840 14,694 56,909
tr Turkish 0 0 0 0 0 21,191 21,191
uk Ukrainian 2,310 0 0 0 0 246 2,556
vi Vietnamese 0 0 0 0 0 1,474 1,474
Subtotal 173,225 16,044 17,239 430,195 265,029 488,373 1,390,105
cs Czech 77,122 2,749 1,640 20,303 12,923 50,688 165,425
Total 250,346 18,793 18,880 450,498 277,952 539,061 1,555,530

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
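
The arithmetic behind the table can be checked mechanically. The following Python sketch is illustrative only: the row strings are copied from the table above, while the helper names `parse_row` and `row_is_consistent` are hypothetical and not part of any CNC tool. It checks that the six per-part figures of a row add up to the stated Total, with a small tolerance because every figure is rounded to thousands independently.

```python
def parse_row(line):
    """Split a table row into (label, six per-part figures, total)."""
    fields = line.split()
    # The last seven fields are numeric; whatever precedes them is the label.
    numbers = [int(f.replace(",", "")) for f in fields[-7:]]
    label = " ".join(fields[:-7])
    return label, numbers[:-1], numbers[-1]


def row_is_consistent(line, tolerance=3):
    """True if the parts sum to the stated total within a rounding tolerance (in thousands)."""
    _, parts, total = parse_row(line)
    return abs(sum(parts) - total) <= tolerance


rows = [
    "de German 26,315 3,050 1,715 21,724 13,089 8,367 74,260",
    "Subtotal 173,225 16,044 17,239 430,195 265,029 488,373 1,390,105",
    "cs Czech 77,122 2,749 1,640 20,303 12,923 50,688 165,425",
    "Total 250,346 18,793 18,880 450,498 277,952 539,061 1,555,530",
]

for row in rows:
    label, parts, total = parse_row(row)
    print(label, sum(parts), total, row_is_consistent(row))
```

The same tolerance is needed when adding the Subtotal and Czech rows column by column: for the core, 173,225 + 77,122 gives 250,347, one thousand more than the 250,346 shown in the Total row, a difference explained by rounding.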

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Czech in Czech in English2) in English Morče
Dutch in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian in Estonian and English TreeTagger
Finnish in English3) OMorFi+HunPOS
French in English TreeTagger
German in English4) in German RFTagger
Hungarian in English HunPos
Icelandic IceStagger
Italian in English TreeTagger
Lithuanian in Czech and English in English Author: Vidas Daudaravičius
Norwegian in English in Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Portuguese Spanish TreeTagger
Russian in English in English5) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene English totale
Spanish in English TreeTagger
Swedish Stagger

See the Park Manual for advice on the use of tags in queries.

Problems, comments, suggestions

… on the content of the corpus and on the search interfaces are welcome at

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing:

  • parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

  • MorfFlex, Morče and LanGr for Czech
  • TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
  • Morfeusz and TaKIPI for Polish
  • HunPOS for Hungarian and other languages
  • Tagger for Slovak (thanks to Radovan Garabík)
  • Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
  • Tagger for Norwegian (thanks to Pavel Vondřička)
  • totale for Slovene (thanks to Tomaž Erjavec)
  • RFTagger for German
  • OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
  • Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)

Corpus Query Engine:

Last update: 19 December 2014

1)
@article{cermak:rosen:10,
  Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen},
  Issn = {1384-6655},
  Journal = {International Journal of Corpus Linguistics},
  Number = {3},
  Pages = {411--427},
  Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus},
  Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf},
  Volume = {13},
  Year = {2012}
}
2)
There is a helper application to assist you with queries including Czech morphological tags. Click here.
3)
The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
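
As an illustration of the condensed format, the following Python sketch expands a condensed tag into attribute-value pairs. The slot names for each part of speech are taken only from the two examples in this note; the table `SLOTS` and the function `expand_tag` are hypothetical, and other parts of speech or other verb forms would need their own slot lists.

```python
# Slot names per part of speech, inferred from the two examples in this note only.
SLOTS = {
    "V": ["POS", "NUM", "CASE", "VOICE", "PCP", "CMP"],
    "Pron": ["POS", "SUBCAT", "NUM", "CASE", "CASECHANGE"],
}


def expand_tag(tag):
    """Expand a condensed tag such as 'V:Sg:Nom:Act:PrfPrc:Pos' into '[NAME=VALUE]' pairs."""
    values = tag.split(":")
    names = SLOTS.get(values[0])
    if names is None or len(names) != len(values):
        raise ValueError(f"no slot list known for tag {tag!r}")
    return " ".join(f"[{name}={value.upper()}]" for name, value in zip(names, values))


print(expand_tag("V:Sg:Nom:Act:PrfPrc:Pos"))
print(expand_tag("Pron:Pers:Sg:Ade:Up"))
```

The sketch raises an error for any tag whose shape it does not know, rather than guessing at slot names.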
4)
Within a single tag, the individual morphological categories are separated by colons, e.g. ADJA:Pos:Nom:Sg:Fem.
5)
Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.