
InterCorp: Release 5

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC).

After registration, the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

  • InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary in English.
  • Unlike most other ICNC corpora, which are static (they do not change over time), InterCorp is incremental: both its size and the number of languages keep growing.

Texts in the corpus

The core of InterCorp consists mostly of fiction, aligned manually. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The current selection includes political commentaries published by Project Syndicate and Presseurop, and a package of legal texts, the Acquis Communautaire. Because these texts have been aligned automatically, search results may include a higher number of misaligned segments.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 5 (see Version history) is 91,529,000 words in the aligned foreign-language texts of the core and 451,112,000 words in the collections. The shares of the core and the collections in the corpus can be seen in the following charts. The size is shown in millions of words.

Setup of the parallel corpus – the core
Setup of the parallel corpus – collections
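
As a rough check on the two totals quoted above, the shares of the core and the collections can be computed directly from the word counts. This is only a minimal sketch: the constant and function names are illustrative and are not part of any InterCorp tool.

```python
# Foreign-language word counts for release 5, as quoted in the text above.
CORE_WORDS = 91_529_000
COLLECTION_WORDS = 451_112_000


def shares(core, collections):
    """Return the fractions of the total taken by the core and by the collections."""
    total = core + collections
    return core / total, collections / total


core_share, collection_share = shares(CORE_WORDS, COLLECTION_WORDS)
print(f"core: {core_share:.1%}, collections: {collection_share:.1%}")
# prints: core: 16.9%, collections: 83.1%
```

In other words, the manually aligned core makes up roughly one sixth of the foreign-language material in this release.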

Corpus size in thousands of words

Language    Core  Syndicate  Presseurop   Acquis    Total
be            67          0           0        0       67
bg         1,415          0           0   13,816   15,231
da           189          0           0   21,680   21,869
de        12,004      2,567         752   21,723   37,048
el             0          0           0   25,069   25,069
en         5,914      2,567         799   24,207   33,489
es        11,811      2,896         860   27,001   42,570
et             0          0           0   15,962   15,962
fi         2,081          0           0   16,667   18,748
fr         3,217      2,969         874   27,351   34,414
hr         8,103          0           0        0    8,103
hu         1,122          0           0   19,167   20,290
it         3,484         80         793   24,849   29,207
lt           352          0           0   18,432   18,785
lv         1,085          0           0   18,744   19,830
mk            32          0           0        0       32
mt             0          0           0   14,133   14,133
nl         6,486          0         899   24,746   32,132
no         2,301          0           0        0    2,301
pl         8,396          0         710   20,464   29,571
pt         2,127          0         913   28,599   31,640
ro         1,370          0         819    8,199   10,389
ru         1,664      2,304           0        0    3,969
sk         7,257          0           0   19,221   26,479
sl           991          0           0   19,645   20,636
sr         4,295          0           0        0    4,295
sv         5,754          0           0   20,615   26,369
Total     91,528     13,385       7,425  430,300  542,640
cs *)     52,651      2,285         704   20,285   75,926

*) Each Czech text is counted only once, even though it may have more than one foreign counterpart.
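
The pivot arrangement and the counting rule in the footnote can be illustrated with a small data model. This is a hypothetical sketch, not the actual InterCorp storage format: the class `AlignedText`, the function `corpus_sizes`, and the toy word counts are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedText:
    """One Czech pivot text together with its aligned foreign-language versions."""
    title: str
    czech_words: int
    # Maps a language code to the word count of that foreign version.
    foreign_words: dict = field(default_factory=dict)


def corpus_sizes(texts):
    """Return (Czech total, per-language foreign totals).

    The Czech word count is added once per text, no matter how many
    foreign versions are aligned with it.
    """
    czech_total = 0
    foreign_totals = {}
    for text in texts:
        czech_total += text.czech_words
        for lang, words in text.foreign_words.items():
            foreign_totals[lang] = foreign_totals.get(lang, 0) + words
    return czech_total, foreign_totals


# Invented toy data, not taken from the corpus.
texts = [
    AlignedText("novel A", 1000, {"en": 1100, "de": 1050}),
    AlignedText("novel B", 2000, {"pl": 1900}),
]
czech_total, foreign_totals = corpus_sizes(texts)
print(czech_total)     # 3000
print(foreign_totals)  # {'en': 1100, 'de': 1050, 'pl': 1900}
```

Under this counting rule the Czech total is not comparable with the sum over the foreign languages, because a Czech text aligned with several translations contributes to several foreign-language rows but only once to the Czech row.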

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

language tags lemmas brief description detailed description tool
Bulgarian in English TreeTagger
Czech in Czech in English1) in English Morče
Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian Estonian and English TreeTagger
French in English TreeTagger
German in German in German TreeTagger
Hungarian in English HunPos
Italian in English TreeTagger
Lithuanian in Czech and English Vidas Daudaravičius
Norwegian in English in Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Portuguese Spanish TreeTagger
Russian in English in English2) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene English totale
Spanish in English TreeTagger

See Park Manual for advice on the use of tags in queries.

Please note: work in progress

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing

  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit
  • Aligner Hunalign

Taggers/lemmatizers:

Corpus Query Engine:

Data:

Last update: 20 July 2012


1) A helper application is available to assist you with queries that include Czech morphological tags.
2) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.