This is an old revision of the document!


InterCorp Release 10

Name | Czech – core | Czech – collections | other – core | other – collections

Positions
Number of tokens | 127,413,531 | 118,069,703 | 311,809,130 | 1,551,411,225
Number of word forms | 102,609,763 | 89,841,420 | 258,807,848 | 1,225,034,182

Structural attributes
Number of documents | 1,507 | 6 | 3,232 | 106
Number of div | 1,507 | 111,672 | 3,232 | 1,841,341
Number of sentences | 8,803,067 | 13,593,172 | 19,207,592 | 142,734,479

Further information
reference | YES
representative | NO
publication date | 2017
foreign languages | 39
tagged languages | 23
lemmatized languages | 22

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed with a standard web browser via KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech and a brief summary also in English.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).

References

If you publish results based on InterCorp we would appreciate a link to the project site www.korpus.cz/intercorp. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. (2017) The InterCorp Corpus – English, German 1), version 10 of ?? September 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available on-line: http://www.korpus.cz

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes:

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 10 from September 2017 is 258 mil. words in the aligned foreign language texts in the core part and 1,225 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections
Setup of the parallel corpus – the core
Setup of the parallel corpus – collections

Corpus size in thousands of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Bible Total
ar Arabic 34 0 0 0 0 0 0 34
be Belarusian 3,967 0 0 0 0 0 0 3,967
bg Bulgarian 6,465 0 0 13,572 9,067 0 0 29,103
ca Catalan 4,645 0 0 0 0 0 736 5,381
da Danish 4,548 0 0 20,313 13,916 14,430 657 53,581
de German 33,053 4,457 2,483 20,610 13,089 8,393 724 82,809
el Greek 0 0 0 23,854 15,404 23,715 0 62,972
en English 24,567 4,604 2,670 22,902 15,576 52,123 730 123,172
es Spanish 21,036 5,322 2,859 26,262 16,249 36,650 0 108,377
et Estonian 0 0 0 14,896 10,899 10,298 0 36,093
fi Finnish 4,074 0 0 15,489 10,175 15,098 544 45,380
fr French 15,073 5,391 3,046 26,200 17,179 25,991 764 93,644
he Hebrew 0 0 0 0 0 16,221 0 16,221
hi Hindi 409 0 0 0 0 0 0 409
hr Croatian 20,146 0 0 0 0 19,049 571 39,767
hu Hungarian 5,626 0 0 17,853 12,198 21,115 0 56,791
is Icelandic 0 0 0 0 0 1,585 0 1,585
it Italian 10,784 1,141 2,747 23,771 15,494 14,701 684 69,321
ja Japanese 0 0 0 0 0 113 0 113
lt Lithuanian 358 0 0 17,316 11,213 558 471 29,916
lv Latvian 2,025 0 0 17,533 11,682 280 0 31,521
mk Macedonian 5,939 0 0 0 0 1,877 0 7,816
ms Malay 0 0 0 0 0 3,521 0 3,521
mt Maltese 0 0 0 13,953 0 0 0 13,953
nl Dutch 13,454 711 2,953 23,416 15,558 29,373 717 86,181
no Norwegian 5,305 0 0 0 0 0 722 6,026
pl Polish 23,238 0 2,378 19,594 12,811 26,572 583 85,176
pt Portuguese 3,473 520 3,000 27,301 16,485 43,392 760 94,930
rn Romani 14 0 0 0 0 0 0 14
ro Romanian 3,888 0 2,738 8,092 9,446 34,129 0 58,293
ru Russian 5,978 3,767 0 0 0 6,887 565 17,197
sk Slovak 8,545 0 0 18,400 12,734 5,134 561 45,375
sl Slovenian 2,952 0 0 18,485 12,241 17,025 0 50,702
sq Albanian 0 0 0 0 0 2,004 0 2,004
sr Serbian 10,207 0 0 0 0 20,728 0 30,934
sv Swedish 10,269 0 0 19,609 13,840 14,694 638 59,051
tr Turkish 0 0 0 0 0 21,191 0 21,191
uk Ukrainian 8,736 0 0 0 0 246 600 9,583
vi Vietnamese 0 0 0 0 0 1,474 0 1,474
Subtotal 258,808 25,913 24,874 409,403 265,255 488,562 11,027 1,483,842
cs Czech 102,610 4,131 2,315 19,218 12,923 50,688 566 192,451
TOTAL 361,418 30,044 27,189 428,621 278,178 539,250 11,593 1,676,293

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
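
As a quick consistency check, the Total column of a row can be compared with the sum of its parts. The following is a minimal sketch in Python, not a tool provided with InterCorp: the function names `parse_row` and `row_is_consistent` and the rounding tolerance are assumptions made for illustration. It reads rows in the space-separated form used in the table above.

```python
def parse_row(line):
    """Split a table row into its label and a list of integer values
    (thousands of words)."""
    tokens = line.split()
    # The last eight tokens are the numeric columns: Core, Syndicate,
    # Presseurop, Acquis, Europarl, Subtitles, Bible, Total.
    values = [int(t.replace(",", "")) for t in tokens[-8:]]
    label = " ".join(tokens[:-8])
    return label, values


def row_is_consistent(values, tolerance=4):
    """Check that the seven parts add up to the Total.

    The tolerance (an assumed value) allows for each figure having been
    rounded to thousands independently.
    """
    *parts, total = values
    return abs(sum(parts) - total) <= tolerance


rows = [
    "de German 33,053 4,457 2,483 20,610 13,089 8,393 724 82,809",
    "en English 24,567 4,604 2,670 22,902 15,576 52,123 730 123,172",
    "cs Czech 102,610 4,131 2,315 19,218 12,923 50,688 566 192,451",
]

for line in rows:
    label, values = parse_row(line)
    print(label, "OK" if row_is_consistent(values) else "MISMATCH")
```

The same check can be run column-wise, keeping in mind that each Czech text is counted only once.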

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Catalan in English TreeTagger
Croatian in English ReLDI Tagger
Czech in Czech and English in English Morče
Dutch in English in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian in Estonian and English TreeTagger
Finnish in English*) in English*) OMorFi +HunPOS
French in English TreeTagger
German in English**) in German RFTagger
Hungarian in English RFTagger
Icelandic in English IceStagger
Italian in English TreeTagger
Latvian in Latvian LVTagger
Norwegian in English and Norwegian VISL
Polish in English and Polish in English Morfeusz, TaKIPI
Portuguese in Spanish TreeTagger
Russian in English in English***) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene in English and Slovene in English ToTaLe
Serbian in English ReLDI Tagger
Spanish in English TreeTagger
Swedish in Swedish and English Stagger

*) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].

**) Within a single morphological tag a colon rather than a period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.

***) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.
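
The relation between the condensed Finnish tags in note *) and their bracketed CATEGORY=VALUE form can be sketched in a few lines of Python. This is only an illustration: the function `expand_tag` is hypothetical, and the two category lists are copied from the two examples in the note. Other parts of speech are likely to need different lists, which should be taken from the tagset description.

```python
def expand_tag(condensed, categories):
    """Pair each colon-separated value of a condensed tag with a category name."""
    values = condensed.split(":")
    if len(values) != len(categories):
        raise ValueError("number of values and categories differ")
    return " ".join(
        f"[{category}={value.upper()}]"
        for category, value in zip(categories, values)
    )


# Category orders taken from the two examples in note *); these are
# assumptions for any tag other than the two shown there.
VERB_PARTICIPLE = ["POS", "NUM", "CASE", "VOICE", "PCP", "CMP"]
PERSONAL_PRONOUN = ["POS", "SUBCAT", "NUM", "CASE", "CASECHANGE"]

print(expand_tag("V:Sg:Nom:Act:PrfPrc:Pos", VERB_PARTICIPLE))
print(expand_tag("Pron:Pers:Sg:Ade:Up", PERSONAL_PRONOUN))
```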

In some other languages, too, the tag formats specified in the tagset descriptions differ from those actually used in the corpus. Please check the tag format before making a tag query if you are not sure. On a page displaying results, open the View/Corpus-specific settings… menu, check the tag option in the Positional attributes box and choose the for each token option in the Viewing options box.

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.
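
The rule for contracted forms can be illustrated with a small Python sketch that produces the text to type into a Phrase query. The lookup table `SPLIT_FORMS` and the function `phrase_for` are hypothetical helpers. The splits listed are only those mentioned above, and the actual tokenizer may split other forms differently.

```python
# Splits copied from the examples in the text; not an exhaustive list.
SPLIT_FORMS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_for(form):
    """Return the Phrase query text for a form.

    Known contracted forms are returned as their parts separated by a
    space; any other form is returned unchanged.
    """
    parts = SPLIT_FORMS.get(form)
    return " ".join(parts) if parts else form


for form in ["can't", "I'm", "byłam", "gdybyś", "house"]:
    print(form, "->", phrase_for(form))
```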

Morphological tags that include characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must have each such character preceded by a backslash in queries: tag="wp\$".
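
When tag queries are generated by a script, the escaping can be done automatically. The sketch below is a minimal Python illustration. The helper names `escape_tag` and `tag_query` and the exact set of characters treated as special are assumptions, so the result should be checked against the query language documentation. It is meant for literal tags only: a tag written deliberately as a pattern (e.g. with ".*") must not be passed through it.

```python
# Characters assumed to have a special meaning in regular expressions.
REGEX_SPECIAL = set("\\.^$*+?()[]{}|")


def escape_tag(tag):
    """Put a backslash before every regex-special character in a literal tag."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in tag)


def tag_query(tag):
    """Build a tag condition in the form shown above, e.g. tag="wp\\$"."""
    return f'tag="{escape_tag(tag)}"'


print(tag_query("wp$"))
print(tag_query("nn"))
```

Tags without special characters pass through unchanged, so the helper can be applied to every literal tag.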

Structural attributes

Structure | Attribute | Description | Values
doc | doc.id | unique document identifier | text
doc | doc.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
doc | doc.version | version | number
doc | doc.wordcount | document size in words | number
div | div.id | text identification | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE / _BIBLE
div | div.group | division in | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible
div | div.wordcount | number of words | number
div | div.author | author | last name, first name
div | div.title | full title | text
div | div.publisher | publisher | text
div | div.pubplace | publication place | text
div | div.pubyear | publication year | date
div | div.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious
div | div.original | is the text an original? | Yes / No
div | div.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
div | div.translator | translator | last name, first name
div | div.transsex | translator's gender | F / M
div | div.authsex | author's gender | F / M
p | p.id | unique paragraph identifier | text
s | s.id | unique sentence identifier | text
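
Structural attributes such as div.group or div.txtype can be used to restrict a query to part of the corpus. The following Python sketch builds such a restriction as a string. It assumes that the search interface accepts a CQL clause of the form within <structure attribute="value" />, which should be verified in the KonText documentation. The helper names `within_clause` and `restrict` are hypothetical.

```python
def within_clause(structure, attribute, value):
    """Build a restriction on a structural attribute (assumed CQL syntax)."""
    return f'within <{structure} {attribute}="{value}" />'


def restrict(query, attr, value):
    """Append a restriction to a query.

    attr is given in the dotted form used in the table above,
    e.g. "div.group" for attribute "group" of structure "div".
    """
    structure, attribute = attr.split(".", 1)
    return f"{query} {within_clause(structure, attribute, value)}"


# Restrict a lemma query to the manually aligned core texts.
print(restrict('[lemma="house"]', "div.group", "Core"))
# Restrict a word query to fiction.
print(restrict('[word="house"]', "div.txtype", "fiction"))
```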

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

See also

1) Insert actually used languages.