InterCorp Release 9

Name Czech – core Czech – collections other – core other – collections
Positions Number of tokens 120,443,181 117,981,673 278,445,878 1,556,840,965
Number of word forms 96,956,714 89,645,545 231,501,606 1,228,896,294
Structural attributes Number of documents 1,430 5 2,934 89
Number of div 1,430 111,263 2,934 1,849,184
Number of sentences 8,308,814 13,588,082 17,210,601 143,478,514
Further information reference YES
representative NO
publication date 2016
foreign languages 39
tagged languages 23
lemmatized languages 20

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial in Czech is available here.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is usually published once a year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).

References

If you publish results based on InterCorp, we would appreciate a link to the project site www.korpus.cz/intercorp. In your scientific publications, please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work using InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M.: Korpus InterCorp – English, German1), version 7 of 19 Dec 2014. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz

Texts in the corpus

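The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes the Syndicate, Presseurop, Acquis (Acquis Communautaire), Europarl and Subtitles (Open Subtitles) collections.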

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resources: texts that have no Czech counterpart, for instance, are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 9 from July 2016 is 231 mil. words in the aligned foreign language texts in the core part and 1,228 mil. words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections
Setup of the parallel corpus – the core
Setup of the parallel corpus – collections

Corpus size in thousands of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Total
ar Arabic 34 0 0 0 0 0 34
be Belarusian 3,025 0 0 0 0 0 3,025
bg Bulgarian 6,007 0 0 13,816 9,083 0 28,907
ca Catalan 4,632 0 0 0 0 0 4,632
da Danish 3,556 0 0 21,679 13,915 14,429 53,581
de German 31,168 3,725 2,482 21,723 13,089 8,366 80,556
el Greek 0 0 0 25,069 15,403 23,714 64,187
en English 21,208 3,818 2,670 24,207 15,580 52,101 119,586
es Spanish 19,310 4,324 2,816 27,001 15,885 36,378 105,716
et Estonian 0 0 0 15,962 10,899 10,296 37,158
fi Finnish 3,645 0 0 16,455 10,175 15,097 45,373
fr French 12,406 4,393 2,928 27,351 17,178 25,961 90,219
he Hebrew 0 0 0 0 0 16,221 16,221
hi Hindi 408 0 0 0 0 0 408
hr Croatian 19,980 0 0 0 0 19,042 39,023
hu Hungarian 5,818 0 0 19,176 12,306 21,239 58,541
is Icelandic 0 0 0 0 0 1,584 1,584
it Italian 8,694 651 2,707 24,849 15,489 14,653 67,046
ja Japanese 0 0 0 0 0 113 113
lt Lithuanian 358 0 0 18,392 11,212 557 30,521
lv Latvian 1,336 0 0 18,709 11,682 279 32,007
mk Macedonian 4,663 0 0 0 0 1,877 6,540
ms Malay 0 0 0 0 0 3,520 3,520
mt Maltese 0 0 0 14,133 0 0 14,133
nl Dutch 11,444 314 2,955 24,746 15,563 29,362 84,386
no Norwegian 4,965 0 0 0 0 0 4,965
pl Polish 21,433 0 2,378 20,627 12, 26,572 83,822
pt Portuguese 2,605 369 2,999 28,602 16,484 43,391 94,454
rn Romani 5 0 0 0 0 0 5
ro Romanian 3,432 0 2,737 8,199 9,446 34,128 57,944
ru Russian 4,788 3,174 0 0 0 6,885 14,848
sk Slovak 8,066 0 0 19,222 12,734 5,134 45,158
sl Slovenian 2,057 0 0 19,645 12,240 17,024 50,968
sq Albanian 0 0 0 0 0 2,003 2,003
sr Serbian 9,886 0 0 0 0 20,720 30,607
sv Swedish 8,959 0 0 20,585 13,840 14,693 58,079
tr Turkish 0 0 0 0 0 21,190 21,190
uk Ukrainian 7,597 0 0 0 0 246 7,843
vi Vietnamese 0 0 0 0 0 1,473 1,473
Subtotal 231,501 20,769 24,676 430,160 265,022 488,266 1,460,397
cs Czech 96,956 3,416 2,315 20,303 12,922 50,688 186,602
TOTAL 328,458 24,186 26,991 450,463 277,945 538,954 1,647,000

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

Language Tags Lemmas Brief description Detailed description Tool
Bulgarian in English TreeTagger
Croatian in English ReLDI Tagger
Czech in Czech and in English2) in English Morče
Dutch in Dutch TreeTagger
English in English in English + additions TreeTagger
Estonian in Estonian and English TreeTagger
Finnish in English3) OMorFi+HunPOS
French in English TreeTagger
German in English4) in German RFTagger
Hungarian in English HunPos
Icelandic in English IceStagger
Italian in English TreeTagger
Latvian in Latvian LVTagger
Lithuanian in Czech and English in English Author: Vidas Daudaravičius
Norwegian in English and Norwegian VISL
Polish in English and Polish in English Morfeusz, TaKIPI
Portuguese in Spanish TreeTagger
Russian in English in English 5) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Slovene in English and Slovene in English ToTaLe
Serbian in English ReLDI Tagger
Spanish in English TreeTagger
Swedish in Swedish and English Stagger

Queries for contracted forms in tagged or lemmatized texts may fail. Forms such as can't or I'm are split by the tagger into two parts (ca+n't and I+'m), each with its own lemma and tag; the same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Only the individual parts of a contracted form are assigned a tag and a lemma. Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction but has been split anyway. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
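The splitting of English contractions can be approximated in a few lines of Python. This is only a sketch: the function name `phrase_for_contraction` is hypothetical, and the list of clitics beyond n't and 'm (the two cases named above) is an assumption, so the actual tagger may split some forms differently.

```python
import re

# n't and 'm follow the examples in the text (ca+n't, I+'m);
# the other clitics ('re, 's, 've, 'll, 'd) are assumed for illustration.
_CLITIC = re.compile(r"^(.+?)(n't|'(?:m|re|s|ve|ll|d))$", re.IGNORECASE)


def phrase_for_contraction(form):
    """Return the form as space-separated parts, ready for a Phrase query."""
    form = form.replace("\u2019", "'")  # normalise a typographic apostrophe
    match = _CLITIC.match(form)
    if match is None:
        return form  # not recognised as a contraction, leave untouched
    return f"{match.group(1)} {match.group(2)}"


print(phrase_for_contraction("can't"))  # ca n't
print(phrase_for_contraction("I'm"))    # I 'm
```

The returned string is what would be typed into the Phrase field; words without a recognised clitic are passed through unchanged.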

Morphological tags including characters with a special meaning in regular expressions, e.g. “$” in the English tag “wp$”, must be preceded in queries by a backslash: tag=“wp\$”.
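When tags are inserted into queries by a script, the escaping can be automated. The following is a minimal sketch, assuming that the characters listed in `REGEX_SPECIAL` are the ones treated as special by the query engine; the helper names `escape_tag` and `tag_condition` are hypothetical.

```python
# Characters assumed to have a special meaning in regular expressions.
REGEX_SPECIAL = set(".^$*+?()[]{}|\\")


def escape_tag(tag):
    """Put a backslash before every regex-special character in a tag."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in tag)


def tag_condition(tag):
    """Build a tag="..." condition that matches the tag literally."""
    return f'tag="{escape_tag(tag)}"'


print(tag_condition("wp$"))  # tag="wp\$"
```

Tags without special characters, such as ADJA:Pos:Nom:Sg:Fem, pass through unchanged.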

Structural attributes

Structure | Attribute | Description | Values
doc | doc.id | unique document identifier | text
doc | doc.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
doc | doc.version | version | number
doc | doc.wordcount | document size in words | number
div | div.id | text identification | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE
div | div.group | division in | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate
div | div.wordcount | number of words | number
div | div.author | author | last name, first name
div | div.title | full title | text
div | div.publisher | publisher | text
div | div.pubplace | publication place | text
div | div.pubyear | publication year | date
div | div.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles
div | div.original | is the text an original? | Yes / No
div | div.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
div | div.translator | translator | last name, first name
div | div.transsex | translator's gender | F / M
div | div.authsex | author's gender | F / M
p | p.id | unique paragraph identifier | text
s | s.id | unique sentence identifier | text
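
Structural attributes such as div.txtype can be used to restrict a query to one part of the corpus. The sketch below builds such a restriction as a string. It assumes that the query interface accepts the CQL form `within <structure attribute="value" />`, which this page does not itself confirm; the function name `restrict_to_text_type` and the sample query are hypothetical.

```python
# Text types copied from the div.txtype row of the table above.
TEXT_TYPES = {
    "discussions - transcripts", "drama", "fiction",
    "journalism - commentaries", "journalism - news", "legal texts",
    "nonfiction", "other", "poetry", "subtitles",
}


def restrict_to_text_type(query, text_type):
    """Append a restriction to divs of the given text type to a query."""
    if text_type not in TEXT_TYPES:
        raise ValueError(f"unknown div.txtype value: {text_type!r}")
    # Assumed CQL syntax: within <structure attribute="value" />
    return f'{query} within <div txtype="{text_type}" />'


print(restrict_to_text_type('[lemma="dům"]', "fiction"))
# [lemma="dům"] within <div txtype="fiction" />
```

Checking the value against the documented list catches misspelt text types before the query is sent.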

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing:

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

See also

1) Insert the languages actually used.
2) There is a helper application to assist you with queries including Czech morphological tags. Click here.
3) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
4) Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.
5) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.
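
The colon-separated tags described in notes 3 and 4 can be taken apart, or searched for a single category value, with a short helper. This is a sketch under stated assumptions: the names `split_tag` and `component_pattern` are hypothetical, and the pattern is checked here with Python's `re.fullmatch`; whether the query interface matches a pattern against the whole tag in the same way is an assumption.

```python
import re


def split_tag(tag):
    """Split a condensed tag into its individual category values."""
    return tag.split(":")


def component_pattern(value):
    """Regex matching any tag that has `value` as one whole component."""
    return r"(.*:)?" + re.escape(value) + r"(:.*)?"


print(split_tag("ADJA:Pos:Nom:Sg:Fem"))
# ['ADJA', 'Pos', 'Nom', 'Sg', 'Fem']

pattern = component_pattern("Nom")
print(bool(re.fullmatch(pattern, "V:Sg:Nom:Act:PrfPrc:Pos")))  # True
```

Anchoring the value between colons keeps a search for one value from matching a longer value that merely starts with the same letters.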