InterCorp Release 15

Name                     Czech – core   Czech – collections   other – core   other – collections
Positions
  Number of tokens       148 487 713    117 094 767           434 905 960    1 551 791 814
  Number of word forms   119 933 378    90 181 070            361 991 365    1 226 159 823
Structural attributes
  Number of documents    1 743          33                    4 372          313
  Number of texts        1 743          112 393               4 372          1 846 588
  Number of sentences    10 288 141     13 626 168            26 843 652     143 334 058
Further information
  reference              YES
  representative         NO
  publication date       2022
  foreign languages      41
  tagged languages       27
  lemmatized languages   25

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed from a standard web browser via KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. Each new release may increase the size of the corpus, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6).

Texts in the corpus

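The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The present release includes the following collections, named here as in the size table below: Syndicate, Presseurop, Acquis, Europarl, Subtitles and Bible.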

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource: among those left out are texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted. As a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. In release 15, published in November 2022, the available part of InterCorp contains 362 million words in the aligned foreign-language texts of the core part and 1 226 million words in the collections. The Czech texts contain 120 million words in the core part and 90 million words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts, which show the volumes in millions of words.

Setup of the parallel corpus – the core and collections


Setup of the parallel corpus – the core


Setup of the parallel corpus – collections

Corpus size in thousands of words

Language Core Syndicate Presseurop Acquis Europarl Subtitles Bible Total
ar Arabic 34 384 0 0 0 0 0 418
be Belarusian 6 524 0 0 0 0 0 0 6 524
bg Bulgarian 7 068 0 0 13 577 9 083 0 0 29 728
ca Catalan 8 920 0 0 0 0 0 736 9 656
da Danish 8 456 0 0 20 313 13 916 14 429 657 57 770
de German 39 412 5 067 2 483 20 610 13 088 8 392 724 89 776
el Greek 0 0 0 23 853 15 404 23 709 0 62 966
en English 38 706 5 273 2 670 22 902 15 576 52 106 730 137 964
es Spanish 29 145 6 074 2 859 26 262 16 249 36 650 0 117 239
et Estonian 0 0 0 14 896 10 899 10 298 0 36 093
fi Finnish 6 674 0 0 15 269 10 108 15 047 543 47 641
fr French 21 996 5 896 3 046 26 200 17 179 25 986 764 101 067
he Hebrew 0 0 0 0 0 16 221 0 16 221
hi Hindi 409 0 0 0 0 0 0 409
hr Croatian 23 351 0 0 0 0 19 048 571 42 971
hs Upper Sorbian 128 0 0 0 0 0 0 128
hu Hungarian 6 922 8 0 17 852 12 198 21 115 0 58 095
is Icelandic 0 0 0 0 0 1 581 0 1 581
it Italian 16 384 1 389 2 747 23 771 15 494 14 700 684 75 169
ja Japanese 3 491 2 0 0 0 477 0 3 970
lt Lithuanian 502 0 0 17 316 11 213 558 471 30 059
lv Latvian 3 437 0 0 17 522 11 682 280 537 33 458
mk Macedonian 8 881 0 0 0 0 1 877 0 10 758
ms Malay 0 0 0 0 0 3 521 0 3 521
mt Maltese 0 0 0 13 935 0 0 0 13 935
nl Dutch 17 769 812 2 953 23 416 15 558 29 373 717 90 598
no Norwegian 7 851 0 0 0 0 0 724 8 575
pl Polish 28 112 0 2 380 19 604 12 817 26 576 583 90 072
pt Portuguese 6 943 739 2 782 24 598 15 193 41 468 706 92 429
rn Romani 14 0 0 0 0 0 0 14
ro Romanian 4 219 0 2 738 8 092 9 446 34 128 0 58 622
ru Russian 10 549 4 302 0 0 0 6 887 565 22 303
sk Slovak 8 596 0 0 18 399 12 727 5 133 561 45 416
sl Slovene 4 354 0 0 18 515 12 241 17 035 0 52 144
sq Albanian 0 0 0 0 0 2 003 0 2 003
sr Serbian 12 356 0 0 0 0 20 727 0 33 082
sv Swedish 17 877 0 0 19 542 13 784 14 666 638 66 507
tr Turkish 0 0 0 0 0 21 190 0 21 190
uk Ukrainian 12 712 0 0 0 0 244 596 13 551
vi Vietnamese 0 0 0 0 0 1 474 0 1 474
zh Chinese 202 604 0 0 0 2 247 0 3 054
Subtotal 361 991 30 552 24 658 406 445 263 854 489 143 11 507 1 588 151
cs Czech 119 933 4 712 2 310 19 085 12 908 50 604 562 210 114
TOTAL 481 925 35 264 26 968 425 530 276 763 539 747 12 069 1 798 266

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
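The relation between the last three rows of the table can be checked with a few lines of Python. This is only a sketch: the figures are copied by hand from the Subtotal, cs and TOTAL rows above, and the explanation that the occasional difference of 1 comes from rounding to thousands is an assumption, not something the table states.

```python
# Figures in thousands of words, copied from the Subtotal, cs and TOTAL rows.
COLUMNS = ["Core", "Syndicate", "Presseurop", "Acquis",
           "Europarl", "Subtitles", "Bible", "Total"]
SUBTOTAL = [361_991, 30_552, 24_658, 406_445, 263_854, 489_143, 11_507, 1_588_151]
CZECH = [119_933, 4_712, 2_310, 19_085, 12_908, 50_604, 562, 210_114]
TOTAL = [481_925, 35_264, 26_968, 425_530, 276_763, 539_747, 12_069, 1_798_266]


def column_differences(subtotal, czech, total):
    """Return {column: total - (subtotal + czech)} for every column."""
    return {
        name: t - (s + c)
        for name, s, c, t in zip(COLUMNS, subtotal, czech, total)
    }


diffs = column_differences(SUBTOTAL, CZECH, TOTAL)
for name, diff in diffs.items():
    # A difference of at most 1 is presumably a rounding effect.
    print(f"{name:<11}{diff:+d}")
```

Every column of TOTAL equals Subtotal plus cs, up to a difference of one thousand words.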

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation. The format, and often even the meaning, of the categories encoded in the morphosyntactic tags differs from language to language. For each tagged language we therefore provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface.

Language Tags Lemmas Brief description Detailed description Tags in the corpus Tool
Belarusian in English****) in English****) list UDPipe
Bulgarian in English in English list TreeTagger
Catalan in English in English list TreeTagger
Chinese in English in English list ZPar v0.7.5
Croatian in English in English list ReLDI Tagger
Czech in Czech and English in English list Morče
Dutch in English list TreeTagger
English in English in English + additions list TreeTagger
Estonian in Estonian and English list TreeTagger
Finnish in English*) in English*) list OMorFi +HunPOS
French in English list TreeTagger
German in English **) in German list RFTagger
Hungarian in English list RFTagger
Icelandic in English in English list IceStagger
Italian in English list TreeTagger
Japanese in English list MeCab + Unidic
Latvian in Latvian list LVTagger
Norwegian in English****) in English****) list UDPipe
Polish in English and Polish in English list Morfeusz, KRNNT
Portuguese in Spanish list TreeTagger
Russian in English in English ***) list TreeTagger
Slovak in Slovak and English in Slovak list Radovan Garabík, Morče
Slovene in English list ReLDI Tagger
Serbian in English in English list ReLDI Tagger
Spanish in English list TreeTagger
Swedish in Swedish and English list Stagger
Ukrainian in English****) in English****) list UDPipe

*) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].

**) Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.

***) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.

****) The tag is in the UD (Universal Dependencies) format. Its components are separated by a vertical bar (|), e.g. the form школы in genitive singular is tagged as: NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing. The query can be specified in the same way as for other languages, treating the tag as a string, i.e. [tag="NOUN.*Case=Gen\|Gender=Fem.*"], or the tag components can be specified separately: [tag="Case=Gen" & tag="NOUN" & tag="Gender=Fem"] (the order of categories is not significant). The result is identical in either case.
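The two query styles described in note ****) can be generated mechanically. The Python sketch below is illustrative only: the function names are hypothetical, and the string-style query joins the features with .* rather than with \| as in the example above, so that it also tolerates features that are not adjacent in the tag. It further assumes that features appear in the tag in alphabetical order, as in the sample tag.

```python
def parse_ud_tag(tag):
    """Split 'NOUN|Case=Gen|...' into the part of speech and a feature dict."""
    pos, *pairs = tag.split("|")
    return pos, dict(pair.split("=", 1) for pair in pairs)


def string_query(pos, features):
    """Build a query that treats the whole tag as one string (regex)."""
    # Assumption: features are ordered alphabetically inside the tag.
    ordered = sorted(features.items(), key=lambda item: item[0].lower())
    body = ".*".join(f"{name}={value}" for name, value in ordered)
    return f'[tag="{pos}.*{body}.*"]'


def component_query(pos, features):
    """Build a query that tests each tag component separately."""
    tests = [f'tag="{pos}"']
    tests += [f'tag="{name}={value}"' for name, value in features.items()]
    return "[" + " & ".join(tests) + "]"


pos, feats = parse_ud_tag("NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing")
wanted = {"Gender": feats["Gender"], "Case": feats["Case"]}
print(string_query(pos, wanted))
print(component_query(pos, wanted))
```

The component-style query does not depend on the order in which the features are listed, which makes it the less error-prone choice when the tag layout is uncertain.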

Tag formats specified in the tagset descriptions may differ from those actually used in the corpus in some other languages as well. If you are not sure, please check the tag format before making a tag query. All tags used in the corpus for a given language can be listed via the column Tags in the corpus in the table above. Alternatively, on a page displaying results, open the View/Corpus-specific settings… menu, check the tag option in the Positional attributes box and choose the for each token option in the Viewing options box.

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. Only the individual parts of a contracted form are assigned a tag and a lemma, so a query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
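As a small illustration of the advice on contracted forms, the following Python sketch turns a contracted form into the space-separated Phrase input, and into an equivalent sequence of CQL positions. The split table contains only the four examples mentioned above; the helper names are hypothetical, and the use of the word attribute in the CQL variant is an assumption about the corpus configuration.

```python
# Splits as described in the text; extend by hand for other forms.
SPLITS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_query(form, splits=SPLITS):
    """Return the text to type into the Phrase query field."""
    return " ".join(splits.get(form, [form]))


def cql_sequence(form, splits=SPLITS):
    """Return one CQL position per token part (assumes a 'word' attribute)."""
    return " ".join(f'[word="{part}"]' for part in splits.get(form, [form]))


print(phrase_query("can't"))
print(cql_sequence("gdybyś"))
```

Forms that are not in the table are passed through unchanged, so the helpers can be applied to ordinary words as well.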

Morphological tags including characters with a special meaning in regular expressions, e.g. $ in the English tag wp$, must be preceded in queries by a backslash: tag="wp\$".
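When queries are assembled by a script, the escaping can be automated. The sketch below uses Python's standard re.escape, on the assumption that the characters it escapes cover those that are special in the corpus query regular expressions; tag_condition is a hypothetical helper name.

```python
import re


def tag_condition(tag):
    """Return a tag condition that matches `tag` literally."""
    # re.escape puts a backslash before regex metacharacters such as $.
    return f'tag="{re.escape(tag)}"'


print(tag_condition("wp$"))
```

For the English tag wp$ this yields the same condition as the one given above, tag="wp\$".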

Structural attributes

Structure  Attribute             Description                  Values
doc        doc.id                document identifier          author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP_year / _SUBTITLES / _SYNDICATE_year / _OT / _NT
text       text.id               text identifier              author's_last_name-shortened_title:0 / _ACQUIS:number / _EUROPARL:number / _PRESSEUROP:number / _SUBTITLES:number / _SYNDICATE_year:name / _OT:book / _NT:book
           text.author           author                       last name, first name
           text.title            full title                   text
           text.lang             language                     ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
           text.version          version                      number
           text.group            core/collection              Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible
           text.publisher        publisher                    text
           text.pubplace         publication place            text
           text.pubDateYear      publication year             number
           text.pubDateMonth     publication month            number
           text.origyear         original creation year       number
           text.isbn             ISBN                         number
           text.txtype           text type                    discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious
           text.comment          comment                      text
           text.original         original version?            Yes / No
           text.srclang          language of the original     ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
           text.translator       translator                   last name, first name
           text.transsex         translator's gender          F / M
           text.authsex          author's gender              F / M
           text.transcomment     translation comment          text
           text.collectiontitle  collection title             text
           text.volume           volume number                number
           text.pages            number of pages              number
           text.lang_var         language variety             de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-RS
           text.wordcount        number of words              number
div        div.id                division identifier (Bible)  _NT / _OT:chapter
           div.type              division type                chapter
p          p.id                  paragraph identifier         doc:text:div:par
s          s.id                  sentence identifier          doc:text:div:par:sent
hi         hi.rend               typeface                     italic / bold / bold italic
lb         lb.id                 verse identifier (Bible)     book:chapter:verse
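
The s.id and p.id patterns in the table above (doc:text:div:par:sent and doc:text:div:par) suggest that a sentence identifier can be taken apart by splitting on colons. The Python sketch below assumes that none of the five components itself contains a colon; the sample identifier is made up for illustration and is not taken from the corpus.

```python
from typing import NamedTuple


class SentenceId(NamedTuple):
    doc: str
    text: str
    div: str
    par: str
    sent: str


def parse_sentence_id(s_id):
    """Split an s.id of the form doc:text:div:par:sent into its parts."""
    parts = s_id.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated parts, got {len(parts)}")
    return SentenceId(*parts)


def paragraph_id(sentence):
    """Rebuild the enclosing p.id (doc:text:div:par) from a parsed s.id."""
    return ":".join(sentence[:4])


# Hypothetical identifier, used only to show the shape of the result.
example = parse_sentence_id("Author-Title:0:2:15:3")
print(example.doc, example.sent, paragraph_id(example))
```

Keeping the parts in a named tuple makes it easy to group search results by document, division or paragraph.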

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing:

  • Parallel text editor InterText by Pavel Vondřička
  • Aligner Hunalign
  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). The InterCorp Corpus – Czech 3), version 15 of 11 November 2022. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/

See also

1) Ljubešić, N., Klubička, F., Agić, Ž., and Jazbec, I.-P. (2016). New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Calzolari, N. et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
2) Ljubešić, N. and Erjavec, T. (2016). Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene. In Calzolari, N. et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
3) Insert languages actually used.