InterCorp Release 16

| Group | Name | Czech – core | Czech – collections | other – core | other – collections |
| --- | --- | --- | --- | --- | --- |
| Positions | Number of tokens | 154 512 254 | 363 685 460 | 464 653 933 | 5 840 602 221 |
| Positions | Number of word forms | 124 679 582 | 272 862 335 | 386 728 679 | 4 505 550 764 |
| Structural attributes | Number of documents | 1 812 | 33 | 4 643 | 338 |
| Structural attributes | Number of texts | 1 812 | 162 612 | 4 643 | 2 662 665 |
| Structural attributes | Number of sentences | 10 691 339 | 50 729 559 | 28 684 678 | 790 046 584 |

| Further information | Value |
| --- | --- |
| reference | YES |
| representative | NO |
| publication date | 2023 |
| foreign languages | 61 |
| tagged languages | 27 |
| lemmatized languages | 25 |

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English version exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files containing shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release the corpus may grow in size, and possibly also in the number of languages and in the extent and quality of annotation. Previous versions remain available (starting with release 6).

Texts in the corpus

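The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called collections. The present release includes the following collections, named here as in the corpus size table below: Syndicate, Presseurop, Acquis, Europarl, Subtitles and Bible.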

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; the omitted texts include those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The available part of InterCorp in release 16, published in October 2023, contains 387 million words in the aligned foreign-language texts of the core and 4 506 million words in the collections. The Czech texts contain 125 million words in the core and 273 million words in the collections (see Version history). The following charts show the shares of the core and the collections in the corpus, in millions of words.

Setup of the parallel corpus – the core and collections


Setup of the parallel corpus – the core


Setup of the parallel corpus – collections

Corpus size in thousands of words

| Language | Core | Syndicate | Presseurop | Acquis | Europarl | Subtitles | Bible | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| af Afrikaans | 0 | 0 | 0 | 0 | 0 | 136 | 0 | 136 |
| ar Arabic | 34 | 384 | 0 | 0 | 0 | 126 157 | 0 | 126 576 |
| be Belarusian | 7 131 | 0 | 0 | 0 | 0 | 0 | 0 | 7 131 |
| bg Bulgarian | 7 068 | 0 | 0 | 13 577 | 9 083 | 165 092 | 0 | 194 820 |
| bn Bengali | 0 | 0 | 0 | 0 | 0 | 1 554 | 0 | 1 554 |
| br Breton | 0 | 0 | 0 | 0 | 0 | 98 | 0 | 98 |
| bs Bosnian | 0 | 0 | 0 | 0 | 0 | 58 758 | 0 | 58 758 |
| ca Catalan | 10 112 | 0 | 0 | 0 | 0 | 2 735 | 736 | 13 582 |
| cs Czech | 124 680 | 4 717 | 2 312 | 19 214 | 12 917 | 233 139 | 563 | 397 542 |
| da Danish | 9 548 | 0 | 0 | 20 313 | 13 916 | 71 825 | 657 | 116 259 |
| de German | 40 679 | 5 067 | 2 483 | 20 610 | 13 089 | 98 566 | 724 | 181 219 |
| el Greek | 0 | 0 | 0 | 23 853 | 15 404 | 162 561 | 0 | 201 818 |
| en English | 42 395 | 5 273 | 2 670 | 22 902 | 15 576 | 280 335 | 730 | 369 882 |
| eo Esperanto | 0 | 0 | 0 | 0 | 0 | 226 | 0 | 226 |
| es Spanish | 30 661 | 6 074 | 2 859 | 26 262 | 16 249 | 223 134 | 0 | 305 240 |
| et Estonian | 79 | 0 | 0 | 14 896 | 10 899 | 54 514 | 0 | 80 388 |
| eu Basque | 0 | 0 | 0 | 0 | 0 | 3 022 | 0 | 3 022 |
| fa Persian | 0 | 0 | 0 | 0 | 0 | 33 167 | 0 | 33 167 |
| fi Finnish | 6 959 | 0 | 0 | 15 269 | 10 108 | 90 471 | 543 | 123 349 |
| fr French | 24 361 | 5 896 | 3 046 | 26 200 | 17 179 | 181 433 | 764 | 258 879 |
| gl Galician | 0 | 0 | 0 | 0 | 0 | 623 | 0 | 623 |
| he Hebrew | 0 | 0 | 0 | 0 | 0 | 130 143 | 0 | 130 143 |
| hi Hindi | 409 | 0 | 0 | 0 | 0 | 432 | 0 | 841 |
| hr Croatian | 24 529 | 0 | 0 | 0 | 0 | 137 966 | 571 | 163 066 |
| hs Upper Sorbian | 466 | 0 | 0 | 0 | 0 | 0 | 0 | 466 |
| hu Hungarian | 6 921 | 8 | 0 | 17 852 | 12 198 | 141 691 | 0 | 178 670 |
| hy Armenian | 0 | 0 | 0 | 0 | 0 | 24 | 0 | 24 |
| id Indonesian | 0 | 0 | 0 | 0 | 0 | 38 343 | 0 | 38 343 |
| is Icelandic | 0 | 0 | 0 | 0 | 0 | 7 375 | 0 | 7 375 |
| it Italian | 18 086 | 1 389 | 2 747 | 23 771 | 15 494 | 163 622 | 684 | 225 793 |
| ja Japanese | 3 818 | 2 | 0 | 0 | 0 | 12 485 | 0 | 16 305 |
| ka Georgian | 0 | 0 | 0 | 0 | 0 | 889 | 0 | 889 |
| kk Kazakh | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 14 |
| ko Korean | 0 | 0 | 0 | 0 | 0 | 5 980 | 0 | 5 980 |
| lt Lithuanian | 696 | 0 | 0 | 17 316 | 11 213 | 5 269 | 471 | 34 964 |
| lv Latvian | 3 636 | 0 | 0 | 17 533 | 11 682 | 2 053 | 537 | 35 441 |
| mk Macedonian | 8 881 | 0 | 0 | 0 | 0 | 15 595 | 0 | 24 476 |
| ml Malayalam | 0 | 0 | 0 | 0 | 0 | 1 281 | 0 | 1 281 |
| ms Malay | 0 | 0 | 0 | 0 | 0 | 7 939 | 0 | 7 939 |
| mt Maltese | 0 | 0 | 0 | 13 935 | 0 | 0 | 0 | 13 935 |
| nl Dutch | 18 782 | 812 | 2 953 | 23 416 | 15 558 | 170 979 | 717 | 233 217 |
| no Norwegian | 8 221 | 0 | 0 | 0 | 0 | 39 807 | 724 | 48 752 |
| pl Polish | 28 597 | 0 | 2 380 | 19 604 | 12 817 | 169 498 | 583 | 233 480 |
| pt Portuguese | 7 285 | 739 | 2 782 | 24 598 | 15 193 | 229 515 | 706 | 280 818 |
| rn Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
| ro Romanian | 4 219 | 0 | 2 738 | 8 092 | 9 446 | 212 396 | 0 | 236 890 |
| ru Russian | 12 387 | 4 302 | 0 | 0 | 0 | 104 609 | 565 | 121 864 |
| si Sinhala | 0 | 0 | 0 | 0 | 0 | 2 346 | 0 | 2 346 |
| sk Slovak | 8 586 | 0 | 0 | 18 399 | 12 727 | 34 581 | 561 | 74 854 |
| sl Slovene | 4 636 | 0 | 0 | 18 515 | 12 241 | 83 000 | 0 | 118 392 |
| sq Albanian | 0 | 0 | 0 | 0 | 0 | 9 351 | 0 | 9 351 |
| sr Serbian | 12 706 | 0 | 0 | 0 | 0 | 152 636 | 0 | 165 342 |
| sv Swedish | 19 740 | 0 | 0 | 19 542 | 13 784 | 81 548 | 638 | 135 252 |
| ta Tamil | 0 | 0 | 0 | 0 | 0 | 104 | 0 | 104 |
| te Telugu | 0 | 0 | 0 | 0 | 0 | 96 | 0 | 96 |
| th Thai | 0 | 0 | 0 | 0 | 0 | 5 660 | 0 | 5 660 |
| tl Tagalog | 0 | 0 | 0 | 0 | 0 | 38 | 0 | 38 |
| tr Turkish | 0 | 0 | 0 | 0 | 0 | 149 892 | 0 | 149 892 |
| uk Ukrainian | 14 849 | 0 | 0 | 0 | 0 | 2 938 | 596 | 18 382 |
| ur Urdu | 0 | 0 | 0 | 0 | 0 | 158 | 0 | 158 |
| vi Vietnamese | 0 | 0 | 0 | 0 | 0 | 22 298 | 0 | 22 298 |
| zh Chinese | 238 | 838 | 0 | 0 | 0 | 71 331 | 0 | 72 407 |
| TOTAL | 511 408 | 35 503 | 26 971 | 425 670 | 276 772 | 4 001 428 | 12 069 | 5 289 821 |

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
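The Total column can be cross-checked against the other columns. The sketch below does this for three rows copied from the table (Arabic, Hungarian, Chinese). The tolerance of 4 is an assumption, not something the documentation states: it allows for small differences that presumably arise because each cell is rounded to thousands separately.

```python
# Rows copied from the size table, in thousands of words:
# (core, syndicate, presseurop, acquis, europarl, subtitles, bible, total)
ROWS = {
    "ar": (34, 384, 0, 0, 0, 126157, 0, 126576),
    "hu": (6921, 8, 0, 17852, 12198, 141691, 0, 178670),
    "zh": (238, 838, 0, 0, 0, 71331, 0, 72407),
}


def total_gap(row):
    """Return the stated total minus the sum of the component columns."""
    *parts, total = row
    return total - sum(parts)


def suspicious_rows(rows, tolerance=4):
    """Return the rows whose gap exceeds the (assumed) rounding tolerance."""
    return {
        lang: total_gap(row)
        for lang, row in rows.items()
        if abs(total_gap(row)) > tolerance
    }


for lang, row in ROWS.items():
    print(lang, total_gap(row))
print(suspicious_rows(ROWS))
```

A row reported by `suspicious_rows` would most likely point to a cell boundary read incorrectly, since the thousands separator in the table is a space.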

Number of texts in the Core

| Language | Number of texts | including originals |
| --- | --- | --- |
| ar Arabic | 3 | 1 |
| be Belarusian | 108 | 14 |
| bg Bulgarian | 87 | 19 |
| ca Catalan | 92 | 1 |
| cs Czech | 1 812 | 368 |
| da Danish | 93 | 9 |
| de German | 471 | 163 |
| en English | 422 | 271 |
| es Spanish | 355 | 142 |
| et Estonian | 1 | 0 |
| fi Finnish | 112 | 36 |
| fr French | 277 | 126 |
| hi Hindi | 7 | 2 |
| hr Croatian | 324 | 37 |
| hs Upper Sorbian | 13 | 5 |
| hu Hungarian | 89 | 1 |
| it Italian | 171 | 26 |
| ja Japanese | 35 | 15 |
| lt Lithuanian | 23 | 4 |
| lv Latvian | 73 | 15 |
| mk Macedonian | 108 | 4 |
| nl Dutch | 215 | 52 |
| no Norwegian | 102 | 23 |
| pl Polish | 348 | 54 |
| pt Portuguese | 87 | 24 |
| rn Romani | 2 | 2 |
| ro Romanian | 45 | 5 |
| ru Russian | 160 | 37 |
| sk Slovak | 165 | 62 |
| sl Slovene | 73 | 25 |
| sr Serbian | 148 | 13 |
| sv Swedish | 232 | 101 |
| uk Ukrainian | 199 | 8 |
| zh Chinese | 3 | 3 |
| TOTAL | 6 455 | 1 668 |

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation. The format, and often even the meaning, of the categories encoded in the morphosyntactic tags differs from language to language. For each tagged language we therefore provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface.

Language Tags Lemmas Brief description Detailed description Tags in the corpus Tool
Belarusian in English****) in English****) list UDPipe
Bulgarian in English in English list TreeTagger
Catalan in English list TreeTagger
Chinese in English in English list ZPar v0.7.5
Croatian in English in English list ReLDI Tagger
Czech in Czech and English in English list Morče
Dutch in English list TreeTagger
English in English in English + additions list TreeTagger
Estonian in Estonian and English list TreeTagger
Finnish in English*) in English*) list OMorFi +HunPOS
French in English list TreeTagger
German in English **) in German list RFTagger
Hungarian in English list RFTagger
Icelandic in English in English list IceStagger
Italian in English list TreeTagger
Japanese in English list MeCab + Unidic
Latvian in Latvian list LVTagger
Norwegian in English****) in English****) list UDPipe
Polish in English and Polish in English list Morfeusz, KRNNT
Portuguese in Spanish list TreeTagger
Russian in English in English ***) list TreeTagger
Slovak in Slovak and English in Slovak list Radovan Garabík, Morče
Slovene in English list ReLDI Tagger
Serbian in English in English list ReLDI Tagger
Spanish in English list TreeTagger
Swedish in Swedish and English list Stagger
Ukrainian in English****) in English****) list UDPipe

*) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].

**) Within a single morphological tag, a colon rather than a period is used to separate the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.

***) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.

****) The tag is in the UD (Universal Dependencies) format, with the components of the tag separated by a vertical bar (|). For example, the form школы in the genitive singular is tagged as NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing. The query can be specified in the same way as for other languages, treating the tag as a string, i.e. [tag="NOUN.*Case=Gen\|Gender=Fem.*"], or the tag components can be specified separately: [tag="Case=Gen" & tag="NOUN" & tag="Gender=Fem"] (the order of categories is not significant). The result is identical in either case.
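The structure of a UD-style tag can be illustrated with a short Python sketch. It splits the example tag from the note above into its part of speech and features, and assembles the component-wise query quoted there. The function names are hypothetical helpers, not part of KonText, and the sketch assumes that every component after the first has the form Name=Value.

```python
def parse_ud_tag(tag):
    """Split a UD-style tag into (part of speech, {feature: value})."""
    pos, *components = tag.split("|")
    # Assumption: feature components look like Name=Value; anything else is skipped.
    features = dict(c.split("=", 1) for c in components if "=" in c)
    return pos, features


def components_query(*components):
    """Build one CQL position with a separate tag condition per component."""
    return "[" + " & ".join(f'tag="{c}"' for c in components) + "]"


pos, features = parse_ud_tag("NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing")
query = components_query("Case=Gen", "NOUN", "Gender=Fem")
print(pos, features)
print(query)
```

Building the query from separate components avoids having to know the order in which the features appear inside the tag string.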

In some other languages, too, the tag formats specified in the tagset descriptions differ from those actually used in the corpus. If you are not sure, please check the tag format before making a tag query. To list all tags used in the corpus for a given language, see the column Tags in the corpus in the table above. Alternatively, on a page displaying results, open the View / Corpus-specific settings… menu, check the tag option in the Positional attributes box and choose the for each token option in the Viewing options box.

Queries for contracted forms in tagged or lemmatized texts may fail. Forms such as can't or I'm are split by the tagger into two parts (ca + n't and I + 'm), and only the individual parts are assigned a lemma and a tag. The same applies to Polish forms such as byłam or gdybyś (była + m and gdyby + ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. A query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
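As a minimal sketch of the advice above, the following Python helpers turn the split parts of a contracted form into the text to type as a Phrase query, and into a CQL token sequence. Both function names are hypothetical. The CQL form assumes that the positional attribute holding the word form is called word and that a sequence of positions behaves like the Phrase query; neither is confirmed by this page.

```python
def phrase_input(parts):
    """Text to type into a Phrase query: the split parts separated by spaces."""
    return " ".join(parts)


def cql_sequence(parts, attribute="word"):
    """CQL sequence with one position per split part (attribute name is assumed)."""
    return " ".join(f'[{attribute}="{part}"]' for part in parts)


print(phrase_input(["ca", "n't"]))     # English can't
print(cql_sequence(["była", "m"]))     # Polish byłam
```

The split itself has to be taken from the tokenization used in the corpus, as in the examples above; the helpers do not try to guess it.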

When a morphological tag contains a character with a special meaning in regular expressions, e.g. $ in the English tag wp$, that character must be preceded by a backslash in queries: tag="wp\$".
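When queries are generated by a script, the escaping can be automated. The sketch below uses Python's standard re.escape; the helper name tag_condition is hypothetical. Note the assumption: re.escape follows Python's own regular expression syntax, which may not match the query engine's syntax in every detail, so the result should be checked for tags with unusual characters.

```python
import re


def tag_condition(tag):
    """Build a CQL tag condition that matches the given tag literally."""
    # re.escape puts a backslash before characters that are special in
    # Python regular expressions (an approximation of the engine's rules).
    return f'tag="{re.escape(tag)}"'


print(tag_condition("wp$"))
print(tag_condition("NN"))
```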

Structural attributes

| Structure | Attribute | Description | Values |
| --- | --- | --- | --- |
| doc | doc.id | document identifier | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP_year / _SUBTITLES / _SYNDICATE_year / _OT / _NT |
| text | text.id | text identifier | author's_last_name-shortened_title:0 / _ACQUIS:number / _EUROPARL:number / _PRESSEUROP:number / _SUBTITLES:number / _SYNDICATE_year:name / _OT:book / _NT:book |
| text | text.author | author | last name, first name |
| text | text.title | full title | text |
| text | text.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh |
| text | text.version | version | number |
| text | text.group | core/collection | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible |
| text | text.publisher | publisher | text |
| text | text.pubplace | publication place | text |
| text | text.pubDateYear | publication year | number |
| text | text.pubDateMonth | publication month | number |
| text | text.origyear | original creation year | number |
| text | text.isbn | ISBN | number |
| text | text.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious |
| text | text.comment | comment | text |
| text | text.original | original version? | Yes / No |
| text | text.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh |
| text | text.translator | translator | last name, first name |
| text | text.transsex | translator's gender | F / M |
| text | text.authsex | author's gender | F / M |
| text | text.transcomment | translation comment | text |
| text | text.collectiontitle | collection title | text |
| text | text.volume | volume number | number |
| text | text.pages | number of pages | number |
| text | text.lang_var | language variety | de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-RS |
| text | text.wordcount | number of words | number |
| div | div.id | division identifier (Bible) | _NT / _OT:chapter |
| div | div.type | division type | chapter |
| p | p.id | paragraph identifier | doc:text:div:par |
| s | s.id | sentence identifier | doc:text:div:par:sent |
| hi | hi.rend | typeface | italic / bold / bold italic |
| lb | lb.id | verse identifier (Bible) | book:chapter:verse |
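Structural attributes can be used to restrict a query to texts with given metadata. The Python sketch below builds such a restriction with the common CQL construct `within <structure attribute="value" />`. It is an illustration under assumptions: the helper within_text is hypothetical, the example lemma is arbitrary, and the page itself does not confirm that the interface accepts this construct for every attribute. The attribute names are those of the text structure listed above.

```python
# Attributes of the "text" structure, as listed in the table above
# (without the "text." prefix).
TEXT_ATTRIBUTES = {
    "id", "author", "title", "lang", "version", "group", "publisher",
    "pubplace", "pubDateYear", "pubDateMonth", "origyear", "isbn", "txtype",
    "comment", "original", "srclang", "translator", "transsex", "authsex",
    "transcomment", "collectiontitle", "volume", "pages", "lang_var",
    "wordcount",
}


def within_text(query, attribute, value):
    """Restrict a CQL query to texts whose attribute has the given value."""
    if attribute not in TEXT_ATTRIBUTES:
        raise ValueError(f"unknown text attribute: {attribute}")
    return f'{query} within <text {attribute}="{value}" />'


print(within_text('[lemma="house"]', "txtype", "fiction"))
```

Checking the attribute name against the documented list catches typos before the query is sent.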

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

Taggers/lemmatizers:

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). The InterCorp Corpus – Czech3), version 16 of 11 November 2022. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/

See also

1) Ljubešić, N., Klubička, F., Agić, Ž., and Jazbec, I.-P. (2016). New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Calzolari, N. et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
2) Ljubešić, N. and Erjavec, T. (2016). Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene. In Calzolari, N. et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
3) Insert languages actually used.