InterCorp Release 16ud – Universal Dependencies

Name | Czech – core | Czech – collections | other – core | other – collections
Positions: number of tokens | 154 391 397 | 362 409 841 | 461 601 109 | 5 732 688 636
Positions: number of word forms | 124 681 856 | 272 671 041 | 385 829 717 | 4 473 418 338
Structural attributes: number of documents | 1 812 | 33 | 4 643 | 338
Structural attributes: number of texts | 1 812 | 162 613 | 4 643 | 2 662 675
Structural attributes: number of sentences | 10 691 340 | 50 729 559 | 28 684 709 | 790 046 584
Further information: reference | YES
Further information: representative | NO
Further information: publication date | 2024
Further information: foreign languages | 61
Further information: tagged languages | 47
Further information: lemmatized languages | 47
Further information: syntactically annotated languages | 47

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files containing shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus may grow, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6).

Main features of release 16ud

Texts in the corpus

InterCorp release 16ud contains the same texts as InterCorp release 16. They differ only in linguistic annotation. However, the token and word count data in 16ud may differ slightly due to a different tokenization method.

The core of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually checked. The other texts, grouped in collections, are aligned automatically without human intervention. The collections in the present release, as named in the tables below, are Acquis, Bible, Europarl, PressEurop, Subtitles and Syndicate.

In texts aligned automatically without manual checking, the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16ud, published in September 2024, is 4 859 mil. words. This number includes 386 mil. words in the aligned foreign-language texts in the core part and 4 473 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see Version history). The share of the core and the collections in the corpus is shown in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections


Setup of the parallel corpus – the core


Setup of the parallel corpus – collections
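
The rounded word counts quoted in this section can be recomputed from the word-form row of the summary table at the top of the page. The following is a minimal sketch; the dictionary layout and the helper name `millions` are illustrative choices, not part of any InterCorp tooling.

```python
# Word-form counts copied from the summary table at the top of the page.
WORD_FORMS = {
    ("Czech", "core"): 124_681_856,
    ("Czech", "collections"): 272_671_041,
    ("other", "core"): 385_829_717,
    ("other", "collections"): 4_473_418_338,
}


def millions(count: int) -> int:
    """Round a raw count to whole millions (hypothetical helper)."""
    return round(count / 1_000_000)


# Foreign-language texts: core plus collections.
foreign_total = WORD_FORMS[("other", "core")] + WORD_FORMS[("other", "collections")]
# Czech texts: core plus collections.
czech_total = WORD_FORMS[("Czech", "core")] + WORD_FORMS[("Czech", "collections")]
# Whole corpus, all four parts.
grand_total = sum(WORD_FORMS.values())

for (language, part), count in WORD_FORMS.items():
    print(f"{language:>5} {part:<11} {millions(count):>5} mil. words")
print(f"foreign total: {millions(foreign_total)} mil. words")
print(f"Czech total:   {millions(czech_total)} mil. words")
```

The grand total of the four parts agrees with the TOTAL row of the size tables below, which give the same quantity in thousands of words.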

The corpus in numbers

Number of texts in the Core

Language Number of texts including originals
ar Arabic 3 1
be Belarusian 108 14
bg Bulgarian 87 19
ca Catalan 92 1
cs Czech 1 812 368
da Danish 93 9
de German 471 163
en English 422 271
es Spanish 355 142
et Estonian 1 0
fi Finnish 112 36
fr French 277 126
hi Hindi 7 2
hr Croatian 324 37
hs Upper Sorbian 13 5
hu Hungarian 89 1
it Italian 171 26
ja Japanese 35 15
lt Lithuanian 23 4
lv Latvian 73 15
mk Macedonian 108 4
nl Dutch 215 52
no Norwegian 102 23
pl Polish 348 54
pt Portuguese 87 24
rn Romani 2 2
ro Romanian 45 5
ru Russian 160 37
sk Slovak 165 62
sl Slovene 73 25
sr Serbian 148 8
sv Swedish 232 101
uk Ukrainian 199 8
zh Chinese 3 3
TOTAL 6 455 1 668

In the tables below, the Core part of the corpus is split according to text type into fiction (Core-fiction), non-fiction (Core-nonfiction), and miscellaneous (Core-misc, including drama, poetry or children's literature).

Corpus size by collection

Collection Number of Thousands of
docs texts sentences words tokens
Core-fiction 5 879 5 879 37 270 473 208 572 187
Core-misc 226 226 623 7 853 9 424
Core-nonfiction 350 350 1 483 29 450 34 381
Acquis 22 380 049 28 903 424 874 531 415
Bible 38 1 252 899 12 050 14 405
Europarl 21 1 369 378 13 709 276 543 315 134
PressEurop 70 69 894 1 637 26 964 31 538
Subtitles 58 965 557 793 931 3 970 273 5 162 184
Syndicate 162 39 158 1 697 35 385 40 423
TOTAL 6 826 2 831 743 880 152 5 256 601 6 711 091
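
To make the proportions in the table above easier to read, the sketch below converts the "words" column into percentage shares. The figures are copied from the table (thousands of words); the variable and function names are illustrative only.

```python
# Thousands of words per collection, copied from the table above.
WORDS_THOUSANDS = {
    "Core-fiction": 473_208,
    "Core-misc": 7_853,
    "Core-nonfiction": 29_450,
    "Acquis": 424_874,
    "Bible": 12_050,
    "Europarl": 276_543,
    "PressEurop": 26_964,
    "Subtitles": 3_970_273,
    "Syndicate": 35_385,
}

# The rows sum to 5 256 600; the printed TOTAL (5 256 601) differs by one
# because each row is rounded to whole thousands independently.
total = sum(WORDS_THOUSANDS.values())


def share(collection: str) -> float:
    """Percentage of all words that belong to one collection."""
    return 100 * WORDS_THOUSANDS[collection] / total


# Combined share of the manually checked core.
core_share = sum(share(name) for name in WORDS_THOUSANDS if name.startswith("Core-"))

for name in sorted(WORDS_THOUSANDS, key=share, reverse=True):
    print(f"{name:<16} {share(name):5.1f} %")
print(f"{'Core (all)':<16} {core_share:5.1f} %")
```

By this calculation Subtitles account for roughly three quarters of all words, while the three Core parts together make up slightly under a tenth.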

Corpus size by language

Lang Number of Thousands of
docs texts sentences words tokens
af 1 24 23.0 134.6 161.7
ar 7 34 629 28 748.8 126 614.3 157 671.0
be 108 108 632.7 7 126.4 9 054.9
bg 90 97 190 34 421.2 194 375.7 250 957.1
bn 1 252 363.8 1 517.7 2 072.1
br 1 27 19.7 97.4 145.2
bs 1 14 208 12 165.3 56 465.9 75 945.3
ca 95 828 1 201.8 13 381.4 15 617.1
cs 1 845 164 425 61 420.9 397 352.9 516 801.2
da 98 101 609 16 583.0 115 590.0 146 193.4
de 504 115 755 23 827.8 181 773.9 229 774.0
el 3 125 684 33 174.5 200 922.9 254 776.7
en 455 157 490 54 572.6 357 080.3 449 890.9
eo 1 46 48.4 221.0 305.4
es 386 150 798 45 280.2 305 112.0 388 664.2
et 4 100 709 13 904.0 80 349.3 104 726.8
eu 1 652 732.9 2 999.9 4 039.0
fa 1 6 556 6 594.8 32 635.9 38 097.3
fi 117 116 660 25 976.1 123 357.7 165 696.1
fr 310 138 571 33 957.7 258 555.1 315 325.2
gl 1 146 121.7 622.1 797.9
he 1 33 935 27 608.8 129 458.6 172 973.7
hi 8 61 116.6 832.7 988.1
hr 327 35 447 30 758.6 162 943.8 208 413.5
hs 13 13 41.6 466.3 586.3
hu 95 125 933 34 510.0 178 525.6 240 411.9
hy 1 7 3.9 23.5 30.6
id 1 8 350 8 112.7 37 824.9 49 694.7
is 1 1 135 1 497.9 7 374.2 9 299.9
it 194 134 401 33 361.2 226 224.9 286 343.4
ja 37 2 363 2 296.7 16 138.6 18 020.3
ka 1 204 198.4 871.1 1 179.0
kk 1 4 4.1 13.9 19.2
ko 1 1 605 1 641.1 5 964.3 7 294.3
lt 28 87 642 3 622.1 34 786.3 45 134.4
lv 78 86 356 3 023.6 35 425.1 45 293.5
mk 109 3 541 3 907.8 23 993.1 30 898.6
ml 1 285 365.3 1 258.4 1 793.5
ms 1 1 496 1 712.1 7 828.0 10 573.3
mt 1 8 963 784.8 13 805.0 16 643.6
nl 232 132 791 33 065.4 233 111.3 284 402.6
no 105 9 163 8 344.6 48 750.2 61 120.3
pl 360 140 055 41 282.4 227 242.6 300 207.8
pt 107 147 063 46 510.1 280 566.2 355 121.8
rn 2 2 1.7 13.6 17.7
ro 55 102 904 39 561.2 235 702.3 295 301.3
ru 184 32 839 22 985.2 122 130.4 163 120.7
si 1 499 522.5 2 313.4 3 021.8
sk 170 94 585 10 080.0 74 862.7 95 881.0
sl 76 104 460 20 501.3 118 457.1 155 788.9
sq 1 1 575 1 769.0 9 171.4 12 098.4
sr 149 38 177 32 117.7 165 130.2 211 727.6
sv 237 104 739 19 113.9 135 088.4 164 715.5
ta 1 20 29.4 104.0 141.8
te 1 18 26.0 96.0 127.1
th 1 3 932 3 457.0 5 626.0 7 288.3
tl 1 5 8.0 37.0 52.7
tr 1 44 015 35 975.7 147 635.3 199 108.2
uk 202 1 271 2 138.0 19 225.4 24 818.3
ur 1 19 27.0 155.7 180.8
vi 1 3 468 3 304.5 19 281.4 23 984.0
zh 9 12 035 11 993.7 71 855.3 80 560.0
TOTAL 6 826 2 831 743 880 152.2 5 256 601.0 6 711 091.0

Corpus size in thousands of words by language and collection

Lang Core-fiction Core-misc Core-nonfiction Acquis Bible Europarl PressEurop Subtitles Syndicate TOTAL
af 134.6 134.6
ar 28.8 5.5 126 195.5 384.5 126 614.3
be 7 068.7 57.7 7 126.4
bg 7 067.3 13 582.3 9 082.0 164 644.1 194 375.7
bn 1 517.7 1 517.7
br 97.4 97.4
bs 56 465.9 56 465.9
ca 9 951.3 9.7 728.2 2 692.1 13 381.4
cs 113 632.3 2 637.1 8 412.5 19 188.9 562.5 12 918.7 2 313.3 232 969.1 4 718.6 397 352.9
da 9 460.8 11.9 56.0 20 014.9 655.2 13 800.4 71 590.8 115 590.0
de 35 653.3 1 066.1 4 037.3 20 716.9 725.0 13 156.2 2 506.5 98 808.9 5 103.7 181 773.9
el 23 684.5 15 381.7 161 856.7 200 922.9
en 36 519.3 778.3 4 618.7 23 062.9 727.6 15 593.0 2 663.8 267 843.8 5 272.8 357 080.3
eo 221.0 221.0
es 29 664.1 165.1 830.9 26 269.3 16 248.5 2 857.8 223 006.0 6 070.2 305 112.0
et 78.8 14 884.2 10 898.7 54 487.7 80 349.3
eu 2 999.9 2 999.9
fa 32 635.9 32 635.9
fi 6 714.9 44.4 200.5 15 264.2 542.6 10 109.3 90 481.8 123 357.7
fr 20 454.4 194.3 3 687.5 26 298.4 762.6 17 186.4 3 044.3 181 033.4 5 893.7 258 555.1
gl 622.1 622.1
he 129 458.6 129 458.6
hi 402.8 429.9 832.7
hr 22 763.6 242.6 1 523.4 569.9 137 844.3 162 943.8
hs 405.3 36.6 24.4 466.3
hu 6 890.1 28.9 17 851.3 12 187.9 141 559.0 8.4 178 525.6
hy 23.5 23.5
id 37 824.9 37 824.9
is 7 374.2 7 374.2
it 17 435.8 50.6 647.8 23 892.0 685.2 15 511.4 2 750.7 163 859.9 1 391.5 226 224.9
ja 3 766.7 64.9 163.1 12 141.5 2.5 16 138.6
ka 871.1 871.1
kk 13.9 13.9
ko 5 964.3 5 964.3
lt 669.1 7.2 17.4 17 175.1 471.2 11 198.5 5 247.7 34 786.3
lv 3 207.6 362.1 66.9 17 519.4 536.7 11 682.0 2 050.4 35 425.1
mk 8 794.5 86.5 15 112.0 23 993.1
ml 1 258.4 1 258.4
ms 7 828.0 7 828.0
mt 13 805.0 13 805.0
nl 17 229.8 356.4 1 193.5 23 401.1 716.8 15 555.9 2 952.8 170 892.9 812.1 233 111.3
no 7 690.7 138.1 392.0 723.9 39 805.6 48 750.2
pl 27 056.2 283.2 754.2 19 482.9 576.1 12 662.8 2 367.5 164 059.8 227 242.6
pt 7 204.0 81.3 24 385.0 706.2 15 188.4 2 782.5 229 480.2 738.5 280 566.2
rn 8.4 5.2 13.6
ro 4 132.6 64.1 8 043.5 9 426.4 2 725.2 211 310.4 235 702.3
ru 11 757.6 143.8 518.7 565.5 104 831.9 4 312.8 122 130.4
si 2 313.4 2 313.4
sk 7 626.6 402.2 558.0 18 398.8 560.8 12 727.0 34 589.4 74 862.7
sl 4 611.2 6.1 22.4 18 510.4 12 249.8 83 057.1 118 457.1
sq 9 171.4 9 171.4
sr 12 556.0 29.3 119.3 152 425.6 165 130.2
sv 18 011.7 454.8 1 273.0 19 443.0 637.9 13 777.6 81 490.5 135 088.4
ta 104.0 104.0
te 96.0 96.0
th 5 626.0 5 626.0
tl 37.0 37.0
tr 147 635.3 147 635.3
uk 14 478.3 38.9 333.0 596.1 3 779.0 19 225.4
ur 155.7 155.7
vi 19 281.4 19 281.4
zh 215.4 70 963.9 675.9 71 855.3
TOTAL 473 208.2 7 852.9 29 450.5 424 874.2 12 050.1 276 542.6 26 964.4 3 970 272.9 35 385.2 5 256 601.0

Detailed statistics

In addition to the corpus size data, the table also includes measures of lexical diversity and syntactic complexity. For languages without linguistic annotation, the table shows only the word-form-based measure of lexical diversity (lexDivWord).

Lang Collection Number of Thousands of Lexical diversity Syntactic complexity (average)
docs texts sentences words tokens lexDivWord lexDivLemma sLength subRatio maxTreeDepth maxNPLength maxNPDepth mdd
afSubtitles 1 24 23.0 134.6 161.7 406.4 347.2 5.887 1.093 0.095 2.377 0.811 2.251
arCore-fiction 2 2 2.1 28.8 35.6 620.3 576.6 13.830 2.712 1.310 5.293 2.016 2.817
Core-misc 1 1 1.3 5.5 7.4 451.4 421.4 4.150 1.330 0.290 1.870 0.840 2.010
Subtitles 1 34 193 28 726.4 126 195.5 157 188.9 592.8 557.3 4.421 1.338 0.336 2.216 0.986 1.678
Syndicate 3 433 19.0 384.5 439.0 622.7 560.3 20.513 2.485 1.312 11.036 3.940 2.405
beCore-fiction 104 104 625.1 7 068.7 8 978.9 615.4 492.7 11.583 1.865 0.804 4.122 1.436 2.316
Core-misc 4 4 7.6 57.7 76.0 556.2 425.6 7.608 1.672 0.605 2.870 1.002 2.254
bgCore-fiction 87 87 559.6 7 067.3 8 597.7 548.3 439.5 13.125 1.728 0.732 4.255 1.532 2.497
Acquis 1 10 846 862.3 13 582.3 16 991.2 392.4 306.3 18.073 1.801 0.514 9.389 2.805 3.265
Europarl 1 45 271 408.3 9 082.0 10 379.8 498.4 386.3 23.014 2.538 1.263 10.961 3.402 2.581
Subtitles 1 40 986 32 591.1 164 644.1 214 988.4 518.2 384.6 5.089 1.336 0.322 1.861 0.706 1.931
bnSubtitles 1 252 363.8 1 517.7 2 072.1 419.4
brSubtitles 1 27 19.7 97.4 145.2 363.5
bsSubtitles 1 14 208 12 165.3 56 465.9 75 945.3 450.2
caCore-fiction 91 91 678.0 9 951.3 11 363.4 471.6 375.2 15.579 2.140 0.962 6.099 1.920 2.551
Core-misc 1 1 0.7 9.7 11.2 463.7 362.5 14.300 2.040 0.930 5.850 1.880 2.520
Bible 2 66 50.3 728.2 839.4 405.3 308.0 15.729 2.056 0.912 6.460 2.103 2.602
Subtitles 1 670 472.8 2 692.1 3 403.2 487.0 346.8 5.726 1.379 0.352 2.617 0.926 2.028
csCore-fiction 1 629 1 629 9 979.9 113 632.3 141 075.8 629.8 484.2 11.722 1.702 0.723 4.078 1.459 2.486
Core-nonfict 113 113 488.9 8 412.5 10 107.3 649.3 501.8 18.099 2.107 1.004 8.159 2.685 2.607
Core-misc 70 70 222.5 2 637.1 3 208.3 639.0 492.3 12.264 1.721 0.704 5.105 1.778 2.412
Acquis 1 19 269 1 351.5 19 188.9 25 140.4 472.1 346.5 16.575 1.745 0.536 9.788 2.858 3.025
Bible 2 66 51.0 562.5 692.9 537.1 372.0 11.907 1.603 0.635 4.125 1.590 2.451
Europarl 1 69 482 685.3 12 918.7 15 030.4 600.9 435.0 19.380 2.428 1.256 9.361 3.180 2.527
PressEurop 7 7 060 170.0 2 313.3 2 786.6 669.3 522.4 14.002 1.895 0.810 7.023 2.498 2.457
Subtitles 1 60 619 48 207.7 232 969.1 313 262.9 589.7 406.3 4.866 1.307 0.319 1.862 0.694 1.971
Syndicate 21 6 117 264.0 4 718.6 5 496.6 655.9 506.1 18.410 2.162 1.059 8.528 2.975 2.552
daCore-fiction 90 90 685.3 9 460.8 11 273.9 464.6 388.7 14.334 1.712 0.694 4.949 1.649 2.514
Core-nonfict 1 1 2.7 56.0 64.2 447.6 364.4 21.690 2.250 1.070 9.140 2.900 2.670
Core-misc 2 2 0.8 11.9 14.2 441.6 363.4 14.515 1.714 0.728 5.350 1.836 2.466
Acquis 1 18 263 1 566.7 20 014.9 25 402.6 395.0 333.1 14.462 1.647 0.485 8.314 2.491 2.762
Bible 2 66 46.1 655.2 782.3 389.8 318.7 18.349 1.970 0.843 5.542 1.828 2.811
Europarl 1 67 202 721.6 13 800.4 15 775.5 448.2 376.6 19.372 2.025 0.910 9.165 2.947 2.597
Subtitles 1 15 985 13 559.9 71 590.8 92 880.6 438.1 346.4 5.338 1.184 0.190 1.985 0.701 1.925
deCore-fiction 412 412 2 603.0 35 653.3 43 380.5 515.3 421.1 14.176 1.775 0.702 4.819 1.458 3.095
Core-nonfict 43 43 205.6 4 037.3 4 754.4 525.3 434.8 20.302 2.015 0.862 8.836 2.456 3.384
Core-misc 16 16 63.4 1 066.1 1 255.8 515.6 425.5 17.694 1.942 0.817 7.345 2.138 3.219
Acquis 1 18 782 1 451.4 20 716.9 26 206.7 407.9 343.3 16.124 1.506 0.388 9.197 2.496 3.519
Bible 2 66 49.2 725.0 854.0 395.2 302.4 15.637 1.648 0.657 5.263 1.737 2.998
Europarl 1 62 391 661.2 13 156.2 15 169.0 487.1 396.8 20.448 2.074 0.914 9.361 2.646 3.473
PressEurop 7 6 909 175.9 2 506.5 3 013.6 545.0 456.2 14.623 1.702 0.621 6.859 2.124 3.123
Subtitles 1 21 322 18 354.4 98 808.9 129 234.7 489.6 380.9 5.414 1.240 0.231 2.119 0.712 2.271
Syndicate 21 5 814 263.6 5 103.7 5 905.3 541.1 453.0 19.817 2.000 0.867 8.766 2.590 3.380
elAcquis 1 18 904 1 432.0 23 684.5 28 955.7 409.0 313.2 17.722 1.884 0.707 10.688 2.957 2.690
Europarl 1 68 069 623.6 15 381.7 17 233.2 488.5 366.9 25.498 2.664 1.379 12.485 3.413 2.682
Subtitles 1 38 711 31 118.9 161 856.7 208 587.8 516.7 376.8 6.335 1.613 0.519 2.566 0.881 2.083
enCore-fiction 366 366 2 701.0 36 519.3 43 557.4 466.2 403.2 14.159 2.107 0.945 5.371 1.689 2.576
Core-nonfict 39 39 216.2 4 618.7 5 302.9 466.7 412.4 22.976 2.623 1.292 10.373 2.893 2.793
Core-misc 17 17 53.4 778.3 905.9 455.8 393.7 15.091 2.160 0.967 6.561 1.987 2.557
Acquis 1 18 930 1 327.2 23 062.9 28 075.3 346.1 307.3 20.073 2.193 0.806 11.086 2.912 3.176
Bible 2 66 47.5 727.6 843.4 354.0 296.2 17.458 2.166 1.051 6.271 2.125 2.608
Europarl 1 69 283 680.9 15 593.0 17 455.0 411.9 362.9 23.743 2.692 1.402 11.274 3.135 2.736
PressEurop 7 7 019 152.5 2 663.8 3 107.7 485.4 431.4 18.016 2.286 1.033 8.828 2.614 2.689
Subtitles 1 55 657 49 130.9 267 843.8 344 553.0 445.1 362.4 5.491 1.401 0.372 2.273 0.811 2.067
Syndicate 21 6 113 263.1 5 272.8 6 090.3 494.2 438.7 20.792 2.447 1.186 9.516 2.843 2.733
eoSubtitles 1 46 48.4 221.0 305.4 384.4
esCore-fiction 338 338 1 981.3 29 664.1 34 294.8 495.7 400.3 15.586 2.176 0.974 6.243 1.919 2.574
Core-nonfict 10 10 29.6 830.9 932.4 446.5 361.9 29.055 2.939 1.456 13.399 3.468 2.797
Core-misc 7 7 15.0 165.1 198.8 475.1 370.6 11.674 1.781 0.662 4.887 1.575 2.382
Acquis 1 19 056 1 333.1 26 269.3 31 277.0 348.0 290.7 22.339 1.851 0.588 12.954 3.099 3.098
Europarl 1 67 754 660.7 16 248.5 18 032.0 437.6 353.4 25.496 2.614 1.350 12.798 3.348 2.618
PressEurop 7 6 891 154.6 2 857.8 3 268.7 478.0 399.9 18.995 2.144 0.940 9.483 2.729 2.567
Subtitles 1 50 705 40 849.5 223 006.0 293 901.1 498.6 355.5 5.499 1.404 0.373 2.378 0.862 1.972
Syndicate 21 6 037 256.4 6 070.2 6 759.4 462.1 384.1 24.411 2.437 1.189 11.558 3.194 2.675
etCore-fiction 1 1 6.7 78.8 96.1 626.8 478.0 11.790 2.020 0.920 4.200 1.540 2.530
Acquis 1 18 727 1 349.8 14 884.2 19 414.5 543.8 404.0 13.084 2.744 0.961 6.654 2.304 2.972
Europarl 1 68 478 704.3 10 898.7 12 761.7 635.2 463.0 15.935 2.687 1.347 7.271 2.669 2.517
Subtitles 1 13 503 11 843.3 54 487.7 72 454.4 575.2 386.4 4.625 1.284 0.281 1.616 0.600 1.967
euSubtitles 1 652 732.9 2 999.9 4 039.0 600.9 401.1 4.112 1.280 0.265 1.371 0.522 1.745
faSubtitles 1 6 556 6 594.8 32 635.9 38 097.3 520.5 472.5 4.973 1.368 0.338 2.363 0.974 2.301
fiCore-fiction 106 106 661.7 6 714.9 8 221.3 683.9 507.1 10.287 1.844 0.806 3.437 1.295 2.279
Core-nonfict 4 4 14.4 200.5 237.0 685.3 489.0 14.336 2.401 1.208 5.977 2.378 2.435
Core-misc 2 2 3.5 44.4 52.2 733.0 532.9 12.820 2.148 1.051 4.791 1.821 2.385
Acquis 1 18 563 1 310.5 15 264.2 19 702.1 556.9 380.4 13.209 2.369 0.886 6.990 2.588 2.647
Bible 2 66 48.0 542.6 675.3 529.0 351.4 13.324 1.911 0.871 4.231 1.534 2.511
Europarl 1 67 019 675.6 10 109.3 11 838.6 670.8 462.7 15.260 2.483 1.242 6.924 2.670 2.395
Subtitles 1 30 900 23 262.2 90 481.8 124 969.7 666.5 444.7 3.909 1.244 0.242 1.404 0.513 1.689
frCore-fiction 230 230 1 277.5 20 454.4 23 802.5 471.0 377.5 16.762 2.156 0.998 6.617 1.994 2.685
Core-nonfict 37 37 152.5 3 687.5 4 206.8 456.2 373.9 26.628 2.938 1.451 12.424 3.202 2.807
Core-misc 10 10 20.0 194.3 229.5 443.7 336.9 9.973 1.703 0.614 4.205 1.321 2.427
Acquis 1 19 057 1 338.5 26 298.4 31 764.2 353.5 289.2 22.521 2.416 0.946 13.347 3.212 3.144
Bible 2 66 50.6 762.6 886.3 384.9 285.9 17.822 2.060 0.893 6.743 2.171 2.758
Europarl 1 68 220 677.7 17 186.4 18 984.0 425.6 338.2 26.070 2.866 1.565 13.013 3.423 2.638
PressEurop 7 7 025 163.8 3 044.3 3 510.4 476.4 396.4 19.097 2.279 1.036 9.836 2.826 2.606
Subtitles 1 38 341 30 038.8 181 033.4 225 399.3 453.5 325.6 6.061 1.405 0.394 2.563 0.926 2.031
Syndicate 21 5 585 238.3 5 893.7 6 542.1 457.8 379.9 25.332 2.742 1.410 12.251 3.308 2.698
glSubtitles 1 146 121.7 622.1 797.9 529.5 411.1 5.144 1.339 0.323 2.602 0.940 1.958
heSubtitles 1 33 935 27 608.8 129 458.6 172 973.7 549.8 479.7 4.747 1.370 0.346 2.637 1.064 1.918
hiCore-fiction 7 7 35.6 402.8 462.1 449.6 348.6 11.386 1.586 0.524 4.610 1.625 2.692
Subtitles 1 54 81.0 429.9 526.0 401.1 324.0 5.336 1.156 0.146 2.358 0.838 2.190
hrCore-fiction 292 292 1 822.5 22 763.6 27 339.1 591.4 460.5 12.720 1.902 0.837 4.247 1.472 2.623
Core-nonfict 22 22 72.0 1 523.4 1 742.7 600.4 451.3 21.425 2.682 1.368 9.323 2.963 2.718
Core-misc 10 10 19.6 242.6 298.0 570.4 431.2 12.616 2.062 0.909 4.645 1.536 2.570
Bible 2 66 48.1 569.9 686.1 519.1 381.3 12.989 1.855 0.773 4.359 1.599 2.504
Subtitles 1 35 057 28 796.3 137 844.3 178 347.5 566.6 421.2 4.795 1.392 0.373 1.814 0.681 1.929
hsCore-fiction 8 8 36.2 405.3 512.1 503.3
Core-nonfict 1 1 1.9 24.4 29.5 571.5
Core-misc 4 4 3.5 36.6 44.7 513.6
huCore-fiction 87 87 573.2 6 890.1 8 657.7 603.6 499.1 12.888 1.709 0.706 3.698 1.367 2.759
Core-misc 2 2 6.1 28.9 39.5 568.2 457.3 4.817 1.269 0.254 1.817 0.650 2.100
Acquis 1 18 539 1 290.2 17 851.3 22 815.8 485.6 385.2 16.126 1.825 0.515 7.743 2.832 3.421
Europarl 1 66 229 677.3 12 187.9 14 266.5 591.1 469.4 18.625 2.202 1.013 7.465 2.741 2.799
Subtitles 1 41 067 31 962.7 141 559.0 194 622.6 586.7 466.0 4.609 1.261 0.268 1.644 0.627 1.859
Syndicate 3 9 0.5 8.4 9.8 598.4 481.5 16.869 2.080 0.933 6.351 2.436 2.685
hySubtitles 1 7 3.9 23.5 30.6 601.7 445.9 6.057 1.375 0.382 2.179 0.860 2.075
idSubtitles 1 8 350 8 112.7 37 824.9 49 694.7 475.7 401.9 4.699 1.344 0.317 2.343 0.911 1.742
isSubtitles 1 1 135 1 497.9 7 374.2 9 299.9 503.5 369.4 4.951 1.233 0.233 1.913 0.699 1.841
itCore-fiction 164 164 1 205.7 17 435.8 20 566.1 529.6 414.7 15.157 2.092 0.973 6.471 1.970 2.578
Core-nonfict 5 5 22.4 647.8 738.9 486.6 389.2 31.080 3.082 1.564 16.597 3.877 2.931
Core-misc 2 2 4.0 50.6 61.7 505.7 378.9 14.351 2.299 1.040 5.722 1.817 2.633
Acquis 1 18 893 1 345.7 23 892.0 29 413.1 390.7 306.5 20.391 2.112 0.766 13.152 3.242 3.156
Bible 2 65 47.3 685.2 806.6 421.8 317.0 16.561 1.969 0.881 6.739 2.168 2.723
Europarl 1 69 139 650.3 15 511.4 17 235.8 486.8 381.6 24.916 2.686 1.409 13.989 3.644 2.603
PressEurop 7 7 024 156.3 2 750.7 3 155.3 524.2 421.2 18.041 2.121 0.943 9.814 2.803 2.553
Subtitles 1 37 721 29 870.5 163 859.9 212 801.7 532.8 384.1 5.518 1.325 0.319 2.535 0.903 2.008
Syndicate 11 1 388 58.9 1 391.5 1 564.2 504.3 403.4 24.516 2.535 1.261 12.837 3.463 2.682
jaCore-fiction 33 33 201.2 3 766.7 4 262.0 365.4 336.3 18.928 3.094 1.432 8.630 2.666 2.697
Core-nonfict 1 1 7.0 163.1 184.2 361.2 334.5 23.420 3.540 1.720 11.650 3.490 2.810
Core-misc 1 1 2.1 64.9 75.9 280.9 257.8 31.520 4.300 1.990 16.990 4.490 3.160
Subtitles 1 2 326 2 086.3 12 141.5 13 495.4 381.9 348.5 6.212 1.417 0.375 3.221 1.312 1.909
Syndicate 1 2 0.1 2.5 2.9 385.4 372.0 38.705 4.330 2.015 20.881 4.923 3.215
kaSubtitles 1 204 198.4 871.1 1 179.0 380.8
kkSubtitles 1 4 4.1 13.9 19.2 657.7 607.3 3.389 1.243 0.247 1.761 0.892 1.603
koSubtitles 1 1 605 1 641.1 5 964.3 7 294.3 690.6 686.3 3.682 1.529 0.457 1.146 0.440 1.785
ltCore-fiction 20 20 61.4 669.1 842.6 685.2 545.5 11.223 1.901 0.813 3.957 1.479 2.487
Core-nonfict 1 1 1.3 17.4 23.1 657.2 492.4 14.180 2.190 0.930 6.670 2.230 2.600
Core-misc 2 2 1.2 7.2 9.0 764.9 628.4 6.184 1.430 0.409 2.921 1.136 1.887
Acquis 1 18 809 1 477.8 17 175.1 22 835.1 515.0 346.3 13.456 2.504 0.938 6.985 2.531 2.867
Bible 2 66 46.1 471.2 596.3 550.9 439.8 10.822 1.668 0.706 3.866 1.500 2.281
Europarl 1 67 719 688.5 11 198.5 13 475.2 627.2 441.4 16.816 3.016 1.607 7.683 2.906 2.469
Subtitles 1 1 025 1 345.9 5 247.7 7 353.0 624.6 461.7 3.923 1.278 0.286 1.552 0.569 1.760
lvCore-fiction 65 65 291.9 3 207.6 4 032.0 639.2 494.8 11.339 1.758 0.756 3.605 1.343 2.563
Core-nonfict 1 1 3.3 66.9 89.0 680.0 541.3 21.810 2.310 1.070 9.480 2.800 2.910
Core-misc 7 7 30.0 362.1 440.5 688.1 543.4 12.147 1.759 0.776 4.397 1.668 2.337
Acquis 1 18 348 1 486.3 17 519.4 23 361.6 490.0 340.4 13.790 2.296 0.831 7.109 2.492 2.865
Bible 2 66 40.1 536.7 671.7 495.5 343.1 13.645 1.663 0.754 4.180 1.602 2.658
Europarl 1 67 482 683.7 11 682.0 13 896.8 590.6 416.3 17.627 2.434 1.255 7.884 2.853 2.497
Subtitles 1 387 488.4 2 050.4 2 801.9 592.2 425.9 4.227 1.269 0.264 1.568 0.592 1.811
mkCore-fiction 104 104 694.6 8 794.5 10 571.7 464.3
Core-misc 4 4 12.1 86.5 109.3 422.0
Subtitles 1 3 433 3 201.0 15 112.0 20 217.5 412.3
mlSubtitles 1 285 365.3 1 258.4 1 793.5 489.8
msSubtitles 1 1 496 1 712.1 7 828.0 10 573.3 371.2
mtAcquis 1 8 963 784.8 13 805.0 16 643.6 373.4 1.0 20.381 2.683 1.141 11.437 3.347 2.933
nlCore-fiction 194 194 1 152.0 17 229.8 19 889.7 466.9 403.0 15.424 2.149 0.959 5.255 1.558 3.176
Core-nonfict 12 12 50.6 1 193.5 1 336.2 449.2 391.5 25.698 2.909 1.375 10.658 2.784 3.453
Core-misc 9 9 27.2 356.4 413.4 463.6 395.9 13.450 1.993 0.860 5.102 1.550 2.981
Acquis 1 18 975 1 483.9 23 401.1 28 140.1 356.2 317.3 18.005 2.217 0.766 9.491 2.375 3.553
Bible 2 66 45.8 716.8 821.3 386.8 326.1 17.940 2.264 1.067 5.942 1.936 3.042
Europarl 1 67 139 693.8 15 555.9 17 074.8 425.7 371.8 22.952 2.500 1.217 10.132 2.744 3.274
PressEurop 7 7 009 175.4 2 952.8 3 337.6 483.7 429.2 17.267 2.172 0.967 7.879 2.300 3.107
Subtitles 1 38 546 29 399.1 170 892.9 212 492.5 444.4 354.8 5.847 1.485 0.439 2.180 0.728 2.291
Syndicate 5 841 37.6 812.1 897.2 477.2 422.9 22.671 2.570 1.225 10.005 2.810 3.282
noCore-fiction 91 91 558.7 7 690.7 9 028.3 461.6 383.6 14.346 1.842 0.849 4.850 1.579 2.599
Core-nonfict 5 5 17.0 392.0 439.5 467.7 381.2 24.705 2.716 1.366 11.077 3.100 2.842
Core-misc 6 6 10.7 138.1 163.5 450.0 372.8 14.035 1.835 0.794 5.029 1.509 2.619
Bible 2 66 55.3 723.9 831.4 364.6 294.7 13.099 1.573 0.620 4.645 1.713 2.447
Subtitles 1 8 995 7 702.8 39 805.6 50 657.6 448.4 353.0 5.188 1.299 0.298 1.960 0.697 1.917
plCore-fiction 328 328 2 400.0 27 056.2 33 548.9 632.2 499.6 11.498 1.896 0.833 4.162 1.523 2.355
Core-nonfict 11 11 36.6 754.2 897.7 613.7 460.1 20.825 2.743 1.407 9.385 3.192 2.509
Core-misc 9 9 24.4 283.2 345.5 622.9 471.6 12.263 1.981 0.881 4.978 1.892 2.266
Acquis 1 19 024 1 657.3 19 482.9 24 945.6 481.4 350.6 13.373 2.035 0.714 7.737 2.681 2.622
Bible 2 66 48.2 576.1 712.9 537.0 387.8 12.695 1.724 0.727 4.479 1.725 2.397
Europarl 1 67 443 713.3 12 662.8 14 667.8 607.5 447.2 18.340 2.643 1.309 9.387 3.283 2.322
PressEurop 7 6 999 166.6 2 367.5 2 879.1 659.8 520.6 14.632 2.143 0.957 7.092 2.645 2.334
Subtitles 1 46 175 36 236.0 164 059.8 222 210.4 602.1 441.5 4.556 1.324 0.319 1.855 0.717 1.832
ptCore-fiction 82 82 519.8 7 204.0 8 608.5 511.2 408.1 14.436 2.299 1.142 6.372 2.041 2.497
Core-misc 5 5 6.9 81.3 96.0 495.3 388.9 12.461 2.159 0.977 6.238 1.780 2.486
Acquis 1 18 934 1 356.4 24 385.0 29 549.7 377.3 305.5 20.372 2.488 0.967 12.971 3.327 3.020
Bible 2 66 54.3 706.2 840.4 380.3 293.5 19.149 2.385 1.111 7.620 2.305 2.957
Europarl 1 65 920 648.7 15 188.4 17 127.0 467.5 379.1 24.202 3.093 1.726 13.821 3.724 2.591
PressEurop 7 6 967 160.9 2 782.5 3 286.5 507.4 422.0 17.848 2.388 1.150 10.138 2.951 2.536
Subtitles 1 54 342 43 730.9 229 480.2 294 774.7 495.5 360.1 5.278 1.449 0.432 2.528 0.955 1.939
Syndicate 8 747 32.4 738.5 839.0 489.9 405.1 23.875 2.980 1.575 12.669 3.544 2.646
rnCore-fiction 1 1 1.1 8.4 11.1 424.3
Core-misc 1 1 0.7 5.2 6.6 416.4
roCore-fiction 44 44 233.3 4 132.6 4 833.2 534.2 406.3 18.106 2.262 1.146 6.360 2.019 2.604
Core-misc 1 1 2.7 64.1 74.2 539.5 414.1 23.970 2.690 1.500 10.330 2.910 2.680
Acquis 1 6 318 650.0 8 043.5 9 884.4 405.3 301.4 14.150 2.221 0.770 7.930 2.544 2.900
Europarl 1 44 143 406.6 9 426.4 10 585.4 499.1 368.7 23.966 2.798 1.517 11.591 3.558 2.484
PressEurop 7 6 991 160.6 2 725.2 3 192.6 546.7 429.5 17.486 2.219 1.017 8.508 2.772 2.492
Subtitles 1 45 407 38 108.1 211 310.4 266 731.5 509.0 351.2 5.572 1.388 0.383 2.129 0.795 1.954
ruCore-fiction 144 144 1 043.5 11 757.6 14 913.7 633.0 501.9 11.643 1.959 0.865 4.203 1.557 2.386
Core-nonfict 10 10 30.6 518.7 625.2 645.0 495.9 17.765 2.613 1.223 8.126 2.801 2.603
Core-misc 6 6 12.8 143.8 180.7 633.2 484.5 11.439 1.947 0.870 4.378 1.718 2.265
Bible 2 66 39.0 565.5 703.9 486.6 346.2 20.730 2.746 1.302 6.198 2.121 2.828
Subtitles 1 27 195 21 625.8 104 831.9 141 586.8 574.9 428.1 4.878 1.423 0.401 1.930 0.744 1.887
Syndicate 21 5 418 233.5 4 312.8 5 110.5 637.5 487.3 19.037 2.653 1.288 9.232 3.298 2.424
siSubtitles 1 499 522.5 2 313.4 3 021.8 443.6
skCore-fiction 142 142 706.0 7 626.6 9 513.5 617.0 480.8 10.845 1.562 0.612 3.503 1.284 2.620
Core-nonfict 10 10 39.1 558.0 687.3 650.0 517.1 14.785 1.516 0.547 6.760 2.344 2.518
Core-misc 13 13 32.4 402.2 496.9 652.5 515.7 12.636 1.564 0.555 5.338 1.707 2.493
Acquis 1 18 302 1 363.0 18 398.8 23 542.1 482.7 353.1 15.458 1.732 0.516 8.677 2.746 3.029
Bible 2 65 46.9 560.8 690.8 520.0 373.4 12.716 1.615 0.662 4.178 1.576 2.567
Europarl 1 67 731 677.8 12 727.0 14 735.3 595.1 433.8 19.150 2.344 1.172 9.020 3.065 2.538
Subtitles 1 8 322 7 214.8 34 589.4 46 215.1 575.9 411.5 4.821 1.293 0.295 1.835 0.674 1.975
slCore-fiction 71 71 370.5 4 611.2 5 686.2 556.5 428.7 12.704 2.096 0.857 4.122 1.374 2.641
Core-nonfict 1 1 1.1 22.4 24.9 656.4 528.9 21.090 1.980 0.830 8.840 2.930 2.890
Core-misc 1 1 0.7 6.1 7.4 682.1 585.6 8.950 1.720 0.650 4.410 1.720 2.210
Acquis 1 17 414 1 399.2 18 510.4 24 069.9 466.2 335.6 15.345 1.810 0.580 8.359 2.683 2.841
Europarl 1 65 366 649.6 12 249.8 14 263.6 564.3 405.6 19.433 2.551 1.254 9.220 3.066 2.539
Subtitles 1 21 607 18 080.2 83 057.1 111 736.8 568.0 399.0 4.620 1.333 0.309 1.726 0.625 1.899
sqSubtitles 1 1 575 1 769.0 9 171.4 12 098.4 395.5
srCore-fiction 143 143 931.6 12 556.0 15 029.8 584.7 462.0 13.767 1.956 0.898 4.690 1.601 2.638
Core-nonfict 2 2 5.9 119.3 138.9 565.0 417.2 20.654 2.876 1.518 8.918 2.889 2.655
Core-misc 3 3 5.0 29.3 38.9 538.0 411.7 5.882 1.394 0.371 2.405 0.906 2.215
Subtitles 1 38 029 31 175.3 152 425.6 196 520.1 561.3 445.3 4.901 1.338 0.333 1.905 0.722 1.938
svCore-fiction 208 208 1 398.8 18 011.7 20 456.7 490.6 403.5 13.175 1.944 0.848 4.403 1.445 2.501
Core-nonfict 16 16 64.9 1 273.0 1 403.1 508.2 415.4 19.801 2.541 1.288 7.980 2.435 2.683
Core-misc 8 8 28.5 454.8 512.3 490.1 404.4 16.027 2.123 1.026 5.575 1.790 2.561
Acquis 1 17 133 1 285.5 19 443.0 23 283.7 402.1 327.7 16.286 1.913 0.705 8.700 2.448 2.784
Bible 2 66 43.9 637.9 731.7 414.2 323.2 14.907 1.947 0.895 4.760 1.703 2.542
Europarl 1 67 898 720.6 13 777.6 15 146.8 461.9 374.1 19.313 2.381 1.183 8.221 2.554 2.640
Subtitles 1 19 410 15 571.7 81 490.5 103 181.3 455.7 352.1 5.256 1.319 0.303 1.921 0.684 1.921
taSubtitles 1 20 29.4 104.0 141.8 511.8 434.1 3.562 1.196 0.171 1.673 0.639 1.807
teSubtitles 1 18 26.0 96.0 127.1 496.5 1.0 3.806 1.324 0.284 1.746 0.658 2.086
thSubtitles 1 3 932 3 457.0 5 626.0 7 288.3 658.1
tlSubtitles 1 5 8.0 37.0 52.7 344.9
trSubtitles 1 44 015 35 975.7 147 635.3 199 108.2 670.1 424.8 4.133 1.259 0.257 1.929 0.853 1.815
ukCore-fiction 192 192 1 260.0 14 478.3 18 490.6 626.6 506.4 11.923 2.047 0.892 4.187 1.507 2.377
Core-nonfict 5 5 19.1 333.0 416.1 621.1 469.6 19.193 2.945 1.432 8.468 2.909 2.517
Core-misc 2 2 4.0 38.9 50.3 614.9 484.8 9.801 1.851 0.774 3.366 1.282 2.254
Bible 2 66 41.5 596.1 738.1 475.7 352.8 14.784 1.804 0.777 4.921 1.751 2.585
Subtitles 1 1 006 813.4 3 779.0 5 123.2 571.4 461.9 4.684 1.360 0.334 1.853 0.710 1.897
urSubtitles 1 19 27.0 155.7 180.8 397.6 344.1 5.885 1.204 0.178 2.777 1.098 2.260
viSubtitles 1 3 468 3 304.5 19 281.4 23 984.0 446.3 403.8 5.931 1.508 0.458 2.351 0.945 1.849
zhCore-fiction 3 3 11.7 215.4 253.9 382.0 376.8 18.467 4.655 1.684 4.099 1.594 3.435
Subtitles 1 11 378 11 952.3 70 963.9 79 539.4 448.9 439.5 6.046 1.689 0.548 2.081 0.791 2.289
Syndicate 5 654 29.7 675.9 766.7 493.8 489.5 23.166 4.110 1.795 7.026 2.391 3.366
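
The page does not define the syntactic measures in the table above. As a rough orientation only, the sketch below assumes that mdd stands for mean dependency distance in its common form: the average absolute difference between the position of a token and the position of its head, with the root token left out. Whether InterCorp computes the column in exactly this way (for example, how punctuation is treated) is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def mean_dependency_distance(heads: Sequence[int]) -> float:
    """Mean dependency distance of one sentence.

    heads[i] is the 1-based position of the head of token i + 1,
    as in the HEAD column of a CoNLL-U file; 0 marks the root.
    Assumption: the root is skipped because it has no head in the sentence.
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    if not distances:
        return 0.0
    return sum(distances) / len(distances)


# "She gave him a book": She -> gave, gave = root, him -> gave,
# a -> book, book -> gave.
example_heads = [2, 0, 2, 5, 2]
print(mean_dependency_distance(example_heads))
```

For the example sentence the four dependencies have distances 1, 1, 1 and 3, so the function returns 1.5. A corpus-level figure like those in the table would be an average of such values, over sentences or over all dependencies; which of the two applies here is not stated on the page.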

Metadata

Metadata such as the text's title, author, or source language are available for most texts as attributes of structural elements such as text or sentence. To view the list of such attributes and to select those that should be displayed in the KonText query results, choose the relevant InterCorp 16ud language in the KonText corpus search tool, and then go to Structures or References in the Corpus-specific settings menu.

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

Pre-processing

Linguistic annotation

* UDPipe (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Nádvorníková, O., Rosen, A., Šimík, B., Vavřín, M., Zasina, A. J. (2024). The InterCorp Corpus – Czech2), version 16ud of ?? June 2024. Institute of the Czech National Corpus, Charles University, Prague 2024. Available on-line: https://kontext.korpus.cz/

1)
The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/UDPipe. Annotation of this release used the following models: TODO!!! arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.
2)
Insert languages actually used.