InterCorp Release 16ud – Universal Dependencies

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	154 391 397	362 409 841	461 601 109	5 732 688 636
Positions	Number of word forms	124 681 856	272 671 041	385 829 717	4 473 418 338
Structural attributes	Number of documents	1 812	33	4 643	338
	Number of texts	1 812	162 613	4 643	2 662 675
	Number of sentences	10 691 340	50 729 559	28 684 709	790 046 584
Further information	reference	YES
	representative	NO
	publication date	2024
	foreign languages	61
	tagged languages	48
	lemmatized languages	48
	syntactically annotated languages	48

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6).

Main features of release 16ud

For a detailed description of UD as used in the annotation of InterCorp see the Universal Dependencies entry in the glossary.
After 13ud, 16ud is the second release of InterCorp featuring linguistic annotation according to the Universal Dependencies scheme.
Release 16ud is the first CNC corpus to feature the metrics of syntactic complexity and lexical diversity.¹⁾
In release 16ud, out of the total number of 62 languages (including Czech), 48 are linguistically annotated; in addition, all such languages are syntactically annotated.
Texts are annotated in the same way in all languages, according to the UD standard (Universal Dependencies).
Annotation was performed for all languages by UDPipe, based on the data created in the UD project.²⁾

Texts in the corpus

InterCorp release 16ud contains the same texts as InterCorp release 16. They differ only in linguistic annotation. However, the token and word count data in 16ud may differ slightly due to a different tokenization method.

The core of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually checked. The other texts, grouped in collections, are aligned automatically without human intervention. The choice in the present release includes:

Political commentaries published by Project Syndicate (below referred to as Syndicate) and VoxEurop (formerly PressEurop)
A collection of legal texts of the European Union from the Acquis Communautaire corpus (Acquis)
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus (Europarl)
Film subtitles from the Open Subtitles database (Subtitles)
Translations of the Bible

In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16ud, published in September 2024, is 5 257 mil. words. This number includes 386 mil. words in the aligned foreign-language texts in the core part and 4 746 mil. words in the collections (all languages, including Czech). The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see Version history). The share of the core and the collections in the corpus is shown in the following charts, which give the volumes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
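
The word counts quoted above can be cross-checked against the overview table at the top of this page. The following is a minimal sketch in Python; the dictionary layout and the helper name `millions` are illustrative assumptions, and only the four word-form counts are taken from the table. Because the figures are computed from the raw counts, they may differ slightly from values rounded elsewhere in the prose.

```python
# Word-form counts copied from the overview table ("Number of word forms").
# The keys (language group, corpus part) are an illustrative layout only.
WORD_FORMS = {
    ("Czech", "core"): 124_681_856,
    ("Czech", "collections"): 272_671_041,
    ("other", "core"): 385_829_717,
    ("other", "collections"): 4_473_418_338,
}


def millions(count: int) -> int:
    """Round a raw count to whole millions."""
    return round(count / 1_000_000)


total = sum(WORD_FORMS.values())
core = sum(n for (_, part), n in WORD_FORMS.items() if part == "core")
collections = total - core

print("total:", millions(total))                # total: 5257
print("core, all languages:", millions(core))   # core, all languages: 511
print("collections, all languages:", millions(collections))  # collections, all languages: 4746
print("foreign core:", millions(WORD_FORMS[("other", "core")]))  # foreign core: 386
```

The same dictionary can be reused to derive the shares plotted in the charts, for example `core / total` for the proportion of the core in the whole corpus.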

The corpus in numbers

Number of texts in the Core

Language		Number of texts	including originals
ar	Arabic	3	1
be	Belarusian	108	14
bg	Bulgarian	87	19
ca	Catalan	92	1
cs	Czech	1 812	368
da	Danish	93	9
de	German	471	163
en	English	422	271
es	Spanish	355	142
et	Estonian	1	0
fi	Finnish	112	36
fr	French	277	126
hi	Hindi	7	2
hr	Croatian	324	37
hs	Upper Sorbian	13	5
hu	Hungarian	89	1
it	Italian	171	26
ja	Japanese	35	15
lt	Lithuanian	23	4
lv	Latvian	73	15
mk	Macedonian	108	4
nl	Dutch	215	52
no	Norwegian	102	23
pl	Polish	348	54
pt	Portuguese	87	24
rn	Romani	2	2
ro	Romanian	45	5
ru	Russian	160	37
sk	Slovak	165	62
sl	Slovene	73	25
sr	Serbian	148	8
sv	Swedish	232	101
uk	Ukrainian	199	8
zh	Chinese	3	3
TOTAL		6 455	1 668

In the tables below, the Core part of the corpus is split according to text type into fiction (Core-fiction), non-fiction (Core-nonfiction), and miscellaneous (Core-misc, including drama, poetry or children's literature).

Corpus size by collection

Collection	Number of		Thousands of
Collection	docs	texts	sentences	words	tokens
Core-fiction	5 879	5 879	37 270	473 208	572 187
Core-misc	226	226	623	7 853	9 424
Core-nonfiction	350	350	1 483	29 450	34 381
Acquis	22	380 049	28 903	424 874	531 415
Bible	38	1 252	899	12 050	14 405
Europarl	21	1 369 378	13 709	276 543	315 134
PressEurop	70	69 894	1 637	26 964	31 538
Subtitles	58	965 557	793 931	3 970 273	5 162 184
Syndicate	162	39 158	1 697	35 385	40 423
TOTAL	6 826	2 831 743	880 152	5 256 601	6 711 091
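
The tables on this page write large numbers with a space as the thousands separator, which most parsers do not accept directly. The sketch below shows one way to read the table above and to compare the column sums with the stated TOTAL row. It assumes the table has been saved as tab-separated text; the names `TABLE`, `parse_count` and `parse_table` are hypothetical.

```python
# The "Corpus size by collection" table as tab-separated text.
# Columns: docs, texts, then sentences, words and tokens in thousands.
TABLE = """\
Core-fiction\t5 879\t5 879\t37 270\t473 208\t572 187
Core-misc\t226\t226\t623\t7 853\t9 424
Core-nonfiction\t350\t350\t1 483\t29 450\t34 381
Acquis\t22\t380 049\t28 903\t424 874\t531 415
Bible\t38\t1 252\t899\t12 050\t14 405
Europarl\t21\t1 369 378\t13 709\t276 543\t315 134
PressEurop\t70\t69 894\t1 637\t26 964\t31 538
Subtitles\t58\t965 557\t793 931\t3 970 273\t5 162 184
Syndicate\t162\t39 158\t1 697\t35 385\t40 423
TOTAL\t6 826\t2 831 743\t880 152\t5 256 601\t6 711 091
"""

COLUMNS = ["docs", "texts", "sentences", "words", "tokens"]


def parse_count(cell: str) -> int:
    """Convert a cell such as '2 831 743' to an int (plain or no-break spaces)."""
    return int(cell.replace("\u00a0", "").replace(" ", ""))


def parse_table(text: str) -> dict[str, list[int]]:
    """Map each row label to its list of integer cells."""
    rows = {}
    for line in text.strip().splitlines():
        name, *cells = line.split("\t")
        rows[name] = [parse_count(cell) for cell in cells]
    return rows


rows = parse_table(TABLE)
stated = rows.pop("TOTAL")
summed = [sum(column) for column in zip(*rows.values())]

# Difference between the summed rows and the stated TOTAL, per column.
diffs = dict(zip(COLUMNS, (s - t for s, t in zip(summed, stated))))
print(diffs)
# {'docs': 0, 'texts': 0, 'sentences': 0, 'words': -1, 'tokens': 0}
```

The difference of one in the words column is a rounding effect: each row is rounded to whole thousands independently, so a comparison of this kind should allow a small tolerance for the columns given in thousands.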

Corpus size by language

Lang	Number of		Thousands of
Lang	docs	texts	sentences	words	tokens
af	1	24	23.0	134.6	161.7
ar	7	34 629	28 748.8	126 614.3	157 671.0
be	108	108	632.7	7 126.4	9 054.9
bg	90	97 190	34 421.2	194 375.7	250 957.1
bn	1	252	363.8	1 517.7	2 072.1
br	1	27	19.7	97.4	145.2
bs	1	14 208	12 165.3	56 465.9	75 945.3
ca	95	828	1 201.8	13 381.4	15 617.1
cs	1 845	164 425	61 420.9	397 352.9	516 801.2
da	98	101 609	16 583.0	115 590.0	146 193.4
de	504	115 755	23 827.8	181 773.9	229 774.0
el	3	125 684	33 174.5	200 922.9	254 776.7
en	455	157 490	54 572.6	357 080.3	449 890.9
eo	1	46	48.4	221.0	305.4
es	386	150 798	45 280.2	305 112.0	388 664.2
et	4	100 709	13 904.0	80 349.3	104 726.8
eu	1	652	732.9	2 999.9	4 039.0
fa	1	6 556	6 594.8	32 635.9	38 097.3
fi	117	116 660	25 976.1	123 357.7	165 696.1
fr	310	138 571	33 957.7	258 555.1	315 325.2
gl	1	146	121.7	622.1	797.9
he	1	33 935	27 608.8	129 458.6	172 973.7
hi	8	61	116.6	832.7	988.1
hr	327	35 447	30 758.6	162 943.8	208 413.5
hs	13	13	41.6	466.3	586.3
hu	95	125 933	34 510.0	178 525.6	240 411.9
hy	1	7	3.9	23.5	30.6
id	1	8 350	8 112.7	37 824.9	49 694.7
is	1	1 135	1 497.9	7 374.2	9 299.9
it	194	134 401	33 361.2	226 224.9	286 343.4
ja	37	2 363	2 296.7	16 138.6	18 020.3
ka	1	204	198.4	871.1	1 179.0
kk	1	4	4.1	13.9	19.2
ko	1	1 605	1 641.1	5 964.3	7 294.3
lt	28	87 642	3 622.1	34 786.3	45 134.4
lv	78	86 356	3 023.6	35 425.1	45 293.5
mk	109	3 541	3 907.8	23 993.1	30 898.6
ml	1	285	365.3	1 258.4	1 793.5
ms	1	1 496	1 712.1	7 828.0	10 573.3
mt	1	8 963	784.8	13 805.0	16 643.6
nl	232	132 791	33 065.4	233 111.3	284 402.6
no	105	9 163	8 344.6	48 750.2	61 120.3
pl	360	140 055	41 282.4	227 242.6	300 207.8
pt	107	147 063	46 510.1	280 566.2	355 121.8
rn	2	2	1.7	13.6	17.7
ro	55	102 904	39 561.2	235 702.3	295 301.3
ru	184	32 839	22 985.2	122 130.4	163 120.7
si	1	499	522.5	2 313.4	3 021.8
sk	170	94 585	10 080.0	74 862.7	95 881.0
sl	76	104 460	20 501.3	118 457.1	155 788.9
sq	1	1 575	1 769.0	9 171.4	12 098.4
sr	149	38 177	32 117.7	165 130.2	211 727.6
sv	237	104 739	19 113.9	135 088.4	164 715.5
ta	1	20	29.4	104.0	141.8
te	1	18	26.0	96.0	127.1
th	1	3 932	3 457.0	5 626.0	7 288.3
tl	1	5	8.0	37.0	52.7
tr	1	44 015	35 975.7	147 635.3	199 108.2
uk	202	1 271	2 138.0	19 225.4	24 818.3
ur	1	19	27.0	155.7	180.8
vi	1	3 468	3 304.5	19 281.4	23 984.0
zh	9	12 035	11 993.7	71 855.3	80 560.0
TOTAL	6 826	2 831 743	880 152.2	5 256 601.0	6 711 091.0

Corpus size in thousands of words by language and collection

Lang	Core-fiction	Core-misc	Core-nonfiction	Acquis	Bible	Europarl	PressEurop	Subtitles	Syndicate	TOTAL
af	–	–	–	–	–	–	–	134.6	–	134.6
ar	28.8	5.5	–	–	–	–	–	126 195.5	384.5	126 614.3
be	7 068.7	57.7	–	–	–	–	–	–	–	7 126.4
bg	7 067.3	–	–	13 582.3	–	9 082.0	–	164 644.1	–	194 375.7
bn	–	–	–	–	–	–	–	1 517.7	–	1 517.7
br	–	–	–	–	–	–	–	97.4	–	97.4
bs	–	–	–	–	–	–	–	56 465.9	–	56 465.9
ca	9 951.3	9.7	–	–	728.2	–	–	2 692.1	–	13 381.4
cs	113 632.3	2 637.1	8 412.5	19 188.9	562.5	12 918.7	2 313.3	232 969.1	4 718.6	397 352.9
da	9 460.8	11.9	56.0	20 014.9	655.2	13 800.4	–	71 590.8	–	115 590.0
de	35 653.3	1 066.1	4 037.3	20 716.9	725.0	13 156.2	2 506.5	98 808.9	5 103.7	181 773.9
el	–	–	–	23 684.5	–	15 381.7	–	161 856.7	–	200 922.9
en	36 519.3	778.3	4 618.7	23 062.9	727.6	15 593.0	2 663.8	267 843.8	5 272.8	357 080.3
eo	–	–	–	–	–	–	–	221.0	–	221.0
es	29 664.1	165.1	830.9	26 269.3	–	16 248.5	2 857.8	223 006.0	6 070.2	305 112.0
et	78.8	–	–	14 884.2	–	10 898.7	–	54 487.7	–	80 349.3
eu	–	–	–	–	–	–	–	2 999.9	–	2 999.9
fa	–	–	–	–	–	–	–	32 635.9	–	32 635.9
fi	6 714.9	44.4	200.5	15 264.2	542.6	10 109.3	–	90 481.8	–	123 357.7
fr	20 454.4	194.3	3 687.5	26 298.4	762.6	17 186.4	3 044.3	181 033.4	5 893.7	258 555.1
gl	–	–	–	–	–	–	–	622.1	–	622.1
he	–	–	–	–	–	–	–	129 458.6	–	129 458.6
hi	402.8	–	–	–	–	–	–	429.9	–	832.7
hr	22 763.6	242.6	1 523.4	–	569.9	–	–	137 844.3	–	162 943.8
hs	405.3	36.6	24.4	–	–	–	–	–	–	466.3
hu	6 890.1	28.9	–	17 851.3	–	12 187.9	–	141 559.0	8.4	178 525.6
hy	–	–	–	–	–	–	–	23.5	–	23.5
id	–	–	–	–	–	–	–	37 824.9	–	37 824.9
is	–	–	–	–	–	–	–	7 374.2	–	7 374.2
it	17 435.8	50.6	647.8	23 892.0	685.2	15 511.4	2 750.7	163 859.9	1 391.5	226 224.9
ja	3 766.7	64.9	163.1	–	–	–	–	12 141.5	2.5	16 138.6
ka	–	–	–	–	–	–	–	871.1	–	871.1
kk	–	–	–	–	–	–	–	13.9	–	13.9
ko	–	–	–	–	–	–	–	5 964.3	–	5 964.3
lt	669.1	7.2	17.4	17 175.1	471.2	11 198.5	–	5 247.7	–	34 786.3
lv	3 207.6	362.1	66.9	17 519.4	536.7	11 682.0	–	2 050.4	–	35 425.1
mk	8 794.5	86.5	–	–	–	–	–	15 112.0	–	23 993.1
ml	–	–	–	–	–	–	–	1 258.4	–	1 258.4
ms	–	–	–	–	–	–	–	7 828.0	–	7 828.0
mt	–	–	–	13 805.0	–	–	–	–	–	13 805.0
nl	17 229.8	356.4	1 193.5	23 401.1	716.8	15 555.9	2 952.8	170 892.9	812.1	233 111.3
no	7 690.7	138.1	392.0	–	723.9	–	–	39 805.6	–	48 750.2
pl	27 056.2	283.2	754.2	19 482.9	576.1	12 662.8	2 367.5	164 059.8	–	227 242.6
pt	7 204.0	81.3	–	24 385.0	706.2	15 188.4	2 782.5	229 480.2	738.5	280 566.2
rn	8.4	5.2	–	–	–	–	–	–	–	13.6
ro	4 132.6	64.1	–	8 043.5	–	9 426.4	2 725.2	211 310.4	–	235 702.3
ru	11 757.6	143.8	518.7	–	565.5	–	–	104 831.9	4 312.8	122 130.4
si	–	–	–	–	–	–	–	2 313.4	–	2 313.4
sk	7 626.6	402.2	558.0	18 398.8	560.8	12 727.0	–	34 589.4	–	74 862.7
sl	4 611.2	6.1	22.4	18 510.4	–	12 249.8	–	83 057.1	–	118 457.1
sq	–	–	–	–	–	–	–	9 171.4	–	9 171.4
sr	12 556.0	29.3	119.3	–	–	–	–	152 425.6	–	165 130.2
sv	18 011.7	454.8	1 273.0	19 443.0	637.9	13 777.6	–	81 490.5	–	135 088.4
ta	–	–	–	–	–	–	–	104.0	–	104.0
te	–	–	–	–	–	–	–	96.0	–	96.0
th	–	–	–	–	–	–	–	5 626.0	–	5 626.0
tl	–	–	–	–	–	–	–	37.0	–	37.0
tr	–	–	–	–	–	–	–	147 635.3	–	147 635.3
uk	14 478.3	38.9	333.0	–	596.1	–	–	3 779.0	–	19 225.4
ur	–	–	–	–	–	–	–	155.7	–	155.7
vi	–	–	–	–	–	–	–	19 281.4	–	19 281.4
zh	215.4	–	–	–	–	–	–	70 963.9	675.9	71 855.3
TOTAL	473 208.2	7 852.9	29 450.5	424 874.2	12 050.1	276 542.6	26 964.4	3 970 272.9	35 385.2	5 256 601.0

Detailed statistics

In addition to the corpus size data, the table also includes measures of syntactic complexity and lexical diversity. For languages without linguistic annotation, the table shows only the word-form-based measure of lexical diversity (lexDivWord).

Lang	Collection	Number of		Thousands of			Lexical diversity		Syntactic complexity (average)
Lang	Collection	docs	texts	sentences	words	tokens	lexDivWord	lexDivLemma	sLength	subRatio	maxTreeDepth	maxNPLength	maxNPDepth	mdd
af	Subtitles	1	24	23.0	134.6	161.7	406.4	347.2	5.887	1.093	0.095	2.377	0.811	2.251
ar	Core-fiction	2	2	2.1	28.8	35.6	620.3	576.6	13.830	2.712	1.310	5.293	2.016	2.817
	Core-misc	1	1	1.3	5.5	7.4	451.4	421.4	4.150	1.330	0.290	1.870	0.840	2.010
	Subtitles	1	34 193	28 726.4	126 195.5	157 188.9	592.8	557.3	4.421	1.338	0.336	2.216	0.986	1.678
	Syndicate	3	433	19.0	384.5	439.0	622.7	560.3	20.513	2.485	1.312	11.036	3.940	2.405
be	Core-fiction	104	104	625.1	7 068.7	8 978.9	615.4	492.7	11.583	1.865	0.804	4.122	1.436	2.316
be	Core-misc	4	4	7.6	57.7	76.0	556.2	425.6	7.608	1.672	0.605	2.870	1.002	2.254
bg	Core-fiction	87	87	559.6	7 067.3	8 597.7	548.3	439.5	13.125	1.728	0.732	4.255	1.532	2.497
	Acquis	1	10 846	862.3	13 582.3	16 991.2	392.4	306.3	18.073	1.801	0.514	9.389	2.805	3.265
	Europarl	1	45 271	408.3	9 082.0	10 379.8	498.4	386.3	23.014	2.538	1.263	10.961	3.402	2.581
	Subtitles	1	40 986	32 591.1	164 644.1	214 988.4	518.2	384.6	5.089	1.336	0.322	1.861	0.706	1.931
bn	Subtitles	1	252	363.8	1 517.7	2 072.1	419.4	–	–	–	–	–	–	–
br	Subtitles	1	27	19.7	97.4	145.2	363.5	–	–	–	–	–	–	–
bs	Subtitles	1	14 208	12 165.3	56 465.9	75 945.3	450.2	–	–	–	–	–	–	–
ca	Core-fiction	91	91	678.0	9 951.3	11 363.4	471.6	375.2	15.579	2.140	0.962	6.099	1.920	2.551
	Core-misc	1	1	0.7	9.7	11.2	463.7	362.5	14.300	2.040	0.930	5.850	1.880	2.520
	Bible	2	66	50.3	728.2	839.4	405.3	308.0	15.729	2.056	0.912	6.460	2.103	2.602
	Subtitles	1	670	472.8	2 692.1	3 403.2	487.0	346.8	5.726	1.379	0.352	2.617	0.926	2.028
cs	Core-fiction	1 629	1 629	9 979.9	113 632.3	141 075.8	629.8	484.2	11.722	1.702	0.723	4.078	1.459	2.486
	Core-nonfict	113	113	488.9	8 412.5	10 107.3	649.3	501.8	18.099	2.107	1.004	8.159	2.685	2.607
	Core-misc	70	70	222.5	2 637.1	3 208.3	639.0	492.3	12.264	1.721	0.704	5.105	1.778	2.412
	Acquis	1	19 269	1 351.5	19 188.9	25 140.4	472.1	346.5	16.575	1.745	0.536	9.788	2.858	3.025
	Bible	2	66	51.0	562.5	692.9	537.1	372.0	11.907	1.603	0.635	4.125	1.590	2.451
	Europarl	1	69 482	685.3	12 918.7	15 030.4	600.9	435.0	19.380	2.428	1.256	9.361	3.180	2.527
	PressEurop	7	7 060	170.0	2 313.3	2 786.6	669.3	522.4	14.002	1.895	0.810	7.023	2.498	2.457
	Subtitles	1	60 619	48 207.7	232 969.1	313 262.9	589.7	406.3	4.866	1.307	0.319	1.862	0.694	1.971
	Syndicate	21	6 117	264.0	4 718.6	5 496.6	655.9	506.1	18.410	2.162	1.059	8.528	2.975	2.552
da	Core-fiction	90	90	685.3	9 460.8	11 273.9	464.6	388.7	14.334	1.712	0.694	4.949	1.649	2.514
	Core-nonfict	1	1	2.7	56.0	64.2	447.6	364.4	21.690	2.250	1.070	9.140	2.900	2.670
	Core-misc	2	2	0.8	11.9	14.2	441.6	363.4	14.515	1.714	0.728	5.350	1.836	2.466
	Acquis	1	18 263	1 566.7	20 014.9	25 402.6	395.0	333.1	14.462	1.647	0.485	8.314	2.491	2.762
	Bible	2	66	46.1	655.2	782.3	389.8	318.7	18.349	1.970	0.843	5.542	1.828	2.811
	Europarl	1	67 202	721.6	13 800.4	15 775.5	448.2	376.6	19.372	2.025	0.910	9.165	2.947	2.597
	Subtitles	1	15 985	13 559.9	71 590.8	92 880.6	438.1	346.4	5.338	1.184	0.190	1.985	0.701	1.925
de	Core-fiction	412	412	2 603.0	35 653.3	43 380.5	515.3	421.1	14.176	1.775	0.702	4.819	1.458	3.095
	Core-nonfict	43	43	205.6	4 037.3	4 754.4	525.3	434.8	20.302	2.015	0.862	8.836	2.456	3.384
	Core-misc	16	16	63.4	1 066.1	1 255.8	515.6	425.5	17.694	1.942	0.817	7.345	2.138	3.219
	Acquis	1	18 782	1 451.4	20 716.9	26 206.7	407.9	343.3	16.124	1.506	0.388	9.197	2.496	3.519
	Bible	2	66	49.2	725.0	854.0	395.2	302.4	15.637	1.648	0.657	5.263	1.737	2.998
	Europarl	1	62 391	661.2	13 156.2	15 169.0	487.1	396.8	20.448	2.074	0.914	9.361	2.646	3.473
	PressEurop	7	6 909	175.9	2 506.5	3 013.6	545.0	456.2	14.623	1.702	0.621	6.859	2.124	3.123
	Subtitles	1	21 322	18 354.4	98 808.9	129 234.7	489.6	380.9	5.414	1.240	0.231	2.119	0.712	2.271
	Syndicate	21	5 814	263.6	5 103.7	5 905.3	541.1	453.0	19.817	2.000	0.867	8.766	2.590	3.380
el	Acquis	1	18 904	1 432.0	23 684.5	28 955.7	409.0	313.2	17.722	1.884	0.707	10.688	2.957	2.690
	Europarl	1	68 069	623.6	15 381.7	17 233.2	488.5	366.9	25.498	2.664	1.379	12.485	3.413	2.682
	Subtitles	1	38 711	31 118.9	161 856.7	208 587.8	516.7	376.8	6.335	1.613	0.519	2.566	0.881	2.083
en	Core-fiction	366	366	2 701.0	36 519.3	43 557.4	466.2	403.2	14.159	2.107	0.945	5.371	1.689	2.576
	Core-nonfict	39	39	216.2	4 618.7	5 302.9	466.7	412.4	22.976	2.623	1.292	10.373	2.893	2.793
	Core-misc	17	17	53.4	778.3	905.9	455.8	393.7	15.091	2.160	0.967	6.561	1.987	2.557
	Acquis	1	18 930	1 327.2	23 062.9	28 075.3	346.1	307.3	20.073	2.193	0.806	11.086	2.912	3.176
	Bible	2	66	47.5	727.6	843.4	354.0	296.2	17.458	2.166	1.051	6.271	2.125	2.608
	Europarl	1	69 283	680.9	15 593.0	17 455.0	411.9	362.9	23.743	2.692	1.402	11.274	3.135	2.736
	PressEurop	7	7 019	152.5	2 663.8	3 107.7	485.4	431.4	18.016	2.286	1.033	8.828	2.614	2.689
	Subtitles	1	55 657	49 130.9	267 843.8	344 553.0	445.1	362.4	5.491	1.401	0.372	2.273	0.811	2.067
	Syndicate	21	6 113	263.1	5 272.8	6 090.3	494.2	438.7	20.792	2.447	1.186	9.516	2.843	2.733
eo	Subtitles	1	46	48.4	221.0	305.4	384.4	–	–	–	–	–	–	–
es	Core-fiction	338	338	1 981.3	29 664.1	34 294.8	495.7	400.3	15.586	2.176	0.974	6.243	1.919	2.574
	Core-nonfict	10	10	29.6	830.9	932.4	446.5	361.9	29.055	2.939	1.456	13.399	3.468	2.797
	Core-misc	7	7	15.0	165.1	198.8	475.1	370.6	11.674	1.781	0.662	4.887	1.575	2.382
	Acquis	1	19 056	1 333.1	26 269.3	31 277.0	348.0	290.7	22.339	1.851	0.588	12.954	3.099	3.098
	Europarl	1	67 754	660.7	16 248.5	18 032.0	437.6	353.4	25.496	2.614	1.350	12.798	3.348	2.618
	PressEurop	7	6 891	154.6	2 857.8	3 268.7	478.0	399.9	18.995	2.144	0.940	9.483	2.729	2.567
	Subtitles	1	50 705	40 849.5	223 006.0	293 901.1	498.6	355.5	5.499	1.404	0.373	2.378	0.862	1.972
	Syndicate	21	6 037	256.4	6 070.2	6 759.4	462.1	384.1	24.411	2.437	1.189	11.558	3.194	2.675
et	Core-fiction	1	1	6.7	78.8	96.1	626.8	478.0	11.790	2.020	0.920	4.200	1.540	2.530
	Acquis	1	18 727	1 349.8	14 884.2	19 414.5	543.8	404.0	13.084	2.744	0.961	6.654	2.304	2.972
	Europarl	1	68 478	704.3	10 898.7	12 761.7	635.2	463.0	15.935	2.687	1.347	7.271	2.669	2.517
	Subtitles	1	13 503	11 843.3	54 487.7	72 454.4	575.2	386.4	4.625	1.284	0.281	1.616	0.600	1.967
eu	Subtitles	1	652	732.9	2 999.9	4 039.0	600.9	401.1	4.112	1.280	0.265	1.371	0.522	1.745
fa	Subtitles	1	6 556	6 594.8	32 635.9	38 097.3	520.5	472.5	4.973	1.368	0.338	2.363	0.974	2.301
fi	Core-fiction	106	106	661.7	6 714.9	8 221.3	683.9	507.1	10.287	1.844	0.806	3.437	1.295	2.279
	Core-nonfict	4	4	14.4	200.5	237.0	685.3	489.0	14.336	2.401	1.208	5.977	2.378	2.435
	Core-misc	2	2	3.5	44.4	52.2	733.0	532.9	12.820	2.148	1.051	4.791	1.821	2.385
	Acquis	1	18 563	1 310.5	15 264.2	19 702.1	556.9	380.4	13.209	2.369	0.886	6.990	2.588	2.647
	Bible	2	66	48.0	542.6	675.3	529.0	351.4	13.324	1.911	0.871	4.231	1.534	2.511
	Europarl	1	67 019	675.6	10 109.3	11 838.6	670.8	462.7	15.260	2.483	1.242	6.924	2.670	2.395
	Subtitles	1	30 900	23 262.2	90 481.8	124 969.7	666.5	444.7	3.909	1.244	0.242	1.404	0.513	1.689
fr	Core-fiction	230	230	1 277.5	20 454.4	23 802.5	471.0	377.5	16.762	2.156	0.998	6.617	1.994	2.685
	Core-nonfict	37	37	152.5	3 687.5	4 206.8	456.2	373.9	26.628	2.938	1.451	12.424	3.202	2.807
	Core-misc	10	10	20.0	194.3	229.5	443.7	336.9	9.973	1.703	0.614	4.205	1.321	2.427
	Acquis	1	19 057	1 338.5	26 298.4	31 764.2	353.5	289.2	22.521	2.416	0.946	13.347	3.212	3.144
	Bible	2	66	50.6	762.6	886.3	384.9	285.9	17.822	2.060	0.893	6.743	2.171	2.758
	Europarl	1	68 220	677.7	17 186.4	18 984.0	425.6	338.2	26.070	2.866	1.565	13.013	3.423	2.638
	PressEurop	7	7 025	163.8	3 044.3	3 510.4	476.4	396.4	19.097	2.279	1.036	9.836	2.826	2.606
	Subtitles	1	38 341	30 038.8	181 033.4	225 399.3	453.5	325.6	6.061	1.405	0.394	2.563	0.926	2.031
	Syndicate	21	5 585	238.3	5 893.7	6 542.1	457.8	379.9	25.332	2.742	1.410	12.251	3.308	2.698
gl	Subtitles	1	146	121.7	622.1	797.9	529.5	411.1	5.144	1.339	0.323	2.602	0.940	1.958
he	Subtitles	1	33 935	27 608.8	129 458.6	172 973.7	549.8	479.7	4.747	1.370	0.346	2.637	1.064	1.918
hi	Core-fiction	7	7	35.6	402.8	462.1	449.6	348.6	11.386	1.586	0.524	4.610	1.625	2.692
hi	Subtitles	1	54	81.0	429.9	526.0	401.1	324.0	5.336	1.156	0.146	2.358	0.838	2.190
hr	Core-fiction	292	292	1 822.5	22 763.6	27 339.1	591.4	460.5	12.720	1.902	0.837	4.247	1.472	2.623
	Core-nonfict	22	22	72.0	1 523.4	1 742.7	600.4	451.3	21.425	2.682	1.368	9.323	2.963	2.718
	Core-misc	10	10	19.6	242.6	298.0	570.4	431.2	12.616	2.062	0.909	4.645	1.536	2.570
	Bible	2	66	48.1	569.9	686.1	519.1	381.3	12.989	1.855	0.773	4.359	1.599	2.504
	Subtitles	1	35 057	28 796.3	137 844.3	178 347.5	566.6	421.2	4.795	1.392	0.373	1.814	0.681	1.929
hs	Core-fiction	8	8	36.2	405.3	512.1	503.3	–	–	–	–	–	–	–
	Core-nonfict	1	1	1.9	24.4	29.5	571.5	–	–	–	–	–	–	–
	Core-misc	4	4	3.5	36.6	44.7	513.6	–	–	–	–	–	–	–
hu	Core-fiction	87	87	573.2	6 890.1	8 657.7	603.6	499.1	12.888	1.709	0.706	3.698	1.367	2.759
	Core-misc	2	2	6.1	28.9	39.5	568.2	457.3	4.817	1.269	0.254	1.817	0.650	2.100
	Acquis	1	18 539	1 290.2	17 851.3	22 815.8	485.6	385.2	16.126	1.825	0.515	7.743	2.832	3.421
	Europarl	1	66 229	677.3	12 187.9	14 266.5	591.1	469.4	18.625	2.202	1.013	7.465	2.741	2.799
	Subtitles	1	41 067	31 962.7	141 559.0	194 622.6	586.7	466.0	4.609	1.261	0.268	1.644	0.627	1.859
	Syndicate	3	9	0.5	8.4	9.8	598.4	481.5	16.869	2.080	0.933	6.351	2.436	2.685
hy	Subtitles	1	7	3.9	23.5	30.6	601.7	445.9	6.057	1.375	0.382	2.179	0.860	2.075
id	Subtitles	1	8 350	8 112.7	37 824.9	49 694.7	475.7	401.9	4.699	1.344	0.317	2.343	0.911	1.742
is	Subtitles	1	1 135	1 497.9	7 374.2	9 299.9	503.5	369.4	4.951	1.233	0.233	1.913	0.699	1.841
it	Core-fiction	164	164	1 205.7	17 435.8	20 566.1	529.6	414.7	15.157	2.092	0.973	6.471	1.970	2.578
	Core-nonfict	5	5	22.4	647.8	738.9	486.6	389.2	31.080	3.082	1.564	16.597	3.877	2.931
	Core-misc	2	2	4.0	50.6	61.7	505.7	378.9	14.351	2.299	1.040	5.722	1.817	2.633
	Acquis	1	18 893	1 345.7	23 892.0	29 413.1	390.7	306.5	20.391	2.112	0.766	13.152	3.242	3.156
	Bible	2	65	47.3	685.2	806.6	421.8	317.0	16.561	1.969	0.881	6.739	2.168	2.723
	Europarl	1	69 139	650.3	15 511.4	17 235.8	486.8	381.6	24.916	2.686	1.409	13.989	3.644	2.603
	PressEurop	7	7 024	156.3	2 750.7	3 155.3	524.2	421.2	18.041	2.121	0.943	9.814	2.803	2.553
	Subtitles	1	37 721	29 870.5	163 859.9	212 801.7	532.8	384.1	5.518	1.325	0.319	2.535	0.903	2.008
	Syndicate	11	1 388	58.9	1 391.5	1 564.2	504.3	403.4	24.516	2.535	1.261	12.837	3.463	2.682
ja	Core-fiction	33	33	201.2	3 766.7	4 262.0	365.4	336.3	18.928	3.094	1.432	8.630	2.666	2.697
	Core-nonfict	1	1	7.0	163.1	184.2	361.2	334.5	23.420	3.540	1.720	11.650	3.490	2.810
	Core-misc	1	1	2.1	64.9	75.9	280.9	257.8	31.520	4.300	1.990	16.990	4.490	3.160
	Subtitles	1	2 326	2 086.3	12 141.5	13 495.4	381.9	348.5	6.212	1.417	0.375	3.221	1.312	1.909
	Syndicate	1	2	0.1	2.5	2.9	385.4	372.0	38.705	4.330	2.015	20.881	4.923	3.215
ka	Subtitles	1	204	198.4	871.1	1 179.0	380.8	–	–	–	–	–	–	–
kk	Subtitles	1	4	4.1	13.9	19.2	657.7	607.3	3.389	1.243	0.247	1.761	0.892	1.603
ko	Subtitles	1	1 605	1 641.1	5 964.3	7 294.3	690.6	686.3	3.682	1.529	0.457	1.146	0.440	1.785
lt	Core-fiction	20	20	61.4	669.1	842.6	685.2	545.5	11.223	1.901	0.813	3.957	1.479	2.487
	Core-nonfict	1	1	1.3	17.4	23.1	657.2	492.4	14.180	2.190	0.930	6.670	2.230	2.600
	Core-misc	2	2	1.2	7.2	9.0	764.9	628.4	6.184	1.430	0.409	2.921	1.136	1.887
	Acquis	1	18 809	1 477.8	17 175.1	22 835.1	515.0	346.3	13.456	2.504	0.938	6.985	2.531	2.867
	Bible	2	66	46.1	471.2	596.3	550.9	439.8	10.822	1.668	0.706	3.866	1.500	2.281
	Europarl	1	67 719	688.5	11 198.5	13 475.2	627.2	441.4	16.816	3.016	1.607	7.683	2.906	2.469
	Subtitles	1	1 025	1 345.9	5 247.7	7 353.0	624.6	461.7	3.923	1.278	0.286	1.552	0.569	1.760
lv	Core-fiction	65	65	291.9	3 207.6	4 032.0	639.2	494.8	11.339	1.758	0.756	3.605	1.343	2.563
	Core-nonfict	1	1	3.3	66.9	89.0	680.0	541.3	21.810	2.310	1.070	9.480	2.800	2.910
	Core-misc	7	7	30.0	362.1	440.5	688.1	543.4	12.147	1.759	0.776	4.397	1.668	2.337
	Acquis	1	18 348	1 486.3	17 519.4	23 361.6	490.0	340.4	13.790	2.296	0.831	7.109	2.492	2.865
	Bible	2	66	40.1	536.7	671.7	495.5	343.1	13.645	1.663	0.754	4.180	1.602	2.658
	Europarl	1	67 482	683.7	11 682.0	13 896.8	590.6	416.3	17.627	2.434	1.255	7.884	2.853	2.497
	Subtitles	1	387	488.4	2 050.4	2 801.9	592.2	425.9	4.227	1.269	0.264	1.568	0.592	1.811
mk	Core-fiction	104	104	694.6	8 794.5	10 571.7	464.3	–	–	–	–	–	–	–
	Core-misc	4	4	12.1	86.5	109.3	422.0	–	–	–	–	–	–	–
	Subtitles	1	3 433	3 201.0	15 112.0	20 217.5	412.3	–	–	–	–	–	–	–
ml	Subtitles	1	285	365.3	1 258.4	1 793.5	489.8	–	–	–	–	–	–	–
ms	Subtitles	1	1 496	1 712.1	7 828.0	10 573.3	371.2	–	–	–	–	–	–	–
mt	Acquis	1	8 963	784.8	13 805.0	16 643.6	373.4	1.0	20.381	2.683	1.141	11.437	3.347	2.933
nl	Core-fiction	194	194	1 152.0	17 229.8	19 889.7	466.9	403.0	15.424	2.149	0.959	5.255	1.558	3.176
	Core-nonfict	12	12	50.6	1 193.5	1 336.2	449.2	391.5	25.698	2.909	1.375	10.658	2.784	3.453
	Core-misc	9	9	27.2	356.4	413.4	463.6	395.9	13.450	1.993	0.860	5.102	1.550	2.981
	Acquis	1	18 975	1 483.9	23 401.1	28 140.1	356.2	317.3	18.005	2.217	0.766	9.491	2.375	3.553
	Bible	2	66	45.8	716.8	821.3	386.8	326.1	17.940	2.264	1.067	5.942	1.936	3.042
	Europarl	1	67 139	693.8	15 555.9	17 074.8	425.7	371.8	22.952	2.500	1.217	10.132	2.744	3.274
	PressEurop	7	7 009	175.4	2 952.8	3 337.6	483.7	429.2	17.267	2.172	0.967	7.879	2.300	3.107
	Subtitles	1	38 546	29 399.1	170 892.9	212 492.5	444.4	354.8	5.847	1.485	0.439	2.180	0.728	2.291
	Syndicate	5	841	37.6	812.1	897.2	477.2	422.9	22.671	2.570	1.225	10.005	2.810	3.282
no	Core-fiction	91	91	558.7	7 690.7	9 028.3	461.6	383.6	14.346	1.842	0.849	4.850	1.579	2.599
	Core-nonfict	5	5	17.0	392.0	439.5	467.7	381.2	24.705	2.716	1.366	11.077	3.100	2.842
	Core-misc	6	6	10.7	138.1	163.5	450.0	372.8	14.035	1.835	0.794	5.029	1.509	2.619
	Bible	2	66	55.3	723.9	831.4	364.6	294.7	13.099	1.573	0.620	4.645	1.713	2.447
	Subtitles	1	8 995	7 702.8	39 805.6	50 657.6	448.4	353.0	5.188	1.299	0.298	1.960	0.697	1.917
pl	Core-fiction	328	328	2 400.0	27 056.2	33 548.9	632.2	499.6	11.498	1.896	0.833	4.162	1.523	2.355
	Core-nonfict	11	11	36.6	754.2	897.7	613.7	460.1	20.825	2.743	1.407	9.385	3.192	2.509
	Core-misc	9	9	24.4	283.2	345.5	622.9	471.6	12.263	1.981	0.881	4.978	1.892	2.266
	Acquis	1	19 024	1 657.3	19 482.9	24 945.6	481.4	350.6	13.373	2.035	0.714	7.737	2.681	2.622
	Bible	2	66	48.2	576.1	712.9	537.0	387.8	12.695	1.724	0.727	4.479	1.725	2.397
	Europarl	1	67 443	713.3	12 662.8	14 667.8	607.5	447.2	18.340	2.643	1.309	9.387	3.283	2.322
	PressEurop	7	6 999	166.6	2 367.5	2 879.1	659.8	520.6	14.632	2.143	0.957	7.092	2.645	2.334
	Subtitles	1	46 175	36 236.0	164 059.8	222 210.4	602.1	441.5	4.556	1.324	0.319	1.855	0.717	1.832
pt	Core-fiction	82	82	519.8	7 204.0	8 608.5	511.2	408.1	14.436	2.299	1.142	6.372	2.041	2.497
	Core-misc	5	5	6.9	81.3	96.0	495.3	388.9	12.461	2.159	0.977	6.238	1.780	2.486
	Acquis	1	18 934	1 356.4	24 385.0	29 549.7	377.3	305.5	20.372	2.488	0.967	12.971	3.327	3.020
	Bible	2	66	54.3	706.2	840.4	380.3	293.5	19.149	2.385	1.111	7.620	2.305	2.957
	Europarl	1	65 920	648.7	15 188.4	17 127.0	467.5	379.1	24.202	3.093	1.726	13.821	3.724	2.591
	PressEurop	7	6 967	160.9	2 782.5	3 286.5	507.4	422.0	17.848	2.388	1.150	10.138	2.951	2.536
	Subtitles	1	54 342	43 730.9	229 480.2	294 774.7	495.5	360.1	5.278	1.449	0.432	2.528	0.955	1.939
	Syndicate	8	747	32.4	738.5	839.0	489.9	405.1	23.875	2.980	1.575	12.669	3.544	2.646
rn	Core-fiction	1	1	1.1	8.4	11.1	424.3	–	–	–	–	–	–	–
rn	Core-misc	1	1	0.7	5.2	6.6	416.4	–	–	–	–	–	–	–
ro	Core-fiction	44	44	233.3	4 132.6	4 833.2	534.2	406.3	18.106	2.262	1.146	6.360	2.019	2.604
	Core-misc	1	1	2.7	64.1	74.2	539.5	414.1	23.970	2.690	1.500	10.330	2.910	2.680
	Acquis	1	6 318	650.0	8 043.5	9 884.4	405.3	301.4	14.150	2.221	0.770	7.930	2.544	2.900
	Europarl	1	44 143	406.6	9 426.4	10 585.4	499.1	368.7	23.966	2.798	1.517	11.591	3.558	2.484
	PressEurop	7	6 991	160.6	2 725.2	3 192.6	546.7	429.5	17.486	2.219	1.017	8.508	2.772	2.492
	Subtitles	1	45 407	38 108.1	211 310.4	266 731.5	509.0	351.2	5.572	1.388	0.383	2.129	0.795	1.954
ru	Core-nonfict	10	10	30.6	518.7	625.2	645.0	495.9	17.765	2.613	1.223	8.126	2.801	2.603
	Core-fiction	144	144	1 043.5	11 757.6	14 913.7	633.0	501.9	11.643	1.959	0.865	4.203	1.557	2.386
	Core-misc	6	6	12.8	143.8	180.7	633.2	484.5	11.439	1.947	0.870	4.378	1.718	2.265
	Bible	2	66	39.0	565.5	703.9	486.6	346.2	20.730	2.746	1.302	6.198	2.121	2.828
	Subtitles	1	27 195	21 625.8	104 831.9	141 586.8	574.9	428.1	4.878	1.423	0.401	1.930	0.744	1.887
	Syndicate	21	5 418	233.5	4 312.8	5 110.5	637.5	487.3	19.037	2.653	1.288	9.232	3.298	2.424
si	Subtitles	1	499	522.5	2 313.4	3 021.8	443.6	–	–	–	–	–	–	–
sk	Core-fiction	142	142	706.0	7 626.6	9 513.5	617.0	480.8	10.845	1.562	0.612	3.503	1.284	2.620
	Core-nonfict	10	10	39.1	558.0	687.3	650.0	517.1	14.785	1.516	0.547	6.760	2.344	2.518
	Core-misc	13	13	32.4	402.2	496.9	652.5	515.7	12.636	1.564	0.555	5.338	1.707	2.493
	Acquis	1	18 302	1 363.0	18 398.8	23 542.1	482.7	353.1	15.458	1.732	0.516	8.677	2.746	3.029
	Bible	2	65	46.9	560.8	690.8	520.0	373.4	12.716	1.615	0.662	4.178	1.576	2.567
	Europarl	1	67 731	677.8	12 727.0	14 735.3	595.1	433.8	19.150	2.344	1.172	9.020	3.065	2.538
	Subtitles	1	8 322	7 214.8	34 589.4	46 215.1	575.9	411.5	4.821	1.293	0.295	1.835	0.674	1.975
sl	Core-fiction	71	71	370.5	4 611.2	5 686.2	556.5	428.7	12.704	2.096	0.857	4.122	1.374	2.641
	Core-nonfict	1	1	1.1	22.4	24.9	656.4	528.9	21.090	1.980	0.830	8.840	2.930	2.890
	Core-misc	1	1	0.7	6.1	7.4	682.1	585.6	8.950	1.720	0.650	4.410	1.720	2.210
	Acquis	1	17 414	1 399.2	18 510.4	24 069.9	466.2	335.6	15.345	1.810	0.580	8.359	2.683	2.841
	Europarl	1	65 366	649.6	12 249.8	14 263.6	564.3	405.6	19.433	2.551	1.254	9.220	3.066	2.539
	Subtitles	1	21 607	18 080.2	83 057.1	111 736.8	568.0	399.0	4.620	1.333	0.309	1.726	0.625	1.899
sq	Subtitles	1	1 575	1 769.0	9 171.4	12 098.4	395.5	–	–	–	–	–	–	–
sr	Core-fiction	143	143	931.6	12 556.0	15 029.8	584.7	462.0	13.767	1.956	0.898	4.690	1.601	2.638
	Core-nonfict	2	2	5.9	119.3	138.9	565.0	417.2	20.654	2.876	1.518	8.918	2.889	2.655
	Core-misc	3	3	5.0	29.3	38.9	538.0	411.7	5.882	1.394	0.371	2.405	0.906	2.215
	Subtitles	1	38 029	31 175.3	152 425.6	196 520.1	561.3	445.3	4.901	1.338	0.333	1.905	0.722	1.938
sv	Core-fiction	208	208	1 398.8	18 011.7	20 456.7	490.6	403.5	13.175	1.944	0.848	4.403	1.445	2.501
	Core-nonfict	16	16	64.9	1 273.0	1 403.1	508.2	415.4	19.801	2.541	1.288	7.980	2.435	2.683
	Core-misc	8	8	28.5	454.8	512.3	490.1	404.4	16.027	2.123	1.026	5.575	1.790	2.561
	Acquis	1	17 133	1 285.5	19 443.0	23 283.7	402.1	327.7	16.286	1.913	0.705	8.700	2.448	2.784
	Bible	2	66	43.9	637.9	731.7	414.2	323.2	14.907	1.947	0.895	4.760	1.703	2.542
	Europarl	1	67 898	720.6	13 777.6	15 146.8	461.9	374.1	19.313	2.381	1.183	8.221	2.554	2.640
	Subtitles	1	19 410	15 571.7	81 490.5	103 181.3	455.7	352.1	5.256	1.319	0.303	1.921	0.684	1.921
ta	Subtitles	1	20	29.4	104.0	141.8	511.8	434.1	3.562	1.196	0.171	1.673	0.639	1.807
te	Subtitles	1	18	26.0	96.0	127.1	496.5	1.0	3.806	1.324	0.284	1.746	0.658	2.086
th	Subtitles	1	3 932	3 457.0	5 626.0	7 288.3	658.1	–	–	–	–	–	–	–
tl	Subtitles	1	5	8.0	37.0	52.7	344.9	–	–	–	–	–	–	–
tr	Subtitles	1	44 015	35 975.7	147 635.3	199 108.2	670.1	424.8	4.133	1.259	0.257	1.929	0.853	1.815
uk	Core-fiction	192	192	1 260.0	14 478.3	18 490.6	626.6	506.4	11.923	2.047	0.892	4.187	1.507	2.377
	Core-nonfict	5	5	19.1	333.0	416.1	621.1	469.6	19.193	2.945	1.432	8.468	2.909	2.517
	Core-misc	2	2	4.0	38.9	50.3	614.9	484.8	9.801	1.851	0.774	3.366	1.282	2.254
	Bible	2	66	41.5	596.1	738.1	475.7	352.8	14.784	1.804	0.777	4.921	1.751	2.585
	Subtitles	1	1 006	813.4	3 779.0	5 123.2	571.4	461.9	4.684	1.360	0.334	1.853	0.710	1.897
ur	Subtitles	1	19	27.0	155.7	180.8	397.6	344.1	5.885	1.204	0.178	2.777	1.098	2.260
vi	Subtitles	1	3 468	3 304.5	19 281.4	23 984.0	446.3	403.8	5.931	1.508	0.458	2.351	0.945	1.849
zh	Core-fiction	3	3	11.7	215.4	253.9	382.0	376.8	18.467	4.655	1.684	4.099	1.594	3.435
	Subtitles	1	11 378	11 952.3	70 963.9	79 539.4	448.9	439.5	6.046	1.689	0.548	2.081	0.791	2.289
	Syndicate	5	654	29.7	675.9	766.7	493.8	489.5	23.166	4.110	1.795	7.026	2.391	3.366
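
When the table above is copied into a script, three layout conventions have to be handled: thousands are separated by spaces, unavailable values are shown as "–", and an empty first field means the row belongs to the language of the previous row. The following is a minimal Python sketch of such a reader, not an official CNC tool. The function names `parse_number` and `parse_rows` are hypothetical, and it assumes the rows are tab-separated as in the export shown here. Column meanings are deliberately left out; the values are returned in table order.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

MISSING = "–"  # dash used in the table for values that are not available


def parse_number(cell: str) -> Optional[float]:
    """Convert a cell such as '65 366' or '12 249.8' to a float (None if missing)."""
    cell = cell.strip()
    if cell == MISSING or cell == "":
        return None
    # Thousands are separated by spaces (possibly non-breaking ones).
    return float(cell.replace(" ", "").replace("\u00a0", ""))


def parse_rows(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], str, List[Optional[float]]]]:
    """Yield (language, collection, values) for each tab-separated row.

    An empty first field inherits the language code of the previous row.
    """
    language = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a data row
        if fields[0].strip():
            language = fields[0].strip()
        yield language, fields[1].strip(), [parse_number(c) for c in fields[2:]]


# Three rows taken from the table above.
sample = [
    "sq\tSubtitles\t1\t1 575\t1 769.0\t9 171.4\t12 098.4\t395.5\t–\t–\t–\t–\t–\t–\t–",
    "sr\tCore-fiction\t143\t143\t931.6\t12 556.0\t15 029.8\t584.7\t462.0\t13.767\t1.956\t0.898\t4.690\t1.601\t2.638",
    "\tCore-nonfict\t2\t2\t5.9\t119.3\t138.9\t565.0\t417.2\t20.654\t2.876\t1.518\t8.918\t2.889\t2.655",
]
rows = list(parse_rows(sample))
```

Returning `None` for the dash keeps missing metrics distinguishable from a genuine zero when the values are later averaged or compared.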

Metadata

Metadata such as a text's title, author, or source language are available for most texts as attributes of structural elements such as text or sentence. To list these attributes and choose which of them are displayed in KonText query results, select InterCorp 16ud and the relevant language, then open Corpus-specific settings in the View menu and go to Structures or References.

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

The 13th corrected issue of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the Czech Biblical Society, especially Petr Fryš.
Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
The short story collection My 1989 in a number of languages, from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Linguistic annotation

UDPipe (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

References – about UD-annotated InterCorp

Alexandr Rosen (2024): Lexical and syntactic variability of languages and text genres – a corpus-based study. Natural Language Processing Seminar organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, 14/10/2024. Recording, slides.

Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. Video, slides.

Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024. Slides.

Alexandr Rosen (2023): The InterCorp parallel corpus with a uniform annotation for all languages. Jazykovedný časopis, 74(1):254–265. Paper, slides.

How to cite

If you publish results based on InterCorp, we would appreciate a link to the project site www.intercorp.korpus.cz. In scientific publications, please cite the following paper:

Čermák, František & Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics 17(3). 411–427. (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, Alexandr, Bohumil Šimčík, Martin Vavřín & Adrian Jan Zasina. 2024. The InterCorp Corpus – Czech³⁾, version 16ud of 17 September 2024. Institute of the Czech National Corpus, Charles University, Prague 2024. Available on-line: https://kontext.korpus.cz/
