InterCorp Release 16

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	154 512 254	363 685 460	464 653 933	5 840 602 221
Positions	Number of word forms	124 679 582	272 862 335	386 728 679	4 505 550 764
Structural attributes	Number of documents	1 812	33	4 643	338
	Number of texts	1 812	162 612	4 643	2 662 665
	Number of sentences	10 691 339	50 729 559	28 684 678	790 046 584
Further information	reference	YES
	representative	NO
	publication date	2023
	foreign languages	61
	tagged languages	27
	lemmatized languages	25

Access to the texts

After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus may grow, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6).

Texts in the corpus

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes:

Political commentaries published by Project Syndicate and VoxEurop (formerly PressEurop)
A package of legal texts of the European Union from the Acquis Communautaire corpus
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
Film subtitles from the Open Subtitles database
Translations of the Bible

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource: among the omitted texts are those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16 published in October 2023 is 387 mil. words in the aligned foreign language texts in the core part and 4 506 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
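The rounded sizes quoted above (125, 273, 387 and 4 506 mil. words) can be reproduced from the word-form counts in the overview table at the top of this page. The following is a minimal sketch in Python; the dictionary keys and the helper name in_millions are chosen here for illustration only and are not part of any InterCorp tool.

```python
# Word-form counts copied from the overview table of this page.
# The keys are illustrative labels, not official identifiers.
WORD_FORMS = {
    "czech_core": 124_679_582,
    "czech_collections": 272_862_335,
    "other_core": 386_728_679,
    "other_collections": 4_505_550_764,
}


def in_millions(count: int) -> int:
    """Round a raw word count to the nearest million."""
    return round(count / 1_000_000)


rounded = {name: in_millions(count) for name, count in WORD_FORMS.items()}

for name, value in rounded.items():
    print(f"{name}: {value} mil. words")
```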

Corpus size in thousands of words

Language		Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Bible	Total
af	Afrikaans	0	0	0	0	0	136	0	136
ar	Arabic	34	384	0	0	0	126 157	0	126 576
be	Belarusian	7 131	0	0	0	0	0	0	7 131
bg	Bulgarian	7 068	0	0	13 577	9 083	165 092	0	194 820
bn	Bengali	0	0	0	0	0	1 554	0	1 554
br	Breton	0	0	0	0	0	98	0	98
bs	Bosnian	0	0	0	0	0	58 758	0	58 758
ca	Catalan	10 112	0	0	0	0	2 735	736	13 582
cs	Czech	124 680	4 717	2 312	19 214	12 917	233 139	563	397 542
da	Danish	9 548	0	0	20 313	13 916	71 825	657	116 259
de	German	40 679	5 067	2 483	20 610	13 089	98 566	724	181 219
el	Greek	0	0	0	23 853	15 404	162 561	0	201 818
en	English	42 395	5 273	2 670	22 902	15 576	280 335	730	369 882
eo	Esperanto	0	0	0	0	0	226	0	226
es	Spanish	30 661	6 074	2 859	26 262	16 249	223 134	0	305 240
et	Estonian	79	0	0	14 896	10 899	54 514	0	80 388
eu	Basque	0	0	0	0	0	3 022	0	3 022
fa	Persian	0	0	0	0	0	33 167	0	33 167
fi	Finnish	6 959	0	0	15 269	10 108	90 471	543	123 349
fr	French	24 361	5 896	3 046	26 200	17 179	181 433	764	258 879
gl	Galician	0	0	0	0	0	623	0	623
he	Hebrew	0	0	0	0	0	130 143	0	130 143
hi	Hindi	409	0	0	0	0	432	0	841
hr	Croatian	24 529	0	0	0	0	137 966	571	163 066
hs	Upper Sorbian	466	0	0	0	0	0	0	466
hu	Hungarian	6 921	8	0	17 852	12 198	141 691	0	178 670
hy	Armenian	0	0	0	0	0	24	0	24
id	Indonesian	0	0	0	0	0	38 343	0	38 343
is	Icelandic	0	0	0	0	0	7 375	0	7 375
it	Italian	18 086	1 389	2 747	23 771	15 494	163 622	684	225 793
ja	Japanese	3 818	2	0	0	0	12 485	0	16 305
ka	Georgian	0	0	0	0	0	889	0	889
kk	Kazakh	0	0	0	0	0	14	0	14
ko	Korean	0	0	0	0	0	5 980	0	5 980
lt	Lithuanian	696	0	0	17 316	11 213	5 269	471	34 964
lv	Latvian	3 636	0	0	17 533	11 682	2 053	537	35 441
mk	Macedonian	8 881	0	0	0	0	15 595	0	24 476
ml	Malayalam	0	0	0	0	0	1 281	0	1 281
ms	Malay	0	0	0	0	0	7 939	0	7 939
mt	Maltese	0	0	0	13 935	0	0	0	13 935
nl	Dutch	18 782	812	2 953	23 416	15 558	170 979	717	233 217
no	Norwegian	8 221	0	0	0	0	39 807	724	48 752
pl	Polish	28 597	0	2 380	19 604	12 817	169 498	583	233 480
pt	Portuguese	7 285	739	2 782	24 598	15 193	229 515	706	280 818
rn	Romani	14	0	0	0	0	0	0	14
ro	Romanian	4 219	0	2 738	8 092	9 446	212 396	0	236 890
ru	Russian	12 387	4 302	0	0	0	104 609	565	121 864
si	Sinhala	0	0	0	0	0	2 346	0	2 346
sk	Slovak	8 586	0	0	18 399	12 727	34 581	561	74 854
sl	Slovene	4 636	0	0	18 515	12 241	83 000	0	118 392
sq	Albanian	0	0	0	0	0	9 351	0	9 351
sr	Serbian	12 706	0	0	0	0	152 636	0	165 342
sv	Swedish	19 740	0	0	19 542	13 784	81 548	638	135 252
ta	Tamil	0	0	0	0	0	104	0	104
te	Telugu	0	0	0	0	0	96	0	96
th	Thai	0	0	0	0	0	5 660	0	5 660
tl	Tagalog	0	0	0	0	0	38	0	38
tr	Turkish	0	0	0	0	0	149 892	0	149 892
uk	Ukrainian	14 849	0	0	0	0	2 938	596	18 382
ur	Urdu	0	0	0	0	0	158	0	158
vi	Vietnamese	0	0	0	0	0	22 298	0	22 298
zh	Chinese	238	838	0	0	0	71 331	0	72 407
TOTAL		511 408	35 503	26 971	425 670	276 772	4 001 428	12 069	5 289 821

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

Number of texts in the Core

Language		Number of texts	including originals
ar	Arabic	3	1
be	Belarusian	108	14
bg	Bulgarian	87	19
ca	Catalan	92	1
cs	Czech	1 812	368
da	Danish	93	9
de	German	471	163
en	English	422	271
es	Spanish	355	142
et	Estonian	1	0
fi	Finnish	112	36
fr	French	277	126
hi	Hindi	7	2
hr	Croatian	324	37
hs	Upper Sorbian	13	5
hu	Hungarian	89	1
it	Italian	171	26
ja	Japanese	35	15
lt	Lithuanian	23	4
lv	Latvian	73	15
mk	Macedonian	108	4
nl	Dutch	215	52
no	Norwegian	102	23
pl	Polish	348	54
pt	Portuguese	87	24
rn	Romani	2	2
ro	Romanian	45	5
ru	Russian	160	37
sk	Slovak	165	62
sl	Slovene	73	25
sr	Serbian	148	13
sv	Swedish	232	101
uk	Ukrainian	199	8
zh	Chinese	3	3
TOTAL		6 455	1 668

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation. The format and often even the meaning of the categories encoded in the morphosyntactic tags differ from language to language. For each tagged language we therefore provide a link to the tagset description. After selecting CQL as the query type, the tagset description is also available from the KonText search interface.

Language	Tags	Lemmas	Brief description	Detailed description	Tags in the corpus	Tool
Belarusian	✔	✔	in English****)	in English****)	list	UDPipe
Bulgarian	✔	✔	in English	in English	list	TreeTagger
Catalan	✔	✔	in English		list	TreeTagger
Chinese	✔		in English	in English	list	ZPar v0.7.5
Croatian	✔	✔	in English	in English	list	ReLDI Tagger
Czech	✔	✔	in Czech and English	in English	list	Morče
Dutch	✔	✔	in English		list	TreeTagger
English	✔	✔	in English	in English + additions	list	TreeTagger
Estonian	✔	✔	in Estonian and English		list	TreeTagger
Finnish	✔	✔	in English*)	in English*)	list	OMorFi +HunPOS
French	✔	✔	in English		list	TreeTagger
German	✔	✔	in English **)	in German	list	RFTagger
Hungarian	✔			in English	list	RFTagger
Icelandic	✔	✔	in English	in English	list	IceStagger
Italian	✔	✔	in English		list	TreeTagger
Japanese	✔	✔	in English		list	MeCab + Unidic
Latvian	✔	✔	in Latvian		list	LVTagger
Norwegian	✔	✔	in English****)	in English****)	list	UDPipe
Polish	✔	✔	in English and Polish	in English	list	Morfeusz, KRNNT
Portuguese	✔	✔	in Spanish		list	TreeTagger
Russian	✔	✔	in English	in English ***)	list	TreeTagger
Slovak	✔	✔	in Slovak and English	in Slovak	list	Radovan Garabík, Morče
Slovene	✔	✔		in English	list	ReLDI Tagger
Serbian	✔	✔	in English	in English	list	ReLDI Tagger
Spanish	✔	✔	in English		list	TreeTagger
Swedish	✔	✔	in Swedish and English		list	Stagger
Ukrainian	✔	✔	in English****)	in English****)	list	UDPipe

*) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].

**) Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.

***) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.

****) The tag is in the UD (Universal Dependencies) format; components of the tag are separated by a vertical bar (|), e.g. the form школы in the genitive singular is tagged as NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing. The query can be specified in the same way as for other languages, treating the tag as a string, i.e. [tag="NOUN.*Case=Gen\|Gender=Fem.*"], or the tag components can be specified separately: [tag="Case=Gen" & tag="NOUN" & tag="Gender=Fem"] (the order of categories is not significant). The result is identical in either case.
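The structure of such a UD tag and the component-wise query form can be illustrated with a short Python sketch. The function names parse_ud_tag and component_query are hypothetical helpers written for this page; they only assemble query strings in the form shown above and do not call KonText.

```python
def parse_ud_tag(tag):
    """Split a UD-style tag into its part of speech and a feature dictionary."""
    pos, *features = tag.split("|")
    return pos, dict(feature.split("=", 1) for feature in features)


def component_query(pos, **features):
    """Build a CQL query that tests each tag component separately.

    The output mirrors the second query form quoted in the note above;
    the order of the conditions is not significant.
    """
    conditions = [f'tag="{pos}"']
    conditions += [f'tag="{name}={value}"' for name, value in features.items()]
    return "[" + " & ".join(conditions) + "]"


pos, feats = parse_ud_tag("NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing")
query = component_query(pos, Case=feats["Case"], Gender=feats["Gender"])
print(query)
```

Listing the components separately avoids having to know the order in which the features appear inside the tag string.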

Tag formats specified in the tagset descriptions also differ from those actually used in the corpus in some other languages. If you are not sure, please check the tag format before making a tag query. You can list all tags used in the corpus for a given language (see the column Tags in the corpus in the table above). Alternatively, on a page displaying results, open the View/Corpus-specific settings… menu, check the tag option in the Positional attributes box and choose the for each token option in the Viewing options box.

Queries for contracted forms in tagged or lemmatized texts may fail. This concerns forms such as can't or I'm, which are split by the tagger into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. Only the individual parts of a contracted form are assigned a tag and a lemma, so a query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.

Morphological tags including characters with a special meaning in regular expressions, e.g. $ in the English tag wp$, must be preceded in queries by a backslash: tag="wp\$".
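When tag queries are generated by a script, the escaping can be automated. The sketch below uses Python's standard re.escape and assumes that the characters it escapes are treated the same way by the regular expressions in corpus queries, which this page does not confirm; the helper name tag_query is hypothetical. Double quotes inside a tag are not handled.

```python
import re


def tag_query(tag):
    """Return a CQL tag query with regex metacharacters in the tag escaped.

    Assumption: escaping by Python's re.escape is acceptable to the
    regular-expression dialect used in corpus queries.
    """
    return f'[tag="{re.escape(tag)}"]'


escaped = tag_query("wp$")
print(escaped)  # prints [tag="wp\$"]
```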

Structural attributes

Structure	Attribute	Description	Values
doc	doc.id	document identifier	author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP_year / _SUBTITLES / _SYNDICATE_year / _OT / _NT
text	text.id	text identifier	author's_last_name-shortened_title:0 / _ACQUIS:number / _EUROPARL:number / _PRESSEUROP:number / _SUBTITLES:number / _SYNDICATE_year:name / _OT:book / _NT:book
	text.author	author	last name, first name
	text.title	full title	text
	text.lang	language	ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
	text.version	version	number
	text.group	core/collection	Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible
	text.publisher	publisher	text
	text.pubplace	publication place	text
	text.pubDateYear	publication year	number
	text.pubDateMonth	publication month	number
	text.origyear	original creation year	number
	text.isbn	ISBN	number
	text.txtype	text type	discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious
	text.comment	comment	text
	text.original	original version?	Yes / No
	text.srclang	language of the original	ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
	text.translator	translator	last name, first name
	text.transsex	translator's gender	F / M
	text.authsex	author's gender	F / M
	text.transcomment	translation comment	text
	text.collectiontitle	collection title	text
	text.volume	volume number	number
	text.pages	number of pages	number
	text.lang_var	language variety	de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-RS
	text.wordcount	number of words	number
div	div.id	division identifier (Bible)	_NT / _OT:chapter
	div.type	division type	chapter
p	p.id	paragraph identifier	doc:text:div:par
s	s.id	sentence identifier	doc:text:div:par:sent
hi	hi.rend	typeface	italic / bold / bold italic
lb	lb.id	verse identifier (Bible)	book:chapter:verse
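The paragraph and sentence identifiers in the table above are composed of colon-separated parts (doc:text:div:par:sent). A minimal Python sketch for taking such an identifier apart follows. It assumes that none of the parts contains a colon itself; the example value Author-Title:0:1:12:3 is hypothetical and not taken from the corpus, and the names SentenceId, parse_sentence_id and paragraph_id are illustrative.

```python
from typing import NamedTuple


class SentenceId(NamedTuple):
    """Parts of an s.id value in the pattern doc:text:div:par:sent."""
    doc: str
    text: str
    div: str
    par: str
    sent: str


def parse_sentence_id(s_id):
    """Split a sentence identifier into its five colon-separated parts."""
    parts = s_id.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated parts, got {len(parts)}: {s_id!r}")
    return SentenceId(*parts)


def paragraph_id(s_id):
    """Derive the enclosing p.id (doc:text:div:par) from an s.id."""
    return ":".join(parse_sentence_id(s_id)[:4])


# Hypothetical identifier, used only to show the shape of the result.
example = parse_sentence_id("Author-Title:0:1:12:3")
print(example.doc, example.sent)
print(paragraph_id("Author-Title:0:1:12:3"))
```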

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

The latest (13th corrected) issue of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the Czech Biblical Society, especially Petr Fryš.
Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
Short stories My 1989 in a number of languages, from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Taggers/lemmatizers:

MorfFlex, Morče and LanGr for Czech
TreeTagger for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
Morfeusz and KRNNT for Polish
HunPOS for Hungarian and other languages
Tagger for Slovak (thanks to Radovan Garabík)
totale for Slovene (until Release 11, thanks to Tomaž Erjavec)
RFTagger for German
OMorFi+HunPOS for Finnish (thanks to Filip Ginter)
Stagger and IceStagger for Swedish and Icelandic (thanks to Robert Östling)
ReLDI Tagger for Croatian, Serbian and Slovene (thanks to Nikola Ljubešić)
LVTagger for Latvian (thanks to Pēteris Paikens and Michal Škrabal)
UDPipe for Belarusian and Ukrainian (thanks to Bohdan Moskalevskyi)
MeCab and Unidic for Japanese (thanks to Adam Nohejl)
ZPar for Chinese (thanks to Vlastimil Dobečka)

Known bugs

For some Finnish, Polish and Slovak texts in the Core part the value of the attribute doc.id is not displayed. This occurs when doc.id should be shown as metadata in references, structures and in document-based statistics. As a workaround, please use the text.id attribute instead. In the collections (Subtitles, Acquis, etc.) the doc.id attribute is shown as expected.

How to cite

If you publish results based on InterCorp we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). The InterCorp Corpus – Czech, version 16 of 11 November 2022. Institute of the Czech National Corpus, Charles University, Prague 2022. Available on-line: https://kontext.korpus.cz/
