InterCorp Release 13ud – Universal Dependencies

Name		Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	141,032,521	116,673,043	394,042,551	1,550,071,364
	Number of word forms	113,838,505	89,819,773	327,968,369	1,223,270,610
Structural attributes	Number of documents	1,657	30	3,994	282
	Number of texts	1,657	111,951	3,994	1,843,528
	Number of sentences	9,782,002	13,606,198	24,318,736	143,196,252
Further information	reference	YES
	representative	NO
	publication date	2021
	foreign languages	40
	tagged languages	35
	lemmatized languages	35
	syntactically annotated languages	35

Access to the texts

After registration, the corpus can be searched through a web interface. Registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech; an English tutorial exists for one of the ICNC corpora, and an English summary is available for InterCorp.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact Alexandr Rosen if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus may grow, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the Universal Dependencies scheme.

Main differences between releases 13 and 13ud

In release 13ud, 36 of the 41 languages (including Czech) are linguistically annotated, and all 36 are also syntactically annotated.
Texts are annotated in the same way in all languages, according to the UD standard (Universal Dependencies).
For a detailed description of UD as used in the annotation of InterCorp see Universal Dependencies.
Annotation was performed for all languages by UDPipe, based on the data created in the UD project.¹⁾
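
UD annotation is commonly exchanged in the CoNLL-U format, where each token is a line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a minimal sketch of reading one such sentence. The sample sentence and its annotation are made up for illustration, and the sketch assumes CoNLL-U input rather than the attribute layout that KonText exposes.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int
    deprel: str


def parse_feats(field):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_conllu_sentence(block):
    """Parse the token lines of a single CoNLL-U sentence block."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        tokens.append(
            Token(cols[0], cols[1], cols[2], cols[3],
                  parse_feats(cols[5]), int(cols[6]), cols[7])
        )
    return tokens


# Hand-made sample (an assumption, not taken from InterCorp).
SAMPLE = "\n".join([
    "# text = Pes štěká.",
    "\t".join(["1", "Pes", "pes", "NOUN", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "štěká", "štěkat", "VERB", "_", "Number=Sing|Person=3", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

tokens = parse_conllu_sentence(SAMPLE)
root_forms = [t.form for t in tokens if t.head == 0]
```

Because the same ten columns are used for every language, the same reader works for any of the annotated languages.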

Texts in the corpus

InterCorp release 13ud contains the same texts as InterCorp release 13. They differ only in linguistic annotation. However, the token and word count data in release 13ud may differ slightly due to a different tokenization method.

The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The choice in the present release includes:

Political commentaries published by Project Syndicate and VoxEurop (formerly PressEurop)
A package of legal texts of the European Union from the Acquis Communautaire corpus
Proceedings of the European Parliament dated 2007–2011 from the Europarl corpus
Film subtitles from the Open Subtitles database
Translations of the Bible

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; in particular, texts that have no Czech counterpart are left out. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted, so they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13, published in November 2020, is 328 million words in the aligned foreign-language texts in the core part and 1,223 million words in the collections. The Czech texts comprise 114 million words in the core part and 90 million words in the collections (see Version history). The following charts show the share of the core and the collections in the corpus, with volumes in millions of words.

Setup of the parallel corpus – the core and collections

Setup of the parallel corpus – the core

Setup of the parallel corpus – collections
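
Because Czech is the pivot language, two foreign-language versions of a text are related only through their shared Czech version. The sketch below shows one way to derive, for example, English-German segment pairs by composing both alignments through Czech. The record layout, the segment identifiers and the function names are hypothetical and do not reflect the corpus's internal format.

```python
from collections import defaultdict

# Hypothetical alignment records: (Czech segment id, language, foreign segment id).
ALIGNMENTS = [
    ("cs:1", "en", "en:1"),
    ("cs:1", "de", "de:1"),
    ("cs:2", "en", "en:2"),
    ("cs:2", "en", "en:3"),  # one Czech segment aligned with two English ones
    ("cs:2", "de", "de:2"),
    ("cs:3", "de", "de:3"),  # no English counterpart for this segment
]


def index_by_pivot(alignments):
    """Group foreign segments by Czech segment and then by language."""
    index = defaultdict(lambda: defaultdict(list))
    for cs_id, lang, seg_id in alignments:
        index[cs_id][lang].append(seg_id)
    return index


def align_via_pivot(alignments, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech segment."""
    pairs = []
    for by_lang in index_by_pivot(alignments).values():
        if lang_a in by_lang and lang_b in by_lang:
            pairs.append((by_lang[lang_a], by_lang[lang_b]))
    return pairs


pairs = align_via_pivot(ALIGNMENTS, "en", "de")
```

Segments without a counterpart in both languages (here `cs:3`) simply drop out of the derived pairing, and any misalignment on either side carries over to the derived pair.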

Corpus size in thousands of words

Language		Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Bible	Total
ar	Arabic	34	0	0	0	0	0	0	34
be	Belarusian	5,718	0	0	0	0	0	0	5,718
bg	Bulgarian	7,068	0	0	13,577	9,083	0	0	29,728
ca	Catalan	7,938	0	0	0	0	0	736	8,674
da	Danish	7,136	0	0	20,313	13,916	14,429	657	56,451
de	German	37,633	4,704	2,483	20,610	13,088	8,392	724	87,634
el	Greek	0	0	0	23,853	15,404	23,709	0	62,966
en	English	33,569	4,856	2,670	22,902	15,576	52,106	730	132,409
es	Spanish	26,554	5,614	2,859	26,262	16,249	36,650	0	114,187
et	Estonian	0	0	0	14,896	10,899	10,298	0	36,093
fi	Finnish	5,656	0	0	15,269	10,108	15,047	543	46,622
fr	French	19,773	5,600	3,046	26,200	17,179	25,986	764	98,547
he	Hebrew	0	0	0	0	0	16,221	0	16,221
hi	Hindi	409	0	0	0	0	0	0	409
hr	Croatian	21,923	0	0	0	0	19,048	571	41,543
hu	Hungarian	6,444	0	0	17,852	12,198	21,115	0	57,609
is	Icelandic	0	0	0	0	0	1,581	0	1,581
it	Italian	14,525	1,252	2,747	23,771	15,494	14,700	684	73,174
ja	Japanese	2,189	0	0	0	0	477	0	2,666
lt	Lithuanian	421	0	0	17,316	11,213	558	471	29,979
lv	Latvian	2,646	0	0	17,522	11,682	280	537	32,667
mk	Macedonian	8,881	0	0	0	0	1,877	0	10,758
ms	Malay	0	0	0	0	0	3,521	0	3,521
mt	Maltese	0	0	0	13,935	0	0	0	13,935
nl	Dutch	16,216	813	2,953	23,416	15,558	29,373	717	89,045
no	Norwegian	7,727	0	0	0	0	0	722	8,449
pl	Polish	26,200	0	2,380	19,604	12,817	26,576	583	88,161
pt	Portuguese	4,981	554	2,782	24,598	15,193	41,468	706	90,282
rn	Romani	14	0	0	0	0	0	0	14
ro	Romanian	4,219	0	2,738	8,092	9,446	34,128	0	58,622
ru	Russian	8,642	3,984	0	0	0	6,887	565	20,078
sk	Slovak	8,543	0	0	18,399	12,727	5,133	561	45,363
sl	Slovene	3,871	0	0	18,528	12,251	17,061	0	51,711
sq	Albanian	0	0	0	0	0	2,003	0	2,003
sr	Serbian	11,582	0	0	0	0	20,727	0	32,308
sv	Swedish	15,790	0	0	19,542	13,784	14,666	638	64,419
tr	Turkish	0	0	0	0	0	21,190	0	21,190
uk	Ukrainian	11,459	0	0	0	0	244	596	12,299
vi	Vietnamese	0	0	0	0	0	1,474	0	1,474
zh	Chinese	127	240	0	0	0	2,247	0	2,614
Subtotal		327,887	27,616	24,658	406,459	263,864	489,169	11,504	1,551,157
cs	Czech	113,839	4,351	2,310	19,085	12,908	50,604	562	203,658
TOTAL		441,725	31,967	26,968	425,543	276,772	539,774	12,066	1,754,815

N.B. 1: Five languages have no linguistic annotation: Icelandic, Macedonian, Malay, Romani and Albanian (no UDPipe model is listed for them in footnote 1).

N.B. 2: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

Acknowledgements

We are grateful for the possibility to use the following texts and software:

Texts:

The latest (13th corrected) issue of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the Czech Biblical Society, especially Petr Fryš.
Fiction in many Slavic and some other languages from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
Political commentaries in a number of languages from the site Project Syndicate
Newspaper texts in a number of languages from the Presseurop/VoxEurop server
Legal texts in EU languages from the JRC-ACQUIS corpus
Proceedings of the European Parliament from the EuroParl corpus
Slovak-Czech concordances from the Slovak National Corpus
Short stories in a number of languages My 1989 from Goethe Institut
A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober
Film subtitles from the database Open Subtitles

Pre-processing

Parallel text editor InterText by Pavel Vondřička
Aligner Hunalign
Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit

Linguistic annotation

UDPipe (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)

How to cite

If you publish results based on InterCorp, we would appreciate a link to the project site www.intercorp.korpus.cz. In your scientific publications, please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).

For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Rosen, A., Vavřín, M., Zasina, A. J. (2022). The InterCorp Corpus – Czech²⁾, version 13ud of 22 December 2021. Institute of the Czech National Corpus, Charles University, Prague 2021. Available on-line: https://kontext.korpus.cz/

¹⁾

The tool uses all data for the given language, i.e. all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/IUDPipe. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.
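
The model identifiers above appear to follow the pattern language-treebank-ud-version-build. This reading is inferred from the names themselves, not from UDPipe documentation, and the field names in the sketch below are assumptions. Under that reading, a name can be split into its parts as follows:

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <language>-<treebank>-ud-<UD version>-<six-digit build stamp>.
MODEL_NAME = re.compile(
    r"^(?P<language>[a-z_]+)-(?P<treebank>[a-z_]+)"
    r"-ud-(?P<ud_version>\d+\.\d+)-(?P<build>\d{6})$"
)


class ModelInfo(NamedTuple):
    language: str
    treebank: str
    ud_version: str
    build: str  # kept as a string; its exact meaning is not stated in the source


def parse_model_name(name: str) -> Optional[ModelInfo]:
    """Split a model identifier into its parts, or return None if it does not fit."""
    match = MODEL_NAME.match(name.strip())
    if match is None:
        return None
    return ModelInfo(**match.groupdict())


czech = parse_model_name("czech-fictree-ud-2.6-200830")
norwegian = parse_model_name("norwegian-nynorsk-ud-2.6-200830")
```

Here `czech.treebank` is `fictree` and `norwegian.treebank` is `nynorsk`, which makes it easy to see which treebank each language's annotation was trained on.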

²⁾

Insert languages actually used.
