
InterCorp: Release 4

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

InterCorp is accessible through Park, a purpose-built interface written by Michal Štourač, which uses the corpus manager Manatee by Pavel Rychlý; a brief Park user manual is available.
Non-parallel versions of all InterCorp bitexts are available through a web-based version of the Bonito interface, so that its search and processing features (filter, sort, collocations, frequency distribution, random sampling, etc.) can also be used with texts from the parallel corpus. Czech texts included in the parallel corpus can be used this way as well.
Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: its size and the number of languages keep growing.

Texts in the corpus

The bulk of InterCorp consists of fiction in Czech and other languages, semi-automatically aligned, and a selection of political commentaries published by Project Syndicate and Presseurop. The Project Syndicate and Presseurop texts have been aligned automatically, so search results from them may include more misaligned segments.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of InterCorp as of September 2011 (InterCorp v. 4, see Version history) is 92 290 000 words in aligned foreign-language texts. This number already includes Project Syndicate (approximately 2.3 to 3 million words in cs, de, en, es, fr, ru) and Presseurop (approximately 0.8 million words in cs, de, en, es, fr, it, nl, pl, pt, ro). The corpus composition can be seen in the following chart, where the common title “beletrie” (“fiction”) denotes all manually aligned texts, mostly (but not exclusively) fiction. The bars show the size in millions of words.

Composition of the parallel corpora

The following table presents the sizes of the parallel corpora for the individual languages. The numbers in each row show the number of words (in thousands) in the row's language that are also available in the language indicated by the column header. For instance, the virtual Bulgarian-Croatian corpus contains 187 thousand words in Bulgarian (1st row - “bg”, 9th column - “hr”) and 189 thousand words in Croatian (9th row - “hr”, 1st column - “bg”). The second (highlighted) column shows the number of words aligned with Czech and thus also the overall size of the monolingual corpus of the language given in the corresponding row.

lang	bg	cs	da	de	en	es	fi	fr	hr	hu	it	lt	lv	nl	no	pl	pt	ro	ru	sl	sk	sr	sv
bg	1135	1135	0	82	74	82	74	0	187	141	156	0	0	74	0	156	74	0	0	0	0	0	156
cs	1139	46196	149	10544	6287	12177	1678	4075	6415	1162	3502	418	1128	4175	1815	6217	2109	1416	3563	893	7072	2521	4633
da	0	190	190	87	130	0	0	0	87	0	0	87	0	0	130	136	0	0	130	87	0	87	87
de	87	12167	83	12167	3802	4953	176	3717	1967	295	1654	259	22	1973	1020	1850	749	835	2934	428	431	552	989
en	80	7297	135	3821	7297	3761	438	3448	519	104	1053	381	2	1092	397	1449	876	954	2836	286	0	383	343
es	90	14237	0	5331	4141	14237	353	4072	2409	164	2924	169	0	2150	670	1834	1098	1128	2988	98	133	790	1375
fi	62	1435	0	128	332	325	1435	107	234	73	62	73	0	109	107	242	62	73	81	73	0	98	164
fr	0	5234	0	4228	3947	4207	155	5234	515	0	1181	0	0	948	155	1272	870	873	3003	68	0	78	414
hr	189	6735	76	1736	461	2175	280	409	6735	83	1491	324	43	1084	870	1160	447	277	232	352	54	927	997
hu	132	1123	0	256	81	135	81	0	79	1123	0	81	0	56	202	287	0	81	202	283	284	115	0
it	174	4028	0	1678	1059	2815	84	1064	1607	0	4028	162	0	1308	844	1214	1384	798	62	72	0	732	849
lt	0	358	58	185	259	115	71	0	253	71	113	358	16	196	173	297	43	71	101	129	13	171	58
lv	0	1075	0	18	2	0	0	0	39	0	0	18	1075	2	2	36	0	0	0	19	233	0	0
nl	80	5203	0	2202	1176	2273	149	968	1286	73	1433	281	3	5203	724	1632	1039	1047	64	78	0	482	574
no	0	2158	135	965	394	693	144	144	990	164	891	259	3	706	2158	597	524	0	407	255	263	759	678
pl	143	6173	111	1652	1256	1536	276	1052	1101	296	1063	346	37	1300	503	6173	829	900	237	283	178	220	553
pt	82	2503	0	853	931	1105	82	854	486	0	1454	66	0	1003	519	1002	2503	855	66	0	0	519	263
ro	0	1697	0	900	967	1107	106	817	327	106	814	106	0	968	0	1064	815	1697	0	106	0	578	85
ru	0	3619	99	2636	2581	2444	92	2382	215	197	50	123	0	52	387	230	52	0	3619	268	197	71	163
sl	0	992	81	407	257	106	91	60	377	308	78	172	21	78	297	317	0	91	297	992	237	243	189
sk	0	6961	0	361	0	104	0	0	50	290	0	15	245	0	276	175	0	0	200	220	6961	84	117
sr	0	2736	77	503	346	751	124	62	943	127	692	222	0	405	681	237	477	509	77	242	100	2736	271
sv	178	5234	83	954	339	1366	214	371	1091	0	859	83	0	518	610	645	227	87	187	196	129	256	5234
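
As an illustration of how the table is read, the following is a minimal Python sketch that loads a tab-separated size matrix and looks up both directions of a language pair. It uses only a three-language excerpt (bg, cs, hr) with values copied from the table above; the names `EXCERPT_ROWS`, `load_sizes` and `pair_sizes` are hypothetical helpers introduced for this example and are not part of Park, Bonito or Manatee.

```python
import csv
import io

# Excerpt of the size table (thousands of words), limited to three languages.
# The values are copied from the full table; the first header cell is only a
# placeholder for the row-label column and is ignored when parsing.
EXCERPT_ROWS = [
    ["lang", "bg", "cs", "hr"],
    ["bg", "1135", "1135", "187"],
    ["cs", "1139", "46196", "6415"],
    ["hr", "189", "6735", "6735"],
]
EXCERPT_TEXT = "\n".join("\t".join(row) for row in EXCERPT_ROWS)


def load_sizes(text):
    """Parse a tab-separated matrix into {row_lang: {column_lang: thousands}}."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    columns = rows[0][1:]  # skip the row-label placeholder
    return {row[0]: dict(zip(columns, map(int, row[1:]))) for row in rows[1:]}


def pair_sizes(sizes, lang_a, lang_b):
    """Return (words in lang_a aligned with lang_b, words in lang_b aligned with lang_a)."""
    return sizes[lang_a][lang_b], sizes[lang_b][lang_a]


sizes = load_sizes(EXCERPT_TEXT)
bg_words, hr_words = pair_sizes(sizes, "bg", "hr")
print(f"bg-hr: {bg_words} thousand words in bg, {hr_words} thousand words in hr")
```

Run as written, the sketch reports 187 thousand Bulgarian words and 189 thousand Croatian words, matching the example in the text. The two figures differ because each cell counts words in the row's language only. In the excerpt, the “cs” column of a non-Czech row equals that row's diagonal cell (for example 6735 for “hr”), which reflects the statement that the Czech-aligned size is also the overall size of the monolingual corpus.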

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

See Park Manual for advice on the use of tags in queries.

Please note: work in progress

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing:

Sentence splitter for Czech by Pavel Květoň
Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
Sentence splitter Punkt for all other languages from Natural Language Toolkit
Aligner Hunalign

Taggers/lemmatizers:

Morče for Czech
TreeTagger for English, German, French, Italian, Dutch, Spanish, Bulgarian and Russian
Morfeusz and TaKIPI for Polish
HunPOS for Hungarian
Tagger for Slovak
Tagger for Lithuanian
Analyzer and tagger for Norwegian

Corpus Query Engine:

Manatee

Data:

Newspaper articles in a number of languages from the site Project Syndicate
Slovak-Czech concordances from the Slovak National Corpus
Short stories My 1989 in a number of languages from the Goethe-Institut
A number of texts in the Czech-Lithuanian section of the corpus from Patrick Corness
George Orwell's novel 1984 in a number of languages from the Multext-East corpus
Ukrainian and Polish texts from the PolUkr corpus (in prep.)
Texts in a number of languages from the ParaSol corpus (in prep.)
Newspaper texts from the Presseurop server
Legal texts in EU languages from the JRC-ACQUIS corpus (in prep.)
Norwegian texts from the publishers Aschehoug & co., Cappelen Forlag and Forlaget Oktober