
InterCorp: Release 4

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the Institute of the Czech National Corpus (ICNC).

After registration, the corpus can be searched through a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

  • InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary is also available in English.
  • Unlike most other ICNC corpora, which are static (they do not change over time), InterCorp is incremental: both its size and the number of languages it covers keep growing.

Texts in the corpus

The bulk of InterCorp consists of semi-automatically aligned fiction in Czech and other languages, and of a selection of political commentaries published by Project Syndicate and Presseurop. The commentaries have been aligned automatically, so search results from them may include more misaligned segments.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of InterCorp as of September 2011 (InterCorp v. 4, see Version history) is 92 290 000 words in aligned foreign-language texts. This number already includes Project Syndicate (approximately 2.3 to 3 million words in cs, de, en, es, fr, ru) and Presseurop (approximately 0.8 million words in cs, de, en, es, fr, it, nl, pl, pt, ro). The corpus composition can be seen in the following chart, where the common label “beletrie” (“fiction”) denotes all manually aligned texts, mostly (but not exclusively) fiction. The bars show the size in millions of words.
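The pivot arrangement described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each foreign-language segment is linked to exactly one Czech segment. The segment identifiers, the `cs_links` structure and the `align_via_pivot` function are hypothetical and do not reflect InterCorp's actual alignment format, where one segment may be linked to several.

```python
from collections import defaultdict

# Hypothetical alignment data: each foreign-language segment id is mapped to
# the id of the Czech segment it is aligned with.
cs_links = {
    "de": {"de:1": "cs:1", "de:2": "cs:2"},
    "en": {"en:1": "cs:1", "en:2": "cs:2", "en:3": "cs:3"},
}


def align_via_pivot(links, lang_a, lang_b):
    """Pair segments of two foreign languages that share a Czech segment."""
    # Index the segments of lang_b by the Czech segment they are linked to.
    by_czech = defaultdict(list)
    for seg_b, cs_seg in links[lang_b].items():
        by_czech[cs_seg].append(seg_b)

    # Walk lang_a and collect every lang_b segment linked to the same Czech one.
    pairs = []
    for seg_a, cs_seg in links[lang_a].items():
        for seg_b in by_czech.get(cs_seg, []):
            pairs.append((seg_a, seg_b))
    return pairs


pairs = align_via_pivot(cs_links, "de", "en")
print(pairs)
```

In this toy data the English segment `en:3` has no German partner, because the Czech segment `cs:3` it is linked to has no German counterpart. Only material that exists in both foreign languages ends up in the derived pair.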

Composition of the parallel corpora

The following table presents the sizes of the parallel corpora for the individual languages. The numbers in each row show how many words in the given language (in thousands) are also available in the language indicated by the column header. For instance, the virtual Bulgarian-Croatian corpus contains 187 thousand words in Bulgarian (1st row - “bg”, 9th column - “hr”) and 189 thousand words in Croatian (9th row - “hr”, 1st column - “bg”). The second (highlighted) column, “cs”, shows the number of words aligned with Czech and thus also the overall size of the monolingual corpus of the language given in the corresponding row.

        bg    cs    da    de    en    es    fi    fr    hr    hu    it    lt    lv    nl    no    pl    pt    ro    ru    sl    sk    sr    sv
bg    1135  1135     0    82    74    82    74     0   187   141   156     0     0    74     0   156    74     0     0     0     0     0   156
cs    1139 46196   149 10544  6287 12177  1678  4075  6415  1162  3502   418  1128  4175  1815  6217  2109  1416  3563   893  7072  2521  4633
da       0   190   190    87   130     0     0     0    87     0     0    87     0     0   130   136     0     0   130    87     0    87    87
de      87 12167    83 12167  3802  4953   176  3717  1967   295  1654   259    22  1973  1020  1850   749   835  2934   428   431   552   989
en      80  7297   135  3821  7297  3761   438  3448   519   104  1053   381     2  1092   397  1449   876   954  2836   286     0   383   343
es      90 14237     0  5331  4141 14237   353  4072  2409   164  2924   169     0  2150   670  1834  1098  1128  2988    98   133   790  1375
fi      62  1435     0   128   332   325  1435   107   234    73    62    73     0   109   107   242    62    73    81    73     0    98   164
fr       0  5234     0  4228  3947  4207   155  5234   515     0  1181     0     0   948   155  1272   870   873  3003    68     0    78   414
hr     189  6735    76  1736   461  2175   280   409  6735    83  1491   324    43  1084   870  1160   447   277   232   352    54   927   997
hu     132  1123     0   256    81   135    81     0    79  1123     0    81     0    56   202   287     0    81   202   283   284   115     0
it     174  4028     0  1678  1059  2815    84  1064  1607     0  4028   162     0  1308   844  1214  1384   798    62    72     0   732   849
lt       0   358    58   185   259   115    71     0   253    71   113   358    16   196   173   297    43    71   101   129    13   171    58
lv       0  1075     0    18     2     0     0     0    39     0     0    18  1075     2     2    36     0     0     0    19   233     0     0
nl      80  5203     0  2202  1176  2273   149   968  1286    73  1433   281     3  5203   724  1632  1039  1047    64    78     0   482   574
no       0  2158   135   965   394   693   144   144   990   164   891   259     3   706  2158   597   524     0   407   255   263   759   678
pl     143  6173   111  1652  1256  1536   276  1052  1101   296  1063   346    37  1300   503  6173   829   900   237   283   178   220   553
pt      82  2503     0   853   931  1105    82   854   486     0  1454    66     0  1003   519  1002  2503   855    66     0     0   519   263
ro       0  1697     0   900   967  1107   106   817   327   106   814   106     0   968     0  1064   815  1697     0   106     0   578    85
ru       0  3619    99  2636  2581  2444    92  2382   215   197    50   123     0    52   387   230    52     0  3619   268   197    71   163
sl       0   992    81   407   257   106    91    60   377   308    78   172    21    78   297   317     0    91   297   992   237   243   189
sk       0  6961     0   361     0   104     0     0    50   290     0    15   245     0   276   175     0     0   200   220  6961    84   117
sr       0  2736    77   503   346   751   124    62   943   127   692   222     0   405   681   237   477   509    77   242   100  2736   271
sv     178  5234    83   954   339  1366   214   371  1091     0   859    83     0   518   610   645   227    87   187   196   129   256  5234
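
As an aid to reading the table, the following Python sketch parses a copy of it and looks up the Bulgarian-Croatian example from the text. It also adds up the “cs” column over all foreign languages for comparison with the total size quoted earlier. The names `TABLE`, `parse_table` and `pair_size` are illustrative only and are not part of any ICNC tool.

```python
# Copy of the table above: the first line lists the column languages, each
# following line gives a row language and its word counts in thousands.
TABLE = """\
bg cs da de en es fi fr hr hu it lt lv nl no pl pt ro ru sl sk sr sv
bg 1135 1135 0 82 74 82 74 0 187 141 156 0 0 74 0 156 74 0 0 0 0 0 156
cs 1139 46196 149 10544 6287 12177 1678 4075 6415 1162 3502 418 1128 4175 1815 6217 2109 1416 3563 893 7072 2521 4633
da 0 190 190 87 130 0 0 0 87 0 0 87 0 0 130 136 0 0 130 87 0 87 87
de 87 12167 83 12167 3802 4953 176 3717 1967 295 1654 259 22 1973 1020 1850 749 835 2934 428 431 552 989
en 80 7297 135 3821 7297 3761 438 3448 519 104 1053 381 2 1092 397 1449 876 954 2836 286 0 383 343
es 90 14237 0 5331 4141 14237 353 4072 2409 164 2924 169 0 2150 670 1834 1098 1128 2988 98 133 790 1375
fi 62 1435 0 128 332 325 1435 107 234 73 62 73 0 109 107 242 62 73 81 73 0 98 164
fr 0 5234 0 4228 3947 4207 155 5234 515 0 1181 0 0 948 155 1272 870 873 3003 68 0 78 414
hr 189 6735 76 1736 461 2175 280 409 6735 83 1491 324 43 1084 870 1160 447 277 232 352 54 927 997
hu 132 1123 0 256 81 135 81 0 79 1123 0 81 0 56 202 287 0 81 202 283 284 115 0
it 174 4028 0 1678 1059 2815 84 1064 1607 0 4028 162 0 1308 844 1214 1384 798 62 72 0 732 849
lt 0 358 58 185 259 115 71 0 253 71 113 358 16 196 173 297 43 71 101 129 13 171 58
lv 0 1075 0 18 2 0 0 0 39 0 0 18 1075 2 2 36 0 0 0 19 233 0 0
nl 80 5203 0 2202 1176 2273 149 968 1286 73 1433 281 3 5203 724 1632 1039 1047 64 78 0 482 574
no 0 2158 135 965 394 693 144 144 990 164 891 259 3 706 2158 597 524 0 407 255 263 759 678
pl 143 6173 111 1652 1256 1536 276 1052 1101 296 1063 346 37 1300 503 6173 829 900 237 283 178 220 553
pt 82 2503 0 853 931 1105 82 854 486 0 1454 66 0 1003 519 1002 2503 855 66 0 0 519 263
ro 0 1697 0 900 967 1107 106 817 327 106 814 106 0 968 0 1064 815 1697 0 106 0 578 85
ru 0 3619 99 2636 2581 2444 92 2382 215 197 50 123 0 52 387 230 52 0 3619 268 197 71 163
sl 0 992 81 407 257 106 91 60 377 308 78 172 21 78 297 317 0 91 297 992 237 243 189
sk 0 6961 0 361 0 104 0 0 50 290 0 15 245 0 276 175 0 0 200 220 6961 84 117
sr 0 2736 77 503 346 751 124 62 943 127 692 222 0 405 681 237 477 509 77 242 100 2736 271
sv 178 5234 83 954 339 1366 214 371 1091 0 859 83 0 518 610 645 227 87 187 196 129 256 5234
"""


def parse_table(text):
    """Return {row_language: {column_language: thousands_of_words}}."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]
    sizes = {}
    for label, *values in lines[1:]:
        if len(values) != len(columns):
            raise ValueError(f"row {label!r} has {len(values)} values, "
                             f"expected {len(columns)}")
        sizes[label] = dict(zip(columns, map(int, values)))
    return sizes


def pair_size(sizes, lang, other):
    """Thousands of words in `lang` that are also available in `other`."""
    return sizes[lang][other]


sizes = parse_table(TABLE)

# The example from the text: both sides of the virtual Bulgarian-Croatian corpus.
print("bg words aligned with hr:", pair_size(sizes, "bg", "hr"))
print("hr words aligned with bg:", pair_size(sizes, "hr", "bg"))

# Sum of the "cs" column over all foreign languages (the Czech row is skipped).
foreign_total = sum(row["cs"] for lang, row in sizes.items() if lang != "cs")
print("foreign-language words aligned with Czech:", foreign_total)
```

The two lookups give 187 and 189, the figures quoted in the text. The “cs” column over the foreign languages adds up to 92290 thousand words, which matches the total of 92 290 000 words given for InterCorp v. 4.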

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

See Park Manual for advice on the use of tags in queries.

Please note: work in progress

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing:

  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit
  • Aligner Hunalign

Taggers/lemmatizers:

Corpus Query Engine:

Data:

Last update: 5 October 2011
