Group | Name | Czech – core | Czech – collections | other – core | other – collections
---|---|---|---|---|---
Positions | Number of tokens | 127,413,531 | 118,069,703 | 311,809,130 | 1,551,411,225
Positions | Number of word forms | 102,609,763 | 89,841,420 | 258,807,848 | 1,225,034,182
Structural attributes | Number of documents | 1,507 | 6 | 3,232 | 106
Structural attributes | Number of div | 1,507 | 111,672 | 3,232 | 1,841,341
Structural attributes | Number of sentences | 8,803,067 | 13,593,172 | 19,207,592 | 142,734,479
Further information | reference | YES | | | |
Further information | representative | NO | | | |
Further information | publication date | 2017 | | | |
Further information | foreign languages | 39 | | | |
Further information | tagged languages | 23 | | | |
Further information | lemmatized languages | 22 | | | |
After registration the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus KonText. A tutorial is available in Czech and a brief summary also in English.
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
A new release of InterCorp is usually published once a year. With each new release, the size of the corpus may grow, and possibly also the number of languages and the extent and quality of annotation. Previous versions remain available (starting with release 6).
If you publish results based on InterCorp we would appreciate a link to the project site www.korpus.cz/intercorp. In your scientific publications please cite the following paper:
Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics. Vol. 17, no. 3, p. 411–427 (bibtex, electronic edition at ingentaConnect, preprint version).
For more references see the repository of bibliographical items based on the CNC. All references to work based on InterCorp are welcome. See here for details.
When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
Rosen, A., Vavřín, M., Zasina, A. (2017). The InterCorp Corpus – English, German, version 10 of ?? September 2017. Institute of the Czech National Corpus, Charles University, Prague 2017. Available on-line: http://www.korpus.cz
The core of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called collections. The present release includes the following collections: Syndicate, Presseurop, Acquis, Europarl, Subtitles and Bible.
These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource; the omitted texts include those that have no Czech counterpart. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the Open Subtitles database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 10 from September 2017 is 258 million words in the aligned foreign-language texts in the core part and 1,225 million words in the collections (see Version history). The share of the core and the collections in the corpus can be seen in the following charts, which show the volumes in millions of words. The table below lists the sizes in thousands of words.
Code | Language | Core | Syndicate | Presseurop | Acquis | Europarl | Subtitles | Bible | Total |
---|---|---|---|---|---|---|---|---|---|
ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
be | Belarusian | 3,967 | 0 | 0 | 0 | 0 | 0 | 0 | 3,967 |
bg | Bulgarian | 6,465 | 0 | 0 | 13,572 | 9,067 | 0 | 0 | 29,103 |
ca | Catalan | 4,645 | 0 | 0 | 0 | 0 | 0 | 736 | 5,381 |
da | Danish | 4,548 | 0 | 0 | 20,313 | 13,916 | 14,430 | 657 | 53,581 |
de | German | 33,053 | 4,457 | 2,483 | 20,610 | 13,089 | 8,393 | 724 | 82,809 |
el | Greek | 0 | 0 | 0 | 23,854 | 15,404 | 23,715 | 0 | 62,972 |
en | English | 24,567 | 4,604 | 2,670 | 22,902 | 15,576 | 52,123 | 730 | 123,172 |
es | Spanish | 21,036 | 5,322 | 2,859 | 26,262 | 16,249 | 36,650 | 0 | 108,377 |
et | Estonian | 0 | 0 | 0 | 14,896 | 10,899 | 10,298 | 0 | 36,093 |
fi | Finnish | 4,074 | 0 | 0 | 15,489 | 10,175 | 15,098 | 544 | 45,380 |
fr | French | 15,073 | 5,391 | 3,046 | 26,200 | 17,179 | 25,991 | 764 | 93,644 |
he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221 | 0 | 16,221 |
hi | Hindi | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 409 |
hr | Croatian | 20,146 | 0 | 0 | 0 | 0 | 19,049 | 571 | 39,767 |
hu | Hungarian | 5,626 | 0 | 0 | 17,853 | 12,198 | 21,115 | 0 | 56,791 |
is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1,585 | 0 | 1,585 |
it | Italian | 10,784 | 1,141 | 2,747 | 23,771 | 15,494 | 14,701 | 684 | 69,321 |
ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 0 | 113 |
lt | Lithuanian | 358 | 0 | 0 | 17,316 | 11,213 | 558 | 471 | 29,916 |
lv | Latvian | 2,025 | 0 | 0 | 17,533 | 11,682 | 280 | 0 | 31,521 |
mk | Macedonian | 5,939 | 0 | 0 | 0 | 0 | 1,877 | 0 | 7,816 |
ms | Malay | 0 | 0 | 0 | 0 | 0 | 3,521 | 0 | 3,521 |
mt | Maltese | 0 | 0 | 0 | 13,953 | 0 | 0 | 0 | 13,953 |
nl | Dutch | 13,454 | 711 | 2,953 | 23,416 | 15,558 | 29,373 | 717 | 86,181 |
no | Norwegian | 5,305 | 0 | 0 | 0 | 0 | 0 | 722 | 6,026 |
pl | Polish | 23,238 | 0 | 2,378 | 19,594 | 12,811 | 26,572 | 583 | 85,176 |
pt | Portuguese | 3,473 | 520 | 3,000 | 27,301 | 16,485 | 43,392 | 760 | 94,930 |
rn | Romani | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
ro | Romanian | 3,888 | 0 | 2,738 | 8,092 | 9,446 | 34,129 | 0 | 58,293 |
ru | Russian | 5,978 | 3,767 | 0 | 0 | 0 | 6,887 | 565 | 17,197 |
sk | Slovak | 8,545 | 0 | 0 | 18,400 | 12,734 | 5,134 | 561 | 45,375 |
sl | Slovenian | 2,952 | 0 | 0 | 18,485 | 12,241 | 17,025 | 0 | 50,702 |
sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2,004 | 0 | 2,004 |
sr | Serbian | 10,207 | 0 | 0 | 0 | 0 | 20,728 | 0 | 30,934 |
sv | Swedish | 10,269 | 0 | 0 | 19,609 | 13,840 | 14,694 | 638 | 59,051 |
tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21,191 | 0 | 21,191 |
uk | Ukrainian | 8,736 | 0 | 0 | 0 | 0 | 246 | 600 | 9,583 |
vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,474 | 0 | 1,474 |
Subtotal | foreign languages | 258,808 | 25,913 | 24,874 | 409,403 | 265,255 | 488,562 | 11,027 | 1,483,842 |
cs | Czech | 102,610 | 4,131 | 2,315 | 19,218 | 12,923 | 50,688 | 566 | 192,451 |
TOTAL | all languages | 361,418 | 30,044 | 27,189 | 428,621 | 278,178 | 539,250 | 11,593 | 1,676,293 |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
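As a quick illustration of how the table reads, the sketch below recomputes the Total cell of the English row from its per-part sizes (in thousands of words). The figures are copied from the table; the names `ENGLISH_ROW` and `row_total` are made up for this example. Because the cells are rounded to thousands, some other rows may differ from their listed total by one unit.

```python
# Sizes of the English part in thousands of words, copied from the table above.
ENGLISH_ROW = {
    "Core": 24567,
    "Syndicate": 4604,
    "Presseurop": 2670,
    "Acquis": 22902,
    "Europarl": 15576,
    "Subtitles": 52123,
    "Bible": 730,
}


def row_total(row):
    """Sum the core and collection sizes of one language row."""
    return sum(row.values())


english_total = row_total(ENGLISH_ROW)
print(english_total)  # 123172, the value listed in the Total column
```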
Texts in the following languages have received some morphosyntactic annotation.
*) The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
**) Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.
***) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.
In some other languages, too, the tag formats specified in the tagset descriptions differ from those actually used in the corpus. If you are not sure, please check the tag format before making a tag query. On a page displaying results, open the View/Corpus-specific settings… menu, tick the tag option in the Positional attributes box, and choose the for each token option in the Viewing options box.
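The correspondence between a condensed tag and its expanded form, as described in note *) above, can be sketched as follows. This is only an illustration: the function `expand_tag` and the list `VERB_CATEGORIES` are hypothetical, and the category names must be supplied by the caller because they differ between parts of speech (compare the verb and pronoun examples in the note).

```python
def expand_tag(tag, categories):
    """Pair each colon-separated value of a condensed tag with a category name.

    `categories` is an assumption supplied by the caller; the corpus itself
    stores only the condensed values.
    """
    values = tag.split(":")
    if len(values) != len(categories):
        raise ValueError("tag has %d values, expected %d" % (len(values), len(categories)))
    return " ".join(
        "[%s=%s]" % (name, value.upper()) for name, value in zip(categories, values)
    )


# Category order taken from the verb example in note *).
VERB_CATEGORIES = ["POS", "NUM", "CASE", "VOICE", "PCP", "CMP"]

expanded = expand_tag("V:Sg:Nom:Act:PrfPrc:Pos", VERB_CATEGORIES)
print(expanded)
```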
Queries for contracted forms in tagged or lemmatized texts may fail. This applies to forms such as can't or I'm, which the tagger splits into two parts (ca+n't and I+'m), each with its own lemma and tag. The same holds for Polish forms such as byłam or gdybyś (była+m and gdyby+ś). Tokenization may even introduce errors, as in gdzie ś za Wisłą, where gdzieś is not a contraction. Because only the individual parts of a contracted form are assigned a tag and a lemma, a query intended to find the whole contracted form should be typed in as a Phrase, with the split parts separated by a space.
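A minimal sketch of the workaround for contracted forms: the split parts are joined with a space to give the text to type in as a Phrase. The second helper shows what an equivalent token-by-token CQL expression might look like; that form is an assumption based on common CQL usage (one `[word="..."]` per token), not something stated on this page. Both function names are hypothetical, and no escaping of special characters is attempted.

```python
def phrase_query(parts):
    """Join the tokenizer's split parts with spaces, for use as a Phrase query."""
    return " ".join(parts)


def cql_sequence(parts, attr="word"):
    """Assumed CQL equivalent: one token expression per split part."""
    return " ".join('[%s="%s"]' % (attr, part) for part in parts)


english = phrase_query(["ca", "n't"])     # the parts of can't
polish = phrase_query(["była", "m"])      # the parts of byłam
print(english)
print(polish)
print(cql_sequence(["ca", "n't"]))
```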
Morphological tags including characters with a special meaning in regular expressions, e.g. "$" in the English tag "wp$", must be preceded in queries by a backslash: tag="wp\$".
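When tag queries are generated by a script, the backslash can be added automatically. The sketch below uses Python's `re.escape`, which escapes characters that are special in Python's own regular expression syntax; it is an assumption that this matches the set of characters the corpus query engine treats as special, so the result should be checked for tags containing other punctuation. The helper name `tag_query` is hypothetical.

```python
import re


def tag_query(tag):
    """Build a tag condition with regex-special characters backslash-escaped."""
    return 'tag="%s"' % re.escape(tag)


escaped = tag_query("wp$")
print(escaped)  # tag="wp\$"
```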
Structure | Attribute | Description | Values
---|---|---|---
doc | doc.id | unique document identifier | text
doc | doc.lang | language | ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / rn / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh
doc | doc.version | version | number
doc | doc.wordcount | document size in words | number
div | div.id | text identification | author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE / _BIBLE
div | div.group | division in | Core / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible
div | div.wordcount | number of words | number
div | div.author | author | last name, first name
div | div.title | full title | text
div | div.publisher | publisher | text
div | div.pubplace | publication place | text
div | div.pubyear | publication year | date
div | div.txtype | text type | discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious
div | div.original | is the text an original? | Yes / No
div | div.srclang | language of the original | ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh
div | div.translator | translator | last name, first name
div | div.transsex | translator's gender | F / M
div | div.authsex | author's gender | F / M
p | p.id | unique paragraph identifier | text
s | s.id | unique sentence identifier | text
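The structural attributes listed above can be used to restrict a query to part of the corpus, for instance to the manually aligned core. The sketch below builds such a restriction as a string. It assumes the query engine accepts the common CQL form `within <structure attribute="value" />`; this page does not confirm that syntax, so it should be checked against the KonText documentation. The function name `restrict_to_div` and the sample lemma are made up for the example.

```python
def restrict_to_div(query, attribute, value):
    """Append an assumed CQL 'within' clause on one attribute of the div structure."""
    return '%s within <div %s="%s" />' % (query, attribute, value)


# div.group = Core selects texts from the core part (see the table above).
core_only = restrict_to_div('[lemma="house"]', "group", "Core")
print(core_only)
```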
We are grateful for the possibility to use the following texts and software: