Differences
This shows you the differences between two versions of the page.
en:cnk:intercorp:verze8 [2016/06/03 19:37] – created vaclavhorky | en:cnk:intercorp:verze8 [2018/07/30 15:12] (current) – [Access to the texts] vaclavcvrcek | ||
---|---|---|---|
Line 1: | Line 1: | ||
~~NOTOC~~ | ~~NOTOC~~ | ||
- | ====== InterCorp ====== | + | ====== InterCorp |
Line 25: | Line 25: | ||
After [[http:// | After [[http:// | ||
- | InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http:// | + | InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http:// |
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. | After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. | ||
Line 34: | Line 34: | ||
===== References ===== | ===== References ===== | ||
- | We would appreciate a link to the project site www.korpus.cz/ | + | If you publish results based on InterCorp we would appreciate a link to the project site [[http://www.korpus.cz/intercorp|www.korpus.cz/intercorp]]. In your scientific publications please cite the following paper: |
- | For more references see the [[https://biblio.korpus.cz/|repository of bibliographical items based on the CNC]]. All references to work using InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details. | + | <WRAP round info 50%> |
+ | Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. // | ||
+ | ([[http://ucnk.ff.cuni.cz/intercorp/? | ||
+ | For more references see the [[https:// | ||
+ | When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, | ||
+ | |||
+ | Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), | ||
+ | |||
+ | </ | ||
===== Texts in the corpus ===== | ===== Texts in the corpus ===== | ||
Line 60: | Line 68: | ||
- | ===== Corpus size in the number | + | ===== Corpus size in thousands |
^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ | ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ | ||
- | | ar | Arabic | 34,325 | 0 | 0 | 0 | 0 | 0 | 34,325 | | + | | ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34 | |
- | | be | Belarusian | 2,152,724 | 0 | 0 | 0 | 0 | 0 | 2,152,724 | | + | | be | Belarusian | 2 152 | 0 | 0 | 0 | 0 | 0 | 2 152 | |
- | | bg | Bulgarian | 5,240,831 | 0 | 0 | 13,816,405 | 9,083,403 | 0 | 28,140,639 | | + | | bg | Bulgarian | 5 240 | 0 | 0 | 13 816 | 9 083 | 0 | 28 140 | |
- | | ca | Catalan | 4,632,696 | 0 | 0 | 0 | 0 | 0 | 4,632,696 | | + | | ca | Catalan | 4 632 | 0 | 0 | 0 | 0 | 0 | 4 632 | |
- | | da | Danish | 3,016,838 | 0 | 0 | 21,679,997 | 13,915,841 | 14,429,778 | 53,042,454 | | + | | da | Danish | 3 016 | 0 | 0 | 21 679 | 13 915 | 14 429 | 53 042 | |
- | | de | German | 27,681,897 | 3,725,002 | 2,482,920 | 21,723,929 | 13,089,209 | 8,366,765 | 77,069,722 | | + | | de | German | 27 681 | 3 725 | 2 482 | 21 723 | 13 089 | 8 366 | 77 069 | |
- | | el | Greek | 0 | 0 | 0 | 25,069,611 | 15,403,662 | 23,714,597 | 64,187,870 | | + | | el | Greek | 0 | 0 | 0 | 25 069 | 15 403 | 23 714 | 64 187 | |
- | | en | English | 15,488,167 | 3,818,127 | 2,670,157 | 24,207,801 | 15,580,109 | 52,101,283 | 113,865,644 | | + | | en | English | 15 488 | 3 818 | 2 670 | 24 207 | 15 580 | 52 101 | 113 865 | |
- | | es | Spanish | 17,475,748 | 4,324,428 | 2,816,401 | 27,001,343 | 15,885,394 | 36,378,715 | 103,882,029 | | + | | es | Spanish | 17 475 | 4 324 | 2 816 | 27 001 | 15 885 | 36 378 | 103 882 | |
- | | et | Estonian | 0 | 0 | 0 | 15,962,544 | 10,899,550 | 10,296,031 | 37,158,125 | | + | | et | Estonian | 0 | 0 | 0 | 15 962 | 10 899 | 10 296 | 37 158 | |
- | | fi | Finnish | 3,426,226 | 0 | 0 | 16,455,144 | 10,175,256 | 15,097,653 | 45,154,279 | | + | | fi | Finnish | 3 426 | 0 | 0 | 16 455 | 10 175 | 15 097 | 45 154 | |
- | | fr | French | 9,170,042 | 4,393,051 | 2,928,227 | 27,351,591 | 17,178,444 | 25,961,848 | 86,983,203 | | + | | fr | French | 9 170 | 4 393 | 2 928 | 27 351 | 17 178 | 25 961 | 86 983 | |
- | | he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221,237 | 16,221,237 | | + | | he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 16 221 | |
- | | hi | Hindi | 408,616 | 0 | 0 | 0 | 0 | 0 | 408,616 | | + | | hi | Hindi | 408 | 0 | 0 | 0 | 0 | 0 | 408 | |
- | | hr | Croatian | 15,479,547 | 0 | 0 | 0 | 0 | 19,092,559 | 34,572,106 | | + | | hr | Croatian | 15 479 | 0 | 0 | 0 | 0 | 19 092 | 34 572 | |
- | | hu | Hungarian | 5,387,533 | 0 | 0 | 19,176,514 | 12,306,692 | 21,239,634 | 58,110,373 | | + | | hu | Hungarian | 5 387 | 0 | 0 | 19 176 | 12 306 | 21 239 | 58 110 | |
- | | is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1,584,758 | 1,584,758 | | + | | is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1 584 | 1 584 | |
- | | it | Italian | 7,247,545 | 651,502 | 2,707,648 | 24,849,477 | 15,489,468 | 14,653,613 | 65,599,253 | | + | | it | Italian | 7 247 | 651 | 2 707 | 24 849 | 15 489 | 14 653 | 65 599 | |
- | | ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113,32 | 113,32 | | + | | ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113 | |
- | | lt | Lithuanian | 358,253 | 0 | 0 | 18,392,644 | 11,212,864 | 557,961 | 30,521,722 | | + | | lt | Lithuanian | 358 | 0 | 0 | 18 392 | 11 212 | 557 | 30 521 | |
- | | lv | Latvian | 1,336,888 | 0 | 0 | 18,744,927 | 11,688,597 | 280,117 | 32,050,529 | | + | | lv | Latvian | 1 336 | 0 | 0 | 18 744 | 11 688 | 280 | 32 050 | |
- | | mk | Macedonian | 3,741,900 | 0 | 0 | 0 | 0 | 1,877,210 | 5,619,110 | | + | | mk | Macedonian | 3 741 | 0 | 0 | 0 | 0 | 1 877 | 5 619 | |
- | | ms | Malay | 0 | 0 | 0 | 0 | 0 | 3,520,701 | 3,520,701 | | + | | ms | Malay | 0 | 0 | 0 | 0 | 0 | 3 520 | 3 520 | |
- | | mt | Maltese | 0 | 0 | 0 | 14,133,133 | 0 | 0 | 14,133,133 | | + | | mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 0 | 14 133 | |
- | | nl | Dutch | 9,961,680 | 313,998 | 2,955,637 | 24,746,144 | 15,563,231 | 29,362,826 | 82,903,516 | | + | | nl | Dutch | 9 961 | 313 | 2 955 | 24 746 | 15 563 | 29 362 | 82 903 | |
- | | no | Norwegian | 4,815,797 | 0 | 0 | 0 | 0 | 0 | 4,815,797 | | + | | no | Norwegian | 4 815 | 0 | 0 | 0 | 0 | 0 | 4 815 | |
- | | pl | Polish | 17,516,332 | 0 | 2,378,025 | 20,627,627 | 12,811,143 | 26,572,483 | 79,905,610 | | + | | pl | Polish | 17 516 | 0 | 2 378 | 20 627 | 12 811 | 26 572 | 79 905 | |
- | | pt | Portuguese | 2,393,287 | 369,434 | 2,999,903 | 28,602,556 | 16,484,692 | 43,391,919 | 94,241,791 | | + | | pt | Portuguese | 2 393 | 369 | 2 999 | 28 602 | 16 484 | 43 391 | 94 241 | |
- | | ro | Romanian | 3,432,615 | 0 | 2,737,807 | 8,199,565 | 9,446,369 | 34,128,511 | 57,944,867 | | + | | ro | Romanian | 3 432 | 0 | 2 737 | 8 199 | 9 446 | 34 128 | 57 944 | |
- | | ru | Russian | 3,337,545 | 3,174,152 | 0 | 0 | 0 | 6,885,753 | 13,397,450 | | + | | ru | Russian | 3 337 | 3 174 | 0 | 0 | 0 | 6 885 | 13 397 | |
- | | sk | Slovak | 7,401,998 | 0 | 0 | 19,222,784 | 12,734,444 | 5,134,150 | 44,493,376 | | + | | sk | Slovak | 7 401 | 0 | 0 | 19 222 | 12 734 | 5 134 | 44 493 | |
- | | sl | Slovenian | 900,221 | 0 | 0 | 19,645,598 | 12,240,548 | 17,024,593 | 49,810,960 | | + | | sl | Slovenian | 900 | 0 | 0 | 19 645 | 12 240 | 17 024 | 49 810 | |
- | | sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2,003,579 | 2,003,579 | | + | | sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 2 003 | |
- | | sr | Serbian | 8,823,894 | 0 | 0 | 0 | 0 | 20,776,850 | 29,600,744 | | + | | sr | Serbian | 8 823 | 0 | 0 | 0 | 0 | 20 776 | 29 600 | |
- | | sv | Swedish | 8,138,161 | 0 | 0 | 20,585,800 | 13,840,373 | 14,693,861 | 57,258,195 | | + | | sv | Swedish | 8 138 | 0 | 0 | 20 585 | 13 840 | 14 693 | 57 258 | |
- | | tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21,190,828 | 21,190,828 | | + | | tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 21 190 | |
- | | uk | Ukrainian | 5,054,034 | 0 | 0 | 0 | 0 | 246,059 | 5,300,093 | | + | | uk | Ukrainian | 5 054 | 0 | 0 | 0 | 0 | 246 | 5 300 | |
- | | vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,473,591 | 1,473,591 | | + | | vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 473 | 1 473 | |
- | | **Subtotal** | | 194,055,340 | 20,769,694 | 24,676,725 | 430,195,134 | 265,029,289 | 488,372,783 | 1,423,098,965 | | + | | **Subtotal** | | 194 055 | 20 769 | 24 676 | 430 195 | 265 029 | 488 372 | 1 423 098 | |
- | | cs | Czech | 84,718,325 | 3,416,272 | 2,315,118 | 20,303,101 | 12,922,658 | 50,688,186 | 174,363,660 | | + | | cs | Czech | 84 718 | 3 416 | 2 315 | 20 303 | 12 922 | 50 688 | 174 363 | |
- | | **TOTAL** | | 278,773,665 | 24,185,966 | 26,991,843 | 450,498,235 | 277,951,947 | 539,060,969 | 1,597,462,625 | | + | | **TOTAL** | | 278 773 | 24 185 | 26 991 | 450 498 | 277 951 | 539 060 | 1 597 462 | |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | ||
Line 138: | Line 146: | ||
+ | ====Structural attributes==== | ||
+ | |||
+ | ^Structure^Attribute^Description^Values^ | ||
+ | |doc|doc.id|unique document identifier|text| | ||
+ | | |doc.lang|language|ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh| | ||
+ | | |doc.version|version|number| | ||
+ | | |doc.wordcount|document size in words|number| | ||
+ | |div|div.id|text identification|author' | ||
+ | | |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate| | ||
+ | | |div.wordcount|number of words|number| | ||
+ | | |div.author|author|last name, first name| | ||
+ | | |div.title|full title|text| | ||
+ | | |div.publisher|publisher|text| | ||
+ | | |div.pubplace|publication place|text| | ||
+ | | |div.pubyear|publication year|date| | ||
+ | | |div.txtype|text type|discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles| | ||
+ | | |div.original|is the text an original? | ||
+ | | |div.srclang|language of the original|ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh| | ||
+ | | |div.translator|translator|last name, first name| | ||
+ | | |div.transsex|translator' | ||
+ | | |div.authsex|author' | ||
+ | |p|p.id|unique paragraph identifier|text| | ||
+ | |s|s.id|unique sentence identifier|text| | ||
+ | |||
+ | |||
+ | ====Number of texts in the core of the corpus by languages of the text and languages of the original==== | ||
+ | |||
+ | ^ ^ Language of the original | ||
+ | ^ ↓ Language of the text ^ ar ^ be ^ bg ^ ca ^ cs ^ da ^ de ^ en ^ es ^ fi ^ fr ^ hi ^ hr ^ hu ^ it ^ lt ^ lv ^ mk ^ nl ^ no ^ pl ^ pt ^ ro ^ ru ^ sk ^ sl ^ sr ^ sv ^ uk ^ total ^ other ^ | ||
+ | ^ ar | 1 | | ||
+ | ^ be | | ||
+ | ^ bg | | ||
+ | ^ ca | | ||
+ | ^ cs | 1 | 3 | 19 | 1 | 267 | 9 | 134 | 242 | 127 | 24 | 95 | 2 | 26 | 1 | 20 | 1 | 7 | 1 | 30 | 7 | 49 | 21 | | ||
+ | ^ da | | ||
+ | ^ de | | ||
+ | ^ en | | ||
+ | ^ es | | ||
+ | ^ fi | | ||
+ | ^ fr | | ||
+ | ^ hi | | ||
+ | ^ hr | | ||
+ | ^ hu | | ||
+ | ^ it | | ||
+ | ^ lt | | ||
+ | ^ lv | | ||
+ | ^ mk | | ||
+ | ^ nl | | ||
+ | ^ no | | ||
+ | ^ pl | | ||
+ | ^ pt | | ||
+ | ^ ro | | ||
+ | ^ ru | | ||
+ | ^ sk | | ||
+ | ^ sl | | ||
+ | ^ sr | | ||
+ | ^ sv | | ||
+ | ^ uk | | ||
+ | ^ total | 2 | 6 | 39 | 3 | 810 | 19 | 349 | 950 | 335 | 57 | 241 | 4 | 56 | 2 | 89 | 5 | 18 | 3 | 84 | 22 | 128 | 72 | | ||
+ | |||
+ | * The table shows the number of texts in the core of InterCorp. | ||
+ | * For each language that has texts in the core, the row gives the number of texts by language of the original (shown in the column header). E.g. for Arabic, the core contains one text originally written in Arabic, one in Czech and one in German, i.e. a total of three texts in Arabic (see the penultimate column). | ||
+ | * Each column shows how many texts originally written in the language named in the column header have been translated into other languages; the codes of those languages are in the first column. The last column shows the number of texts whose originals are in other languages, i.e. languages that are not in the core of InterCorp. | ||
+ | * The diagonal gives the number of original texts in a given language. E.g. there are none in Hungarian or Romanian; for Romanian there is not even a translated one. | ||
Line 147: | Line 219: | ||
* Fiction in many Slavic and some other languages from [[http:// | * Fiction in many Slavic and some other languages from [[http:// | ||
- | * Political commentaries in a number of languages from the site [[http:// | + | * Political commentaries in a number of languages from the site [[http:// |
* Newspaper texts in a number of languages from the [[http:// | * Newspaper texts in a number of languages from the [[http:// | ||
* Legal texts in EU languages from the [[http:// | * Legal texts in EU languages from the [[http:// | ||
Line 182: | Line 254: | ||
- | |||
- | |||
- | ===== Citing InterCorp ===== | ||
- | |||
- | <WRAP round tip 70%> | ||
- | Rosen, A. – Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), | ||
- | |||
- | Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. // | ||
- | </ | ||
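The corpus size table in this diff changes from exact word counts to thousands of words. The page does not say how the new figures were derived, but comparing the two revisions (e.g. 5,240,831 becoming 5 240) suggests truncation rather than rounding. The following is a minimal sketch of that conversion under this assumption; the function name `to_thousands` is hypothetical and not part of any CNC tool.

```python
def to_thousands(words: int) -> str:
    """Truncate an exact word count to whole thousands.

    Assumption: values are truncated (floor division), not rounded,
    and digit groups are separated by spaces as in the newer revision.
    """
    thousands = words // 1000
    # Python's "," format groups digits by three; swap in spaces.
    return f"{thousands:,}".replace(",", " ")


# Exact counts copied from the older revision of the table.
examples = {
    "ar (Total)": 34_325,
    "bg (Core)": 5_240_831,
    "bg (Total)": 28_140_639,
    "TOTAL": 1_597_462_625,
}

for label, words in examples.items():
    print(f"{label}: {to_thousands(words)}")
```

Because each cell is truncated independently, the thousands in a row need not add up exactly to the truncated total in the last column.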