Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze8 [2016/06/03 19:37] – created vaclavhorky
en:cnk:intercorp:verze8 [2018/07/30 15:12] (current) – [Access to the texts] vaclavcvrcek
Line 1: Line 1:
 ~~NOTOC~~
-====== InterCorp ======
+====== InterCorp Release 8 ======
  
  
Line 25: Line 25:
 After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
  
-InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial in Czech is available [[kurz:uvod|here]].
+InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]], with [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary in English]].
  
 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.
Line 34: Line 34:
 ===== References =====
  
-We would appreciate a link to the project site www.korpus.cz/intercorp in results of your work based on InterCorp. You might also consider adding the following reference in your scientific publications: Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427 (bibtex((''@article{cermak:rosen:10, Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen}, Issn = {1384-6655}, Journal = {International Journal of Corpus Linguistics}, Number = {3}, Pages = {411--427}, Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus}, Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}, Volume = {13}, Year = {2012}}'')), [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).
+If you publish results based on InterCorp, we would appreciate a link to the project site [[http://www.korpus.cz/intercorp|www.korpus.cz/intercorp]]. In your scientific publications please cite the following paper:
  
-For more references see the [[https://biblio.korpus.cz/|repository of bibliographical items based on the CNC]]. All references to work using InterCorp is welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.
+<WRAP round info 50%>
 +Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427
 +([[http://ucnk.ff.cuni.cz/intercorp/?req=page:references_bibtex&lang=cs|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]])
  
 +For more references see the [[https://biblio.korpus.cz/|repository of bibliographical items based on the CNC]]. All references to work using InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.
  
 +When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
 +
 +Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), version 7 from 19 Dec 2014//. Institute of the Czech National Corpus, Charles University, Prague 2014. Available on-line: http://www.korpus.cz
 +
 +</WRAP>
 ===== Texts in the corpus =====
  
Line 60: Line 68:
  
  
-===== Corpus size in the number of words =====
+===== Corpus size in thousands of words =====

 ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^
-| ar | Arabic | 34,325 | 0 | 0 | 0 | 0 | 0 | 34,325 |
+| ar  | Arabic | 34 |  0 |  0 |  0 |  0 |  0 |  34 |
-| be | Belarusian | 2,152,724 | 0 | 0 | 0 | 0 | 0 | 2,152,724 |
+| be  | Belarusian | 2 152 |  0 |  0 |  0 |  0 |  0 |  2 152 |
-| bg | Bulgarian | 5,240,831 | 0 | 0 | 13,816,405 | 9,083,403 | 0 | 28,140,639 |
+| bg  | Bulgarian | 5 240 |  0 |  0 |  13 816 |  9 083 |  0 |  28 140 |
-| ca | Catalan | 4,632,696 | 0 | 0 | 0 | 0 | 0 | 4,632,696 |
+| ca  | Catalan |  4 632 |  0 |  0 |  0 |  0 |  0 |  4 632 |
-| da | Danish | 3,016,838 | 0 | 0 | 21,679,997 | 13,915,841 | 14,429,778 | 53,042,454 |
+| da  | Danish |  3 016 |  0 |  0 |  21 679 |  13 915 |  14 429 |  53 042 |
-| de | German | 27,681,897 | 3,725,002 | 2,482,920 | 21,723,929 | 13,089,209 | 8,366,765 | 77,069,722 |
+| de  | German |  27 681 |  3 725 |  2 482 |  21 723 |  13 089 |  8 366 |  77 069 |
-| el | Greek | 0 | 0 | 0 | 25,069,611 | 15,403,662 | 23,714,597 | 64,187,870 |
+| el  | Greek |  0 |  0 |  0 |  25 069 |  15 403 |  23 714 |  64 187 |
-| en | English | 15,488,167 | 3,818,127 | 2,670,157 | 24,207,801 | 15,580,109 | 52,101,283 | 113,865,644 |
+| en  | English |  15 488 |  3 818 |  2 670 |  24 207 |  15 580 |  52 101 |  113 865 |
-| es | Spanish | 17,475,748 | 4,324,428 | 2,816,401 | 27,001,343 | 15,885,394 | 36,378,715 | 103,882,029 |
+| es  | Spanish |  17 475 |  4 324 |  2 816 |  27 001 |  15 885 |  36 378 |  103 882 |
-| et | Estonian | 0 | 0 | 0 | 15,962,544 | 10,899,550 | 10,296,031 | 37,158,125 |
+| et  | Estonian |  0 |  0 |  0 |  15 962 |  10 899 |  10 296 |  37 158 |
-| fi | Finnish | 3,426,226 | 0 | 0 | 16,455,144 | 10,175,256 | 15,097,653 | 45,154,279 |
+| fi  | Finnish |  3 426 |  0 |  0 |  16 455 |  10 175 |  15 097 |  45 154 |
-| fr | French | 9,170,042 | 4,393,051 | 2,928,227 | 27,351,591 | 17,178,444 | 25,961,848 | 86,983,203 |
+| fr  | French |  9 170 |  4 393 |  2 928 |  27 351 |  17 178 |  25 961 |  86 983 |
-| he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221,237 | 16,221,237 |
+| he  | Hebrew |  0 |  0 |  0 |  0 |  0 |  16 221 |  16 221 |
-| hi | Hindi | 408,616 | 0 | 0 | 0 | 0 | 0 | 408,616 |
+| hi  | Hindi |  408 |  0 |  0 |  0 |  0 |  0 |  408 |
-| hr | Croatian | 15,479,547 | 0 | 0 | 0 | 0 | 19,092,559 | 34,572,106 |
+| hr  | Croatian |  15 479 |  0 |  0 |  0 |  0 |  19 092 |  34 572 |
-| hu | Hungarian | 5,387,533 | 0 | 0 | 19,176,514 | 12,306,692 | 21,239,634 | 58,110,373 |
+| hu  | Hungarian |  5 387 |  0 |  0 |  19 176 |  12 306 |  21 239 |  58 110 |
-| is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1,584,758 | 1,584,758 |
+| is  | Icelandic |  0 |  0 |  0 |  0 |  0 |  1 584 |  1 584 |
-| it | Italian | 7,247,545 | 651,502 | 2,707,648 | 24,849,477 | 15,489,468 | 14,653,613 | 65,599,253 |
+| it  | Italian |  7 247 |  651 |  2 707 |  24 849 |  15 489 |  14 653 |  65 599 |
-| ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113,32 | 113,32 |
+| ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 |
-| lt | Lithuanian | 358,253 | 0 | 0 | 18,392,644 | 11,212,864 | 557,961 | 30,521,722 |
+| lt  | Lithuanian |  358 |  0 |  0 |  18 392 |  11 212 |  557 |  30 521 |
-| lv | Latvian | 1,336,888 | 0 | 0 | 18,744,927 | 11,688,597 | 280,117 | 32,050,529 |
+| lv  | Latvian |  1 336 |  0 |  0 |  18 744 |  11 688 |  280 |  32 050 |
-| mk | Macedonian | 3,741,900 | 0 | 0 | 0 | 0 | 1,877,210 | 5,619,110 |
+| mk  | Macedonian |  3 741 |  0 |  0 |  0 |  0 |  1 877 |  5 619 |
-| ms | Malay | 0 | 0 | 0 | 0 | 0 | 3,520,701 | 3,520,701 |
+| ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3 520 |  3 520 |
-| mt | Maltese | 0 | 0 | 0 | 14,133,133 | 0 | 0 | 14,133,133 |
+| mt  | Maltese |  0 |  0 |  0 |  14 133 |  0 |  0 |  14 133 |
-| nl | Dutch | 9,961,680 | 313,998 | 2,955,637 | 24,746,144 | 15,563,231 | 29,362,826 | 82,903,516 |
+| nl  | Dutch |  9 961 |  313 |  2 955 |  24 746 |  15 563 |  29 362 |  82 903 |
-| no | Norwegian | 4,815,797 | 0 | 0 | 0 | 0 | 0 | 4,815,797 |
+| no  | Norwegian |  4 815 |  0 |  0 |  0 |  0 |  0 |  4 815 |
-| pl | Polish | 17,516,332 | 0 | 2,378,025 | 20,627,627 | 12,811,143 | 26,572,483 | 79,905,610 |
+| pl  | Polish |  17 516 |  0 |  2 378 |  20 627 |  12 811 |  26 572 |  79 905 |
-| pt | Portuguese | 2,393,287 | 369,434 | 2,999,903 | 28,602,556 | 16,484,692 | 43,391,919 | 94,241,791 |
+| pt  | Portuguese |  2 393 |  369 |  2 999 |  28 602 |  16 484 |  43 391 |  94 241 |
-| ro | Romanian | 3,432,615 | 0 | 2,737,807 | 8,199,565 | 9,446,369 | 34,128,511 | 57,944,867 |
+| ro  | Romanian |  3 432 |  0 |  2 737 |  8 199 |  9 446 |  34 128 |  57 944 |
-| ru | Russian | 3,337,545 | 3,174,152 | 0 | 0 | 0 | 6,885,753 | 13,397,450 |
+| ru  | Russian |  3 337 |  3 174 |  0 |  0 |  0 |  6 885 |  13 397 |
-| sk | Slovak | 7,401,998 | 0 | 0 | 19,222,784 | 12,734,444 | 5,134,150 | 44,493,376 |
+| sk  | Slovak |  7 401 |  0 |  0 |  19 222 |  12 734 |  5 134 |  44 493 |
-| sl | Slovenian | 900,221 | 0 | 0 | 19,645,598 | 12,240,548 | 17,024,593 | 49,810,960 |
+| sl  | Slovenian |  900 |  0 |  0 |  19 645 |  12 240 |  17 024 |  49 810 |
-| sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2,003,579 | 2,003,579 |
+| sq  | Albanian |  0 |  0 |  0 |  0 |  0 |  2 003 |  2 003 |
-| sr | Serbian | 8,823,894 | 0 | 0 | 0 | 0 | 20,776,850 | 29,600,744 |
+| sr  | Serbian |  8 823 |  0 |  0 |  0 |  0 |  20 776 |  29 600 |
-| sv | Swedish | 8,138,161 | 0 | 0 | 20,585,800 | 13,840,373 | 14,693,861 | 57,258,195 |
+| sv  | Swedish |  8 138 |  0 |  0 |  20 585 |  13 840 |  14 693 |  57 258 |
-| tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21,190,828 | 21,190,828 |
+| tr  | Turkish |  0 |  0 |  0 |  0 |  0 |  21 190 |  21 190 |
-| uk | Ukrainian | 5,054,034 | 0 | 0 | 0 | 0 | 246,059 | 5,300,093 |
+| uk  | Ukrainian |  5 054 |  0 |  0 |  0 |  0 |  246 |  5 300 |
-| vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,473,591 | 1,473,591 |
+| vi  | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1 473 |  1 473 |
-| **Subtotal** |  | 194,055,340 | 20,769,694 | 24,676,725 | 430,195,134 | 265,029,289 | 488,372,783 | 1,423,098,965 |
+| **Subtotal** |  |  194 055 |  20 769 |  24 676 |  430 195 |  265 029 |  488 372 |  1 423 098 |
-| cs | Czech | 84,718,325 | 3,416,272 | 2,315,118 | 20,303,101 | 12,922,658 | 50,688,186 | 174,363,660 |
+| cs  | Czech |  84 718 |  3 416 |  2 315 |  20 303 |  12 922 |  50 688 |  174 363 |
-| **TOTAL** |  | 278,773,665 | 24,185,966 | 26,991,843 | 450,498,235 | 277,951,947 | 539,060,969 | 1,597,462,625 |
+| **TOTAL** |  |  278 773 |  24 185 |  26 991 |  450 498 |  277 951 |  539 060 |  1 597 462 |
  
 N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
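
Comparing the two revisions of the size table, the values in thousands appear to be truncated rather than rounded (e.g. 2,152,724 words is listed as 2 152). As a result, listed values need not add up exactly to the listed totals. The following is a minimal Python sketch of this effect, assuming truncation is the convention used; the function name `to_thousands` is illustrative only and does not come from the corpus documentation.

```python
def to_thousands(words: int) -> int:
    """Truncate a raw word count to whole thousands (assumed convention)."""
    return words // 1000


# Raw word counts copied from the earlier revision of the table.
subtotal = 1_423_098_965   # all foreign languages
czech = 174_363_660        # Czech
total = 1_597_462_625      # TOTAL row

# The raw counts add up exactly.
assert subtotal + czech == total

# After truncation, the sum of the parts can fall short of the truncated total.
listed_sum = to_thousands(subtotal) + to_thousands(czech)
listed_total = to_thousands(total)
print(listed_sum, listed_total)
```

A difference of one unit between a column sum and the listed total is therefore an artefact of the truncation, not an error in the underlying counts.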
Line 138: Line 146:
  
  
 +====Structural attributes====
 +
 +^Structure^Attribute^Description^Values^
 +|doc|doc.id|unique document identifier|text|
 +| |doc.lang|language|ar / be / bg / ca / cs / da / de / el / en / es / et / fi / fr / he / hi / hr / hu / is / it / ja / lt / lv / mk / ms / mt / nb / nl / no / pl / pt / ro / ru / sk / sl / sq / sr / sv / sy / tr / uk / vi / zh|
 +| |doc.version|version|number|
 +| |doc.wordcount|document size in words|number|
 +|div|div.id|text identification|author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE|
 +| |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate|
 +| |div.wordcount|number of words|number|
 +| |div.author|author|last name, first name|
 +| |div.title|full title|text|
 +| |div.publisher|publisher|text|
 +| |div.pubplace|publication place|text|
 +| |div.pubyear|publication year|date|
 +| |div.txtype|text type|discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles|
 +| |div.original|is the text an original?|Yes / No|
 +| |div.srclang|language of the original|ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh|
 +| |div.translator|translator|last name, first name|
 +| |div.transsex|translator's gender|F / M|
 +| |div.authsex|author's gender|F / M|
 +|p|p.id|unique paragraph identifier|text|
 +|s|s.id|unique sentence identifier|text|
 +
 +
 +====Number of texts in the core of the corpus by languages of the text and languages of the original====
 +
 +^ ^  Language of the original  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ ^
 +^ ↓ Language of the text ^ ar ^ be ^ bg ^ ca ^ cs ^ da ^ de ^ en ^ es ^ fi ^ fr ^ hi ^ hr ^ hu ^ it ^ lt ^ lv ^ mk ^ nl ^ no ^ pl ^ pt ^ ro ^ ru ^ sk ^ sl ^ sr ^ sv ^ uk ^ total ^ other ^
 +^ ar |  1 |        1 |    1 |                                              3 |   |
 +^ be |    3 |      8 |    4 |  13 |  1 |    1 |    1 |                3 |      2 |  1 |    1 |  1 |    39 |   |
 +^ bg |      19 |    9 |    1 |  27 |      4 |        2 |            1 |  1 |    2 |        2 |    68 |   |
 +^ ca |        1 |  16 |    3 |  12 |  5 |  1 |  2 |        3 |              1 |    1 |            45 |  1 |
 +^ cs |  1 |  3 |  19 |  1 |  267 |  9 |  134 |  242 |  127 |  24 |  95 |  2 |  26 |  1 |  20 |  1 |  7 |  1 |  30 |  7 |  49 |  21 |    39 |  56 |  3 |  8 |  58 |  6 |  1257 |   |
 +^ da |          6 |  9 |    12 |                                            27 |   |
 +^ de |          85 |    126 |  65 |  10 |  1 |  4 |      1 |  7 |  1 |  1 |    6 |  3 |  3 |  2 |    3 |  1 |    3 |  5 |    327 |   |
 +^ en |          25 |    4 |  125 |      3 |        1 |        2 |    1 |  1 |    6 |      5 |  4 |    177 |  1 |
 +^ es |        1 |  25 |    8 |  29 |  126 |  1 |  6 |        7 |          1 |    4 |    2 |        3 |    213 |  1 |
 +^ fi |          11 |  1 |  1 |  12 |  2 |  25 |          1 |          1 |    1 |            2 |    57 |  1 |
 +^ fr |          36 |    1 |  10 |      83 |        2 |        1 |      2 |    2 |            137 |   |
 +^ hi |          2 |      1 |      1 |  2 |                    1 |                7 |   |
 +^ hr |      1 |    71 |    15 |  52 |  11 |  2 |  4 |    26 |    6 |        7 |  1 |  3 |  4 |    1 |    1 |    8 |    213 |  2 |
 +^ hu |          16 |    5 |  23 |      9 |        1 |              3 |    14 |            71 |   |
 +^ it |          4 |    4 |  21 |  9 |  1 |  3 |        19 |              3 |    1 |        3 |    68 |  1 |
 +^ lt |          8 |    2 |  2 |                1 |  1 |        2 |        1 |          17 |   |
 +^ lv |          22 |    2 |  1 |                1 |  7 |        2 |        1 |          36 |   |
 +^ mk |          15 |    1 |  16 |      1 |    1 |    1 |      2 |  1 |    3 |      2 |      2 |  4 |    49 |   |
 +^ nl |          24 |    3 |  33 |  7 |    3 |        3 |        30 |  2 |  2 |  3 |    3 |        6 |    119 |   |
 +^ no |          11 |    5 |  21 |  4 |    1 |        3 |          6 |    2 |            1 |    54 |   |
 +^ pl |          36 |    8 |  97 |  10 |  2 |  8 |        2 |  1 |  1 |    3 |  1 |  46 |  4 |    6 |  1 |      5 |    231 |  1 |
 +^ pt |          6 |      8 |                            15 |                29 |   |
 +^ ro |          7 |    5 |  12 |  3 |    1 |    1 |    1 |            1 |  1 |          1 |      33 |  3 |
 +^ ru |          9 |    1 |  22 |      2 |                1 |    1 |      22 |      1 |  3 |    62 |  1 |
 +^ sk |          55 |    2 |  5 |  1 |                1 |        2 |        56 |          122 |  18 |
 +^ sl |          7 |    1 |  2 |          1 |                          2 |    2 |    15 |   |
 +^ sr |          11 |    7 |  33 |  9 |    3 |        7 |        2 |    4 |  3 |    10 |  1 |    5 |  2 |    97 |  3 |
 +^ sv |          11 |    4 |  23 |  7 |    2 |        1 |        1 |                  50 |    99 |  1 |
 +^ uk |          6 |    1 |  31 |  3 |    5 |        2 |            5 |      3 |        5 |  6 |  67 |   |
 +^ total |  2 |  6 |  39 |  3 |  810 |  19 |  349 |  950 |  335 |  57 |  241 |  4 |  56 |  2 |  89 |  5 |  18 |  3 |  84 |  22 |  128 |  72 |    119 |  118 |  6 |  26 |  164 |  12 |     |
 +
 +  * The table shows the number of texts in the core of InterCorp.
 +  * For each language that has texts in the core, the row gives the number of texts by language of the original (shown in the column heading). E.g. for Arabic there is one Arabic, one Czech and one German original in the core, i.e. three texts in Arabic in total (see the penultimate column).
 +  * The columns show how many original texts in the language given in the column heading have been translated into other languages; the codes of those languages are in the first column. The last column shows the number of texts whose original language is not represented in the core of InterCorp.
 +  * The diagonal gives the number of original texts in the given language. E.g. for Hungarian and Romanian there are none; for Romanian there is not even a translation of a Romanian original.
  
  
Line 147: Line 219:
  
   * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen
-  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]\\ {{:cnk:intercorp:projectsyndicate.png?direct&319}}
+  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
   * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server
   * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus
Line 182: Line 254:
  
  
- 
- 
-===== Citing InterCorp ===== 
- 
-<WRAP round tip 70%> 
-Rosen, A. – Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), version 7 from 19 Dec 2014//. Ústav Českého národního korpusu FF UK, Praha 2014. Available on-line: http://www.korpus.cz 
- 
-Čermák, F. – Rosen, A. (2012): The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//, 17(3), 411–427. 
-</WRAP>