Differences
This shows you the differences between two versions of the page.
en:cnk:intercorp:verze9 [2016/06/30 16:56] – created Adrian Zasina | en:cnk:intercorp:verze9 [2019/10/06 20:43] (current) – [Taggers/lemmatizers:] Michal Škrabal | ||
---|---|---|---|
Line 6: | Line 6: | ||
<WRAP right> | <WRAP right> | ||
^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ | ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ | ||
- | ^ Positions ^ Number of tokens | 120 443 181 | 117 981 673 | 278 445 878 | 1 556 840 965 | | + | ^ Positions ^ Number of tokens | 120,443,181 | 117,981,673 | 278,445,878 | 1,556,840,965 | |
- | ^ ::: ^ Number of word forms | 96 956 714 | 89 645 545 | 231 501 606 | 1 228 896 294 | | + | ^ ::: ^ Number of word forms | 96,956,714 | 89,645,545 | 231,501,606 | 1,228,896,294 | |
- | ^ Structural attributes ^ Number of documents | 1430 | 5 | 2 934 | 89 | | + | ^ Structural attributes ^ Number of documents | 1430 | 5 | 2,934 | 89 | |
- | ^ ::: ^ Number of div | 1 430 | 111 263 | 2 934 | 1 849 184 | | + | ^ ::: ^ Number of div | 1,430 | 111,263 | 2,934 | 1,849,184 | |
- | ^ ::: ^ Number of sentences | 8 308 814 | 13 588 082 | 17 210 601 | 143 478 514 | | + | ^ ::: ^ Number of sentences | 8,308,814 | 13,588,082 | 17,210,601 | 143,478,514 | |
^ Further information ^ reference | YES ^^^^ | ^ Further information ^ reference | YES ^^^^ | ||
^ ::: ^ representative | NO ^^^^ | ^ ::: ^ representative | NO ^^^^ | ||
Line 25: | Line 25: | ||
After [[http:// | After [[http:// | ||
- | InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http:// | + | InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http:// |
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. | After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested. | ||
A new release of InterCorp is usually published once a year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). | A new release of InterCorp is usually published once a year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). | ||
- | |||
===== References ===== | ===== References ===== | ||
Line 44: | Line 43: | ||
When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, | When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, | ||
- | Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), | + | Rosen, A., Vavřín, M.: //Korpus InterCorp – English, German((Insert actually used languages.)), |
</ | </ | ||
Line 51: | Line 50: | ||
The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called **collections**. The choice in the present release includes: | The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called **collections**. The choice in the present release includes: | ||
- | * Political commentaries published by [[http:// | + | * Political commentaries published by [[http:// |
* A package of legal texts of the European Union from the [[http:// | * A package of legal texts of the European Union from the [[http:// | ||
* Proceedings of the European Parliament dated 2007–2011 from the [[http:// | * Proceedings of the European Parliament dated 2007–2011 from the [[http:// | ||
* Film subtitles from the [[http:// | * Film subtitles from the [[http:// | ||
- | These texts have been aligned automatically: | + | These texts have been aligned automatically: |
- | Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), | + | Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), |
Line 66: | Line 65: | ||
[{{: | [{{: | ||
- | |||
===== Corpus size in thousands of words ===== | ===== Corpus size in thousands of words ===== | ||
^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ | ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^ | ||
- | | ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34 | | + | | ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34 | |
- | | be | Belarusian | 3 025 | 0 | 0 | 0 | 0 | 0 | 3 025 | | + | | be | Belarusian | 3,025 | 0 | 0 | 0 | 0 | 0 | 3,025 | |
- | | bg | Bulgarian | 6 007 | 0 | 0 | 13 816 | 9 083 | 0 | 28 907 | | + | | bg | Bulgarian | 6,007 | 0 | 0 | 13,816 | 9,083 | 0 | 28,907 | |
- | | ca | Catalan | 4 632 | 0 | 0 | 0 | 0 | 0 | 4 632 | | + | | ca | Catalan | 4,632 | 0 | 0 | 0 | 0 | 0 | 4,632 | |
- | | da | Danish | 3 556 | 0 | 0 | 21 679 | 13 915 | 14 429 | 53 581 | | + | | da | Danish | 3,556 | 0 | 0 | 21,679 | 13,915 | 14,429 | 53,581 | |
- | | de | German | 31 168 | 3 725 | 2 482 | 21 723 | 13 089 | 8 366 | 80 556 | | + | | de | German | 31,168 | 3,725 | 2,482 | 21,723 | 13,089 | 8,366 | 80,556 | |
- | | el | Greek | 0 | 0 | 0 | 25 069 | 15 403 | 23 714 | 64 187 | | + | | el | Greek | 0 | 0 | 0 | 25,069 | 15,403 | 23,714 | 64,187 | |
- | | en | English | 21 208 | 3 818 | 2 670 | 24 207 | 15 580 | 52 101 | 119 586 | | + | | en | English | 21,208 | 3,818 | 2,670 | 24,207 | 15,580 | 52,101 | 119,586 | |
- | | es | Spanish | 19 310 | 4 324 | 2 816 | 27 001 | 15 885 | 36 378 | 105 716 | | + | | es | Spanish | 19,310 | 4,324 | 2,816 | 27,001 | 15,885 | 36,378 | 105,716 | |
- | | et | Estonian | 0 | 0 | 0 | 15 962 | 10 899 | 10 296 | 37 158 | | + | | et | Estonian | 0 | 0 | 0 | 15,962 | 10,899 | 10,296 | 37,158 | |
- | | fi | Finnish | 3 645 | 0 | 0 | 16 455 | 10 175 | 15 097 | 45 373 | | + | | fi | Finnish | 3,645 | 0 | 0 | 16,455 | 10,175 | 15,097 | 45,373 | |
- | | fr | French | 12 406 | 4 393 | 2 928 | 27 351 | 17 178 | 25 961 | 90 219 | | + | | fr | French | 12,406 | 4,393 | 2,928 | 27,351 | 17,178 | 25,961 | 90,219 | |
- | | he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16 221 | 16 221 | | + | | he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221 | 16,221 | |
| hi | Hindi | 408 | 0 | 0 | 0 | 0 | 0 | 408 | | | hi | Hindi | 408 | 0 | 0 | 0 | 0 | 0 | 408 | | ||
- | | hr | Croatian | 19 980 | 0 | 0 | 0 | 0 | 19 042 | 39 023 | | + | | hr | Croatian | 19,980 | 0 | 0 | 0 | 0 | 19,042 | 39,023 | |
- | | hu | Hungarian | 5 818 | 0 | 0 | 19 176 | 12 306 | 21 239 | 58 541 | | + | | hu | Hungarian | 5,818 | 0 | 0 | 19,176 | 12,306 | 21,239 | 58,541 | |
- | | is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1 584 | 1 584 | | + | | is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1,584 | 1,584 | |
- | | it | Italian | 8 694 | 651 | 2 707 | 24 849 | 15 489 | 14 653 | 67 046 | | + | | it | Italian | 8,694 | 651 | 2,707 | 24,849 | 15,489 | 14,653 | 67,046 | |
| ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113 | | | ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113 | | ||
- | | lt | Lithuanian | 358 | 0 | 0 | 18 392 | 11 212 | 557 | 30 521 | | + | | lt | Lithuanian | 358 | 0 | 0 | 18,392 | 11,212 | 557 | 30,521 | |
- | | lv | Latvian | 1 336 | 0 | 0 | | + | | lv | Latvian | 1,666 | 0 | 0 | |
- | | mk | Macedonian | 4 663 | 0 | 0 | 0 | 0 | 1 877 | 6 540 | | + | | mk | Macedonian | 4,663 | 0 | 0 | 0 | 0 | 1,877 | 6,540 | |
- | | ms | Malay | 0 | 0 | 0 | 0 | 0 | 3 520 | 3 520 | | + | | ms | Malay | 0 | 0 | 0 | 0 | 0 | 3,520 | 3,520 | |
- | | mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 0 | 14 133 | | + | | mt | Maltese | 0 | 0 | 0 | 14,133 | 0 | 0 | 14,133 | |
- | | nl | Dutch | 11 444 | 314 | 2 955 | 24 746 | 15 563 | 29 362 | 84 386 | | + | | nl | Dutch | 11,444 | 314 | 2,955 | 24,746 | 15,563 | 29,362 | 84,386 | |
- | | no | Norwegian | 4 965 | 0 | 0 | 0 | 0 | 0 | 4 965 | | + | | no | Norwegian | 4,965 | 0 | 0 | 0 | 0 | 0 | 4,965 | |
- | | pl | Polish | 21 433 | 0 | 2 378 | 20 627 | 12 811 | 26 572 | 83 822 | | + | | pl | Polish | 21,433 | 0 | 2,378 | 20,627 | 12, | 26,572 | 83,822 | |
- | | pt | Portuguese | 2 605 | 369 | 2 999 | 28 602 | 16 484 | 43 391 | 94 454 | | + | | pt | Portuguese | 2,605 | 369 | 2,999 | 28,602 | 16,484 | 43,391 | 94,454 | |
| rn | Romani | 5 | 0 | 0 | 0 | 0 | 0 | 5 | | | rn | Romani | 5 | 0 | 0 | 0 | 0 | 0 | 5 | | ||
- | | ro | Romanian | 3 432 | 0 | 2 737 | 8 199 | 9 446 | 34 128 | 57 944 | | + | | ro | Romanian | 3,432 | 0 | 2,737 | 8,199 | 9,446 | 34,128 | 57,944 | |
- | | ru | Russian | 4 788 | 3 174 | 0 | 0 | 0 | 6 885 | 14 848 | | + | | ru | Russian | 4,788 | 3,174 | 0 | 0 | 0 | 6,885 | 14,848 | |
- | | sk | Slovak | 8 066 | 0 | 0 | 19 222 | 12 734 | 5 134 | 45 158 | | + | | sk | Slovak | 8,066 | 0 | 0 | 19,222 | 12,734 | 5,134 | 45,158 | |
- | | sl | Slovenian | 2 057 | 0 | 0 | 19 645 | 12 240 | 17 024 | 50 968 | | + | | sl | Slovenian | 2,057 | 0 | 0 | 19,645 | 12,240 | 17,024 | 50,968 | |
- | | sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2 003 | 2 003 | | + | | sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2,003 | 2,003 | |
- | | sr | Serbian | 9 886 | 0 | 0 | 0 | 0 | 20 720 | 30 607 | | + | | sr | Serbian | 9,886 | 0 | 0 | 0 | 0 | 20,720 | 30,607 | |
- | | sv | Swedish | 8 959 | 0 | 0 | 20 585 | 13 840 | 14 693 | 58 079 | | + | | sv | Swedish | 8,959 | 0 | 0 | 20,585 | 13,840 | 14,693 | 58,079 | |
- | | tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21 190 | 21 190 | | + | | tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21,190 | 21,190 | |
- | | uk | Ukrainian | 7 597 | 0 | 0 | 0 | 0 | 246 | 7 843 | | + | | uk | Ukrainian | 7,597 | 0 | 0 | 0 | 0 | 246 | 7,843 | |
- | | vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1 473 | 1 473 | | + | | vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,473 | 1,473 | |
- | | **Subtotal** | | 231 501 | 20 769 | 24 676 | 430 160 | 265 022 | 488 266 | 1 460 397 | | + | | **Subtotal** | | 231,501 | 20,769 | 24,676 | 430,160 | 265,022 | 488,266 | 1,460,397 | |
- | | cs | Czech | 96 956 | 3 416 | 2 315 | 20 303 | 12 922 | 50 688 | 186 602 | | + | | cs | Czech | 96,956 | 3,416 | 2,315 | 20,303 | 12,922 | 50,688 | 186,602 | |
- | | **TOTAL** | | 328 458 | 24 186 | 26 991 | 450 463 | 277 945 | 538 954 | 1 647 000 | | + | | **TOTAL** | | 328,458 | 24,186 | 26,991 | 450,463 | 277,945 | 538,954 | 1,647,000 | |
N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. | ||
Line 124: | Line 122: | ||
^ Croatian | ✔ | ✔ | | ^ Croatian | ✔ | ✔ | | ||
^ Czech | ✔ | ✔ | [[http:// | ^ Czech | ✔ | ✔ | [[http:// | ||
- | ^ Dutch | ✔ | | + | ^ Dutch | ✔ | |
^ English | ✔ | ^ English | ✔ | ||
^ Estonian | ✔ | ✔ | [[http:// | ^ Estonian | ✔ | ✔ | [[http:// | ||
- | ^ Finnish | ✔ | ✔ | | + | ^ Finnish | ✔ | ✔ | |
^ French | ✔ | ✔ | [[http:// | ^ French | ✔ | ✔ | [[http:// | ||
- | ^ German | ✔ | ✔ | [[http:// | + | ^ German | ✔ | ✔ | [[http:// |
^ Hungarian | ✔ | | ^ Hungarian | ✔ | | ||
^ Icelandic | ✔ | ✔ | [[http:// | ^ Icelandic | ✔ | ✔ | [[http:// | ||
Line 144: | Line 142: | ||
^ Spanish | ✔ | ✔ | [[ftp:// | ^ Spanish | ✔ | ✔ | [[ftp:// | ||
^ Swedish | ✔ | ✔ | [[http:// | ^ Swedish | ✔ | ✔ | [[http:// | ||
+ | |||
Queries that include contracted forms may fail in tagged or lemmatized texts. This includes forms such as // | Queries that include contracted forms may fail in tagged or lemmatized texts. This includes forms such as // | ||
Morphological tags including characters with a special meaning in regular expressions, | Morphological tags including characters with a special meaning in regular expressions, | ||
- | |||
- | |||
====Structural attributes==== | ====Structural attributes==== | ||
Line 196: | Line 193: | ||
==== Pre-processing ==== | ==== Pre-processing ==== | ||
- | * parallel | + | * Parallel |
* Aligner [[http:// | * Aligner [[http:// | ||
* Sentence splitter for Czech by Pavel Květoň | * Sentence splitter for Czech by Pavel Květoň | ||
Line 215: | Line 212: | ||
* [[https:// | * [[https:// | ||
* [[http:// | * [[http:// | ||
+ | * [[https:// | ||
+ | * [[https:// | ||