
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze16ud [2024/09/30 14:29] – [Main features of release 16ud] alexandrrosen en:cnk:intercorp:verze16ud [2024/10/18 20:41] (current) – [References – about UD-annotated InterCorp] alexandrrosen
Line 32: Line 32:
   * In release 16ud, out of the total number of 62 languages (including Czech), **47 are linguistically annotated**; in addition, all such languages are **syntactically annotated**.   * In release 16ud, out of the total number of 62 languages (including Czech), **47 are linguistically annotated**; in addition, all such languages are **syntactically annotated**.
   * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]).   * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]).
-  * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, ie all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/UDPipe]]. Annotation of this release used the following models: afrikaans-afribooms-ud-2.12-230717, +  * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((Annotation of this release used the following models: afrikaans-afribooms-ud-2.12-230717, 
 arabic-padt-ud-2.12-230717,  arabic-padt-ud-2.12-230717, 
 armenian-armtdp-ud-2.12-230717,  armenian-armtdp-ud-2.12-230717, 
Line 95: Line 95:
 In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16ud published in September 2024 is 4 746 mil. words. This number includes 382 mil. words in the aligned foreign language texts in the core part and 4 746 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus is shown in the following charts. The charts show the volumes in millions of words.+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 16ud published in September 2024 is 5 257 mil. words. This number includes 382 mil. words in the aligned foreign language texts in the core part and 4 746 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus is shown in the following charts. The charts show the volumes in millions of words.
  
  
Line 187: Line 187:
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]|    1|   6 556|   6 594.8|   32 635.9|   38 097.3| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]|    1|   6 556|   6 594.8|   32 635.9|   38 097.3|
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    117|   116 660|   25 976.1|   123 357.7|   165 696.1| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    117|   116 660|   25 976.1|   123 357.7|   165 696.1|
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    310|   138 571|   33 957.7|   258 555.1|   315 325.2| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]|    310|   138 571|   33 957.7|   258 555.1|   315 325.2| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]|    1|    146|    121.7|    622.1|    797.9| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]|    1|    146|    121.7|    622.1|    797.9| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]|    1|   33 935|   27 608.8|   129 458.6|   172 973.7| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]|    1|   33 935|   27 608.8|   129 458.6|   172 973.7| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]|    8|    61|    116.6|    832.7|    988.1| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]|    8|    61|    116.6|    832.7|    988.1| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]|    327|   35 447|   30 758.6|   162 943.8|   208 413.5| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]|    327|   35 447|   30 758.6|   162 943.8|   208 413.5| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]|    13|    13|    41.6|    466.3|    586.3| +^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]|    13|    13|    41.6|    466.3|    586.3| 
-^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]|    95|   125 933|   34 510.0|   178 525.6|   240 411.9| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]|    95|   125 933|   34 510.0|   178 525.6|   240 411.9| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]|    1|    7|    3.9|    23.5|    30.6| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]|    1|    7|    3.9|    23.5|    30.6| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]|    1|   8 350|   8 112.7|   37 824.9|   49 694.7| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]|    1|   8 350|   8 112.7|   37 824.9|   49 694.7| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]|    1|   1 135|   1 497.9|   7 374.2|   9 299.9| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]|    1|   1 135|   1 497.9|   7 374.2|   9 299.9| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]|    194|   134 401|   33 361.2|   226 224.9|   286 343.4| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]|    194|   134 401|   33 361.2|   226 224.9|   286 343.4| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]|    37|   2 363|   2 296.7|   16 138.6|   18 020.3| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]|    37|   2 363|   2 296.7|   16 138.6|   18 020.3| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]|    1|    204|    198.4|    871.1|   1 179.0| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]|    1|    204|    198.4|    871.1|   1 179.0| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]|    1|    4|    4.1|    13.9|    19.2| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]|    1|    4|    4.1|    13.9|    19.2| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]|    1|   1 605|   1 641.1|   5 964.3|   7 294.3| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]|    1|   1 605|   1 641.1|   5 964.3|   7 294.3| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]|    28|   87 642|   3 622.1|   34 786.3|   45 134.4| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]|    28|   87 642|   3 622.1|   34 786.3|   45 134.4| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]|    78|   86 356|   3 023.6|   35 425.1|   45 293.5| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]|    78|   86 356|   3 023.6|   35 425.1|   45 293.5| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]|    109|   3 541|   3 907.8|   23 993.1|   30 898.6| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]|    109|   3 541|   3 907.8|   23 993.1|   30 898.6| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]|    1|    285|    365.3|   1 258.4|   1 793.5| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]|    1|    285|    365.3|   1 258.4|   1 793.5| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]|    1|   1 496|   1 712.1|   7 828.0|   10 573.3| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]|    1|   1 496|   1 712.1|   7 828.0|   10 573.3| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]|    1|   8 963|    784.8|   13 805.0|   16 643.6| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]|    1|   8 963|    784.8|   13 805.0|   16 643.6| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]|    232|   132 791|   33 065.4|   233 111.3|   284 402.6| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]|    232|   132 791|   33 065.4|   233 111.3|   284 402.6| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]|    105|   9 163|   8 344.6|   48 750.2|   61 120.3| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]|    105|   9 163|   8 344.6|   48 750.2|   61 120.3| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]|    360|   140 055|   41 282.4|   227 242.6|   300 207.8| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]|    360|   140 055|   41 282.4|   227 242.6|   300 207.8| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]|    107|   147 063|   46 510.1|   280 566.2|   355 121.8| +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]|    107|   147 063|   46 510.1|   280 566.2|   355 121.8| 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]|    2|    2|    1.7|    13.6|    17.7| +^[[https://en.wikipedia.org/wiki/Romani_language|rn]]|    2|    2|    1.7|    13.6|    17.7| 
-^[[https://en.wikipedia.org/wiki/Romani_language|rn]]|    55|   102 904|   39 561.2|   235 702.3|   295 301.3|+^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]|    55|   102 904|   39 561.2|   235 702.3|   295 301.3|
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]|    184|   32 839|   22 985.2|   122 130.4|   163 120.7| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]|    184|   32 839|   22 985.2|   122 130.4|   163 120.7|
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]|    1|    499|    522.5|   2 313.4|   3 021.8| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]|    1|    499|    522.5|   2 313.4|   3 021.8|
Line 254: Line 254:
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]|  – |  – |  – |  – |  – |  – |  – |    32 635.9|  – ^    32 635.9^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]|  – |  – |  – |  – |  – |  – |  – |    32 635.9|  – ^    32 635.9^
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    6 714.9|     44.4|     200.5|    15 264.2|     542.6|    10 109.3|  – |    90 481.8|  – ^    123 357.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    6 714.9|     44.4|     200.5|    15 264.2|     542.6|    10 109.3|  – |    90 481.8|  – ^    123 357.7^
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|    20 454.4|     194.3|    3 687.5|    26 298.4|     762.6|    17 186.4|    3 044.3|    181 033.4|    5 893.7^    258 555.1^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]|    20 454.4|     194.3|    3 687.5|    26 298.4|     762.6|    17 186.4|    3 044.3|    181 033.4|    5 893.7^    258 555.1^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]|  – |  – |  – |  – |  – |  – |  – |     622.1|  – ^     622.1^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]|  – |  – |  – |  – |  – |  – |  – |     622.1|  – ^     622.1^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]|  – |  – |  – |  – |  – |  – |  – |    129 458.6|  – ^    129 458.6^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]|  – |  – |  – |  – |  – |  – |  – |    129 458.6|  – ^    129 458.6^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]|     402.8|  – |  – |  – |  – |  – |  – |     429.9|  – ^     832.7^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]|     402.8|  – |  – |  – |  – |  – |  – |     429.9|  – ^     832.7^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]|    22 763.6|     242.6|    1 523.4|  – |     569.9|  – |  – |    137 844.3|  – ^    162 943.8^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]|    22 763.6|     242.6|    1 523.4|  – |     569.9|  – |  – |    137 844.3|  – ^    162 943.8^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]|     405.3|     36.6|     24.4|  – |  – |  – |  – |  – |  – ^     466.3^ +^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]|     405.3|     36.6|     24.4|  – |  – |  – |  – |  – |  – ^     466.3^ 
-^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]|    6 890.1|     28.9|  – |    17 851.3|  – |    12 187.9|  – |    141 559.0|     8.4^    178 525.6^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]|    6 890.1|     28.9|  – |    17 851.3|  – |    12 187.9|  – |    141 559.0|     8.4^    178 525.6^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]|  – |  – |  – |  – |  – |  – |  – |     23.5|  – ^     23.5^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]|  – |  – |  – |  – |  – |  – |  – |     23.5|  – ^     23.5^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]|  – |  – |  – |  – |  – |  – |  – |    37 824.9|  – ^    37 824.9^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]|  – |  – |  – |  – |  – |  – |  – |    37 824.9|  – ^    37 824.9^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]|  – |  – |  – |  – |  – |  – |  – |    7 374.2|  – ^    7 374.2^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]|  – |  – |  – |  – |  – |  – |  – |    7 374.2|  – ^    7 374.2^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]|    17 435.8|     50.6|     647.8|    23 892.0|     685.2|    15 511.4|    2 750.7|    163 859.9|    1 391.5^    226 224.9^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]|    17 435.8|     50.6|     647.8|    23 892.0|     685.2|    15 511.4|    2 750.7|    163 859.9|    1 391.5^    226 224.9^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]|    3 766.7|     64.9|     163.1|  – |  – |  – |  – |    12 141.5|     2.5^    16 138.6^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]|    3 766.7|     64.9|     163.1|  – |  – |  – |  – |    12 141.5|     2.5^    16 138.6^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]|  – |  – |  – |  – |  – |  – |  – |     871.1|  – ^     871.1^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]|  – |  – |  – |  – |  – |  – |  – |     871.1|  – ^     871.1^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]|  – |  – |  – |  – |  – |  – |  – |     13.9|  – ^     13.9^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]|  – |  – |  – |  – |  – |  – |  – |     13.9|  – ^     13.9^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]|  – |  – |  – |  – |  – |  – |  – |    5 964.3|  – ^    5 964.3^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]|  – |  – |  – |  – |  – |  – |  – |    5 964.3|  – ^    5 964.3^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]|     669.1|     7.2|     17.4|    17 175.1|     471.2|    11 198.5|  – |    5 247.7|  – ^    34 786.3^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]|     669.1|     7.2|     17.4|    17 175.1|     471.2|    11 198.5|  – |    5 247.7|  – ^    34 786.3^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]|    3 207.6|     362.1|     66.9|    17 519.4|     536.7|    11 682.0|  – |    2 050.4|  – ^    35 425.1^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]|    3 207.6|     362.1|     66.9|    17 519.4|     536.7|    11 682.0|  – |    2 050.4|  – ^    35 425.1^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]|    8 794.5|     86.5|  – |  – |  – |  – |  – |    15 112.0|  – ^    23 993.1^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]|    8 794.5|     86.5|  – |  – |  – |  – |  – |    15 112.0|  – ^    23 993.1^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]|  – |  – |  – |  – |  – |  – |  – |    1 258.4|  – ^    1 258.4^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]|  – |  – |  – |  – |  – |  – |  – |    1 258.4|  – ^    1 258.4^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]|  – |  – |  – |  – |  – |  – |  – |    7 828.0|  – ^    7 828.0^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]|  – |  – |  – |  – |  – |  – |  – |    7 828.0|  – ^    7 828.0^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]|  – |  – |  – |    13 805.0|  – |  – |  – |  – |  – ^    13 805.0^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]|  – |  – |  – |    13 805.0|  – |  – |  – |  – |  – ^    13 805.0^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]|    17 229.8|     356.4|    1 193.5|    23 401.1|     716.8|    15 555.9|    2 952.8|    170 892.9|     812.1^    233 111.3^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]|    17 229.8|     356.4|    1 193.5|    23 401.1|     716.8|    15 555.9|    2 952.8|    170 892.9|     812.1^    233 111.3^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]|    7 690.7|     138.1|     392.0|  – |     723.9|  – |  – |    39 805.6|  – ^    48 750.2^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]|    7 690.7|     138.1|     392.0|  – |     723.9|  – |  – |    39 805.6|  – ^    48 750.2^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]|    27 056.2|     283.2|     754.2|    19 482.9|     576.1|    12 662.8|    2 367.5|    164 059.8|  – ^    227 242.6^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]|    27 056.2|     283.2|     754.2|    19 482.9|     576.1|    12 662.8|    2 367.5|    164 059.8|  – ^    227 242.6^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]|    7 204.0|     81.3|  – |    24 385.0|     706.2|    15 188.4|    2 782.5|    229 480.2|     738.5^    280 566.2^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]|    7 204.0|     81.3|  – |    24 385.0|     706.2|    15 188.4|    2 782.5|    229 480.2|     738.5^    280 566.2^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]|     8.4|     5.2|  – |  – |  – |  – |  – |  – |  – ^     13.6^ +^[[https://en.wikipedia.org/wiki/Romani_language|rn]]|     8.4|     5.2|  – |  – |  – |  – |  – |  – |  – ^     13.6^ 
-^[[https://en.wikipedia.org/wiki/Romani_language|rn]]|    4 132.6|     64.1|  – |    8 043.5|  – |    9 426.4|    2 725.2|    211 310.4|  – ^    235 702.3^ +^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]|    4 132.6|     64.1|  – |    8 043.5|  – |    9 426.4|    2 725.2|    211 310.4|  – ^    235 702.3^ 
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]|    11 757.6|     143.8|     518.7|  – |     565.5|  – |  – |    104 831.9|    4 312.8^    122 130.4^+^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]|    11 757.6|     143.8|     518.7|  – |     565.5|  – |  – |    104 831.9|    4 312.8^    122 130.4^
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]|  – |  – |  – |  – |  – |  – |  – |    2 313.4|  – ^    2 313.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]|  – |  – |  – |  – |  – |  – |  – |    2 313.4|  – ^    2 313.4^
 ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]|    7 626.6|     402.2|     558.0|    18 398.8|     560.8|    12 727.0|  – |    34 589.4|  – ^    74 862.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]|    7 626.6|     402.2|     558.0|    18 398.8|     560.8|    12 727.0|  – |    34 589.4|  – ^    74 862.7^
Line 300: Line 300:
 ==== Detailed statistics ==== ==== Detailed statistics ====
  
-In addition to the corpus size date, the table includes also measures of statistical complexity and diversity. For languages without linguistic annotation, the table shows only the wordform-based measure of lexical diversity (lexDivWord).+In addition to the corpus size data, the table includes also measures of statistical complexity and lexical diversity. For languages without linguistic annotation, the table shows only the wordform-based measure of lexical diversity (lexDivWord).
  
 ^  [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]]  ^  Collection  ^  Number of  ^^  Thousands of  ^^^  [[en:pojmy:lexikalni_bohatost|Lexical diversity]]  ^^  [[en:pojmy:syntakticka_komplexita|Syntactic complexity]] (average)  ^^^^^^ ^  [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]]  ^  Collection  ^  Number of  ^^  Thousands of  ^^^  [[en:pojmy:lexikalni_bohatost|Lexical diversity]]  ^^  [[en:pojmy:syntakticka_komplexita|Syntactic complexity]] (average)  ^^^^^^
Line 378: Line 378:
 ^:::|Core-misc|  2|  2|  3.5|  44.4|  52.2|  733.0|  532.9|  12.820|  2.148|  1.051|  4.791|  1.821|  2.385| ^:::|Core-misc|  2|  2|  3.5|  44.4|  52.2|  733.0|  532.9|  12.820|  2.148|  1.051|  4.791|  1.821|  2.385|
 ^:::|Acquis|  1|  18 563|  1 310.5|  15 264.2|  19 702.1|  556.9|  380.4|  13.209|  2.369|  0.886|  6.990|  2.588|  2.647| ^:::|Acquis|  1|  18 563|  1 310.5|  15 264.2|  19 702.1|  556.9|  380.4|  13.209|  2.369|  0.886|  6.990|  2.588|  2.647|
-^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|Bible|  2|  66|  48.0|  542.6|  675.3|  529.0|  351.4|  13.324|  1.911|  0.871|  4.231|  1.534|  2.511|+^:::|Bible|  2|  66|  48.0|  542.6|  675.3|  529.0|  351.4|  13.324|  1.911|  0.871|  4.231|  1.534|  2.511|
 ^:::|Europarl|  1|  67 019|  675.6|  10 109.3|  11 838.6|  670.8|  462.7|  15.260|  2.483|  1.242|  6.924|  2.670|  2.395| ^:::|Europarl|  1|  67 019|  675.6|  10 109.3|  11 838.6|  670.8|  462.7|  15.260|  2.483|  1.242|  6.924|  2.670|  2.395|
 ^:::|Subtitles|  1|  30 900|  23 262.2|  90 481.8|  124 969.7|  666.5|  444.7|  3.909|  1.244|  0.242|  1.404|  0.513|  1.689| ^:::|Subtitles|  1|  30 900|  23 262.2|  90 481.8|  124 969.7|  666.5|  444.7|  3.909|  1.244|  0.242|  1.404|  0.513|  1.689|
Line 486: Line 486:
 ^:::|PressEurop|  7|  6 991|  160.6|  2 725.2|  3 192.6|  546.7|  429.5|  17.486|  2.219|  1.017|  8.508|  2.772|  2.492| ^:::|PressEurop|  7|  6 991|  160.6|  2 725.2|  3 192.6|  546.7|  429.5|  17.486|  2.219|  1.017|  8.508|  2.772|  2.492|
 ^:::|Subtitles|  1|  45 407|  38 108.1|  211 310.4|  266 731.5|  509.0|  351.2|  5.572|  1.388|  0.383|  2.129|  0.795|  1.954| ^:::|Subtitles|  1|  45 407|  38 108.1|  211 310.4|  266 731.5|  509.0|  351.2|  5.572|  1.388|  0.383|  2.129|  0.795|  1.954|
-^:::|Core-nonfict|  10|  10|  30.6|  518.7|  625.2|  645.0|  495.9|  17.765|  2.613|  1.223|  8.126|  2.801|  2.603|+^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]|Core-nonfict|  10|  10|  30.6|  518.7|  625.2|  645.0|  495.9|  17.765|  2.613|  1.223|  8.126|  2.801|  2.603|
 ^:::|Core-fiction|  144|  144|  1 043.5|  11 757.6|  14 913.7|  633.0|  501.9|  11.643|  1.959|  0.865|  4.203|  1.557|  2.386| ^:::|Core-fiction|  144|  144|  1 043.5|  11 757.6|  14 913.7|  633.0|  501.9|  11.643|  1.959|  0.865|  4.203|  1.557|  2.386|
 ^:::|Core-misc|  6|  6|  12.8|  143.8|  180.7|  633.2|  484.5|  11.439|  1.947|  0.870|  4.378|  1.718|  2.265| ^:::|Core-misc|  6|  6|  12.8|  143.8|  180.7|  633.2|  484.5|  11.439|  1.947|  0.870|  4.378|  1.718|  2.265|
Line 537: Line 537:
 ===== Metadata ===== ===== Metadata =====
  
-Metadata such as the text's title, author, or source language are available for most texts as attributes of structural elements such as text or sentence. To view the list of such attributes and to select those that should be displayed in the KonText query results, choose the relevant InterCorp 16ud language in the KonText corpus search tool, and then go to ''Structures'' or ''References'' in the  ''Corpus-specific settings''  menu. +Metadata such as the text's title, author, or source language are available for most texts as attributes of structural elements such as text or sentence. To view the list of such attributes and to select those that should be displayed in the KonText query results, choose InterCorp 16ud, the relevant language, and then in the ''View'' menu select ''Corpus-specific settings'' and go to ''Structures'' or ''References''.
  
  
Line 546: Line 546:
  
 ==== Texts: ==== ==== Texts: ====
-  * The latest (13th corrected) issue of the Czech Ecumenical Translation of the Bible could be included to the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš. +  * The 13th corrected issue of the Czech Ecumenical Translation of the Bible could be included to the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš.
   * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to  Adrian Barentsen   * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to  Adrian Barentsen
   * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]   * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
Line 571: Line 571:
  
 * [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel) * [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] (thanks to Jana Straková and Milan Straka, Dan Zeman and Martin Popel)
 +
 +===== References – about UD-annotated InterCorp =====
 +
 +Rosen, A. (2024): Lexical and syntactic variability
 +of languages and text genres – a corpus-based study. [[https://www.youtube.com/watch?v=E2ujmqt7Q2E|Recording]] from 14 October 2024: [[https://zil.ipipan.waw.pl/|Natural Language Processing Seminar]] organised by the [[https://zil.ipipan.waw.pl|Linguistic Engineering Group]] at the [[https://ipipan.waw.pl|Institute of Computer Science]] [[https://pan.pl|Polish Academy of Sciences]], [[https://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2024-10-14.pdf|slides]].
 +
 +Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. [[https://www.youtube.com/watch?v=wJrCez_XPQY|Video]], [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/C4%20Nadvornikova%20Analyse%20contrastiv%20e%20de%20la%20complexité%20syntaxique.pdf|slides]]
 +
 +Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024, [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/2024_UDCM_Wwa.pdf|slides]].
 +
 +Alexandr Rosen (2023). The InterCorp parallel corpus with a uniform annotation for all languages. Jazykovedný časopis, 74(1):254–265. [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/rosen-slovko-2023.pdf|Paper]], [[https://owncloud.korpus.cz/s/wLxfrmwKCACX73W|slides]].
 +
  
 ===== How to cite ===== ===== How to cite =====
Line 577: Line 589:
  
 <WRAP round info 50%> <WRAP round info 50%>
-Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 13, no. 3, p. 411–427 +Čermák, František & Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics// 13(3). 411–427.
 ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).  ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]). 
  
Line 584: Line 596:
 When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
  
-Rosen, A., Šimčík, B., Vavřín, M., Zasina, A. J. (2024). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 16ud of 17 September 2024//. Institute of the Czech National Corpus, Charles University, Prague 2024. Available on-line: https://kontext.korpus.cz/ +Rosen, Alexandr, Bohumil Šimčík, Martin Vavřín & Adrian Jan Zasina. 2024. //The InterCorp Corpus – Czech((Insert languages actually used.)), version 16ud of 17 September 2024//. Institute of the Czech National Corpus, Charles University, Prague 2024. Available on-line: https://kontext.korpus.cz/
  
 </WRAP> </WRAP>