====== InterCorp Release 16ud – Universal Dependencies ======

^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
^ Positions ^ Number of tokens | 154 391 397 | 362 409 841 | 461 601 109 | 5 732 688 636 |
^ ::: ^ Number of word forms | 124 681 856 | 272 671 041 | 385 829 717 | 4 473 418 338 |
^ Structural attributes ^ Number of documents | 1 812 | 33 | 4 643 | 338 |
^ ::: ^ Number of texts | 1 812 | 162 613 | 4 643 | 2 662 675 |
^ ::: ^ Number of sentences | 10 691 340 | 50 729 559 | 28 684 709 | 790 046 584 |
^ Further information ^ reference | YES ^^^^
^ ::: ^ representative | NO ^^^^
^ ::: ^ publication date | 2024 ^^^^
^ ::: ^ foreign languages | 61 ^^^^
^ ::: ^ tagged languages | 47 ^^^^
^ ::: ^ lemmatized languages | 47 ^^^^
^ ::: ^ syntactically annotated languages| 47 ^^^^

===== Access to the texts =====

After [[https://www.korpus.cz/signup|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]]; for one of the ICNC corpora it is also available [[en:kurz:uvod|in English]], and for InterCorp there is [[en:kurz:hledani_v_paralelnim_korpusu|a summary in English]].

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files consisting of shuffled pairs of sentences. Please contact [[alexandr.rosen@ff.cuni.cz|Alexandr Rosen]] if you are interested.

A new release of InterCorp is usually published once a year. With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow.
Previous versions remain available (starting with release 6).

===== Main features of release 16ud =====

  * For a detailed description of UD as used in the annotation of InterCorp see the **[[en:pojmy:ud|Universal Dependencies]]** entry in the [[en:pojmy:prehled_pojmu|glossary]].
  * After 13ud, 16ud is the second release of InterCorp featuring linguistic annotation according to the [[en:pojmy:ud|Universal Dependencies]] scheme.
  * Release 16ud is the first CNC corpus to feature the metrics of **[[en:pojmy:syntakticka_komplexita|syntactic complexity]]** and **[[en:pojmy:lexikalni_bohatost|lexical diversity]]**.
  * In release 16ud, out of the total number of 62 languages (including Czech), **47 are linguistically annotated**; in addition, all such languages are **syntactically annotated**.
  * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]).
  * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, i.e. all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/UDPipe]]. Annotation of this release used the following models: TODO!!! arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.))

===== Texts in the corpus =====

InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, token and word counts in 16ud may differ slightly because a different tokenization method was used.

The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually checked. The other texts, grouped in **collections**, are aligned automatically without human intervention.
The collections in the present release include:

  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] (below referred to as **Syndicate**) and [[http://www.voxeurop.eu|VoxEurop]] (formerly **PressEurop**)
  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus (**Acquis**)
  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus (**Europarl**)
  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database (**Subtitles**)
  * Translations of the **Bible**

In texts aligned automatically without manual checking, the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. In particular, texts that have no Czech counterpart are left out. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source. A similar selection was applied to the //Open Subtitles// database, where, as an additional reduction, only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of the available part of InterCorp in release 16ud, published in September 2024, is 4 859 mil. words in the aligned foreign-language texts: 386 mil. words in the core part and 4 473 mil. words in the collections. The number of words in the Czech texts is 125 mil. in the core part and 273 mil.
in the collections (see [[en:cnk:intercorp:historie|Version history]]).

The share of the core and the collections in the corpus is shown in the following charts. The charts show the volumes in millions of words.

[{{:cnk:intercorp:intercorp_wordcounts_v16.png?1000|Setup of the parallel corpus – the core and collections}}] \\ [{{:cnk:intercorp:intercorp_wordcounts2_v16.png?1000|Setup of the parallel corpus – the core}}] \\ [{{:cnk:intercorp:intercorp_wordcounts3_v16.png?1000|Setup of the parallel corpus – collections}}]

===== The corpus in numbers =====

==== Number of texts in the Core ====

^ Language ^^ Number of texts ^ including originals ^
^ ar ^ Arabic | 3 | 1 |
^ be ^ Belarusian | 108 | 14 |
^ bg ^ Bulgarian | 87 | 19 |
^ ca ^ Catalan | 92 | 1 |
^ cs ^ Czech | 1 812 | 368 |
^ da ^ Danish | 93 | 9 |
^ de ^ German | 471 | 163 |
^ en ^ English | 422 | 271 |
^ es ^ Spanish | 355 | 142 |
^ et ^ Estonian | 1 | 0 |
^ fi ^ Finnish | 112 | 36 |
^ fr ^ French | 277 | 126 |
^ hi ^ Hindi | 7 | 2 |
^ hr ^ Croatian | 324 | 37 |
^ hs ^ Upper Sorbian | 13 | 5 |
^ hu ^ Hungarian | 89 | 1 |
^ it ^ Italian | 171 | 26 |
^ ja ^ Japanese | 35 | 15 |
^ lt ^ Lithuanian | 23 | 4 |
^ lv ^ Latvian | 73 | 15 |
^ mk ^ Macedonian | 108 | 4 |
^ nl ^ Dutch | 215 | 52 |
^ no ^ Norwegian | 102 | 23 |
^ pl ^ Polish | 348 | 54 |
^ pt ^ Portuguese | 87 | 24 |
^ rn ^ Romani | 2 | 2 |
^ ro ^ Romanian | 45 | 5 |
^ ru ^ Russian | 160 | 37 |
^ sk ^ Slovak | 165 | 62 |
^ sl ^ Slovene | 73 | 25 |
^ sr ^ Serbian | 148 | 8 |
^ sv ^ Swedish | 232 | 101 |
^ uk ^ Ukrainian | 199 | 8 |
^ zh ^ Chinese | 3 | 3 |
^ **TOTAL** ^ | 6 455 | 1 668 |

In the tables below, the Core part of the corpus is split according to the text type into fiction (**Core-fiction**), non-fiction (**Core-nonfiction**), and miscellaneous (**Core-misc**, including drama, poetry or children's literature).
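The document and token counts on this page can be cross-checked against each other. The following Python sketch is only an illustration and not part of any corpus tooling: it copies the Core document counts from the "Corpus size by collection" table and the document and token counts from the summary table at the top of the page, and tests whether they agree. All variable names are hypothetical.

```python
# Illustrative consistency check; all figures are copied from the tables on this page.
# Variable names are hypothetical and not part of any InterCorp tool.

# Number of documents per Core text type ("Corpus size by collection").
core_docs_by_type = {"Core-fiction": 5_879, "Core-misc": 226, "Core-nonfiction": 350}

# Number of Core documents from the summary table (Czech and other languages).
core_docs_czech = 1_812
core_docs_other = 4_643

# Token counts from the summary table, in this order:
# Czech core, Czech collections, other core, other collections.
tokens = [154_391_397, 362_409_841, 461_601_109, 5_732_688_636]

# Total number of tokens in thousands ("Corpus size by collection", TOTAL row).
tokens_total_thousands = 6_711_091

# Sum the Core documents and compare with the summary table.
core_docs_total = sum(core_docs_by_type.values())
docs_match = core_docs_total == core_docs_czech + core_docs_other

# Sum the tokens, convert to thousands and compare with the TOTAL row.
tokens_match = round(sum(tokens) / 1000) == tokens_total_thousands

print(core_docs_total, docs_match, tokens_match)
```

The same pattern can be applied to the numbers of texts and sentences, since the summary table lists exact counts while the tables below give sentences, words and tokens in thousands.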
==== Corpus size by collection ==== ^ Collection ^ Number of ^^ Thousands of ^^^ ^::: ^ docs ^ texts ^ sentences ^ words ^ tokens ^ ^Core-fiction| 5 879| 5 879| 37 270| 473 208| 572 187| ^Core-misc| 226| 226| 623| 7 853| 9 424| ^Core-nonfiction| 350| 350| 1 483| 29 450| 34 381| ^Acquis| 22| 380 049| 28 903| 424 874| 531 415| ^Bible| 38| 1 252| 899| 12 050| 14 405| ^Europarl| 21| 1 369 378| 13 709| 276 543| 315 134| ^PressEurop| 70| 69 894| 1 637| 26 964| 31 538| ^Subtitles| 58| 965 557| 793 931| 3 970 273| 5 162 184| ^Syndicate| 162| 39 158| 1 697| 35 385| 40 423| ^TOTAL^ 6 826^ 2 831 743^ 880 152^ 5 256 601^ 6 711 091^ ==== Corpus size by language ==== ^ [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]] ^ Number of ^^ Thousands of ^^^ ^::: ^ docs ^ texts ^ sentences ^ words ^ tokens ^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=af|af]]| 1| 24| 23.0| 134.6| 161.7| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ar|ar]]| 7| 34 629| 28 748.8| 126 614.3| 157 671.0| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=be|be]]| 108| 108| 632.7| 7 126.4| 9 054.9| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bg|bg]]| 90| 97 190| 34 421.2| 194 375.7| 250 957.1| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bn|bn]]| 1| 252| 363.8| 1 517.7| 2 072.1| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=br|br]]| 1| 27| 19.7| 97.4| 145.2| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bs|bs]]| 1| 14 208| 12 165.3| 56 465.9| 75 945.3| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ca|ca]]| 95| 828| 1 201.8| 13 381.4| 15 617.1| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=cs|cs]]| 1 845| 164 425| 61 420.9| 397 352.9| 516 801.2| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=da|da]]| 98| 101 609| 16 583.0| 115 
590.0| 146 193.4|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=de|de]]| 504| 115 755| 23 827.8| 181 773.9| 229 774.0|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=el|el]]| 3| 125 684| 33 174.5| 200 922.9| 254 776.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=en|en]]| 455| 157 490| 54 572.6| 357 080.3| 449 890.9|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eo|eo]]| 1| 46| 48.4| 221.0| 305.4|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=es|es]]| 386| 150 798| 45 280.2| 305 112.0| 388 664.2|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=et|et]]| 4| 100 709| 13 904.0| 80 349.3| 104 726.8|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eu|eu]]| 1| 652| 732.9| 2 999.9| 4 039.0|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]| 1| 6 556| 6 594.8| 32 635.9| 38 097.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]| 117| 116 660| 25 976.1| 123 357.7| 165 696.1|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]| 310| 138 571| 33 957.7| 258 555.1| 315 325.2|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]| 1| 146| 121.7| 622.1| 797.9|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]| 1| 33 935| 27 608.8| 129 458.6| 172 973.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]| 8| 61| 116.6| 832.7| 988.1|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]| 327| 35 447| 30 758.6| 162 943.8| 208 413.5|
^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]| 13| 13| 41.6| 466.3| 586.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]| 95| 125 933| 34 510.0| 178 525.6| 240 411.9|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]| 1| 7| 3.9| 23.5| 30.6|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]| 1| 8 350| 8 112.7| 37 824.9| 49 694.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]| 1| 1 135| 1 497.9| 7 374.2| 9 299.9|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]| 194| 134 401| 33 361.2| 226 224.9| 286 343.4|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]| 37| 2 363| 2 296.7| 16 138.6| 18 020.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]| 1| 204| 198.4| 871.1| 1 179.0|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]| 1| 4| 4.1| 13.9| 19.2|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]| 1| 1 605| 1 641.1| 5 964.3| 7 294.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]| 28| 87 642| 3 622.1| 34 786.3| 45 134.4|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]| 78| 86 356| 3 023.6| 35 425.1| 45 293.5|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]| 109| 3 541| 3 907.8| 23 993.1| 30 898.6|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]| 1| 285| 365.3| 1 258.4| 1 793.5|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]| 1| 1 496| 1 712.1| 7 828.0| 10 573.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]| 1| 8 963| 784.8| 13 805.0| 16 643.6|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]| 232| 132 791| 33 065.4| 233 111.3| 284 402.6|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]| 105| 9 163| 8 344.6| 48 750.2| 61 120.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]| 360| 140 055| 41 282.4| 227 242.6| 300 207.8|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 107| 147 063| 46 510.1| 280 566.2| 355 121.8|
^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 2| 2| 1.7| 13.6| 17.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 55| 102 904| 39 561.2| 235 702.3| 295 301.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]| 184| 32 839| 22 985.2| 122 130.4| 163 120.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| 1| 499| 522.5| 2 313.4| 3 021.8|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]| 170| 94 585| 10 080.0| 74 862.7| 95 881.0|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sl|sl]]| 76| 104 460| 20 501.3| 118 457.1| 155 788.9|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sq|sq]]| 1| 1 575| 1 769.0| 9 171.4| 12 098.4|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sr|sr]]| 149| 38 177| 32 117.7| 165 130.2| 211 727.6|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sv|sv]]| 237| 104 739| 19 113.9| 135 088.4| 164 715.5|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ta|ta]]| 1| 20| 29.4| 104.0| 141.8|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=te|te]]| 1| 18| 26.0| 96.0| 127.1|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=th|th]]| 1| 3 932| 3 457.0| 5 626.0| 7 288.3|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tl|tl]]| 1| 5| 8.0| 37.0| 52.7|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tr|tr]]| 1| 44 015| 35 975.7| 147 635.3| 199 108.2|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=uk|uk]]| 202| 1 271| 2 138.0| 19 225.4| 24 818.3| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ur|ur]]| 1| 19| 27.0| 155.7| 180.8| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=vi|vi]]| 1| 3 468| 3 304.5| 19 281.4| 23 984.0| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=zh|zh]]| 9| 12 035| 11 993.7| 71 855.3| 80 560.0| ^TOTAL^ 6 826^ 2 831 743^ 880 152.2^ 5 256 601.0^ 6 711 091.0^ ==== Corpus size in thousands of words by language and collection ==== ^ [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]] ^ Core-fiction ^ Core-misc ^ Core-nonfiction ^ Acquis ^ Bible ^ Europarl ^ PressEurop ^ Subtitles ^ Syndicate ^ TOTAL ^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=af|af]]| – | – | – | – | – | – | – | 134.6| – ^ 134.6^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ar|ar]]| 28.8| 5.5| – | – | – | – | – | 126 195.5| 384.5^ 126 614.3^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=be|be]]| 7 068.7| 57.7| – | – | – | – | – | – | – ^ 7 126.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bg|bg]]| 7 067.3| – | – | 13 582.3| – | 9 082.0| – | 164 644.1| – ^ 194 375.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bn|bn]]| – | – | – | – | – | – | – | 1 517.7| – ^ 1 517.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=br|br]]| – | – | – | – | – | – | – | 97.4| – ^ 97.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bs|bs]]| – | – | – | – | – | – | – | 56 465.9| – ^ 56 465.9^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ca|ca]]| 9 951.3| 9.7| – | – | 728.2| – | – | 2 692.1| – ^ 13 381.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=cs|cs]]| 113 632.3| 2 637.1| 
8 412.5| 19 188.9| 562.5| 12 918.7| 2 313.3| 232 969.1| 4 718.6^ 397 352.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=da|da]]| 9 460.8| 11.9| 56.0| 20 014.9| 655.2| 13 800.4| – | 71 590.8| – ^ 115 590.0^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=de|de]]| 35 653.3| 1 066.1| 4 037.3| 20 716.9| 725.0| 13 156.2| 2 506.5| 98 808.9| 5 103.7^ 181 773.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=el|el]]| – | – | – | 23 684.5| – | 15 381.7| – | 161 856.7| – ^ 200 922.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=en|en]]| 36 519.3| 778.3| 4 618.7| 23 062.9| 727.6| 15 593.0| 2 663.8| 267 843.8| 5 272.8^ 357 080.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eo|eo]]| – | – | – | – | – | – | – | 221.0| – ^ 221.0^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=es|es]]| 29 664.1| 165.1| 830.9| 26 269.3| – | 16 248.5| 2 857.8| 223 006.0| 6 070.2^ 305 112.0^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=et|et]]| 78.8| – | – | 14 884.2| – | 10 898.7| – | 54 487.7| – ^ 80 349.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eu|eu]]| – | – | – | – | – | – | – | 2 999.9| – ^ 2 999.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]| – | – | – | – | – | – | – | 32 635.9| – ^ 32 635.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]| 6 714.9| 44.4| 200.5| 15 264.2| 542.6| 10 109.3| – | 90 481.8| – ^ 123 357.7^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]| 20 454.4| 194.3| 3 687.5| 26 298.4| 762.6| 17 186.4| 3 044.3| 181 033.4| 5 893.7^ 258 555.1^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]| – | – | – | – | – | – | – | 622.1| – ^ 622.1^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]| – | – | – | – | – | – | – | 129 458.6| – ^ 129 458.6^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]| 402.8| – | – | – | – | – | – | 429.9| – ^ 832.7^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]| 22 763.6| 242.6| 1 523.4| – | 569.9| – | – | 137 844.3| – ^ 162 943.8^
^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]| 405.3| 36.6| 24.4| – | – | – | – | – | – ^ 466.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]| 6 890.1| 28.9| – | 17 851.3| – | 12 187.9| – | 141 559.0| 8.4^ 178 525.6^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]| – | – | – | – | – | – | – | 23.5| – ^ 23.5^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]| – | – | – | – | – | – | – | 37 824.9| – ^ 37 824.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]| – | – | – | – | – | – | – | 7 374.2| – ^ 7 374.2^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]| 17 435.8| 50.6| 647.8| 23 892.0| 685.2| 15 511.4| 2 750.7| 163 859.9| 1 391.5^ 226 224.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]| 3 766.7| 64.9| 163.1| – | – | – | – | 12 141.5| 2.5^ 16 138.6^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]| – | – | – | – | – | – | – | 871.1| – ^ 871.1^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]| – | – | – | – | – | – | – | 13.9| – ^ 13.9^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]| – | – | – | – | – | – | – | 5 964.3| – ^ 5 964.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]| 669.1| 7.2| 17.4| 17 175.1| 471.2| 11 198.5| – | 5 247.7| – ^ 34 786.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]| 3 207.6| 362.1| 66.9| 17 519.4| 536.7| 11 682.0| – | 2 050.4| – ^ 35 425.1^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]| 8 794.5| 86.5| – | – | – | – | – | 15 112.0| – ^ 23 993.1^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]| – | – | – | – | – | – | – | 1 258.4| – ^ 1 258.4^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]| – | – | – | – | – | – | – | 7 828.0| – ^ 7 828.0^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]| – | – | – | 13 805.0| – | – | – | – | – ^ 13 805.0^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]| 17 229.8| 356.4| 1 193.5| 23 401.1| 716.8| 15 555.9| 2 952.8| 170 892.9| 812.1^ 233 111.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]| 7 690.7| 138.1| 392.0| – | 723.9| – | – | 39 805.6| – ^ 48 750.2^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]| 27 056.2| 283.2| 754.2| 19 482.9| 576.1| 12 662.8| 2 367.5| 164 059.8| – ^ 227 242.6^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 7 204.0| 81.3| – | 24 385.0| 706.2| 15 188.4| 2 782.5| 229 480.2| 738.5^ 280 566.2^
^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 8.4| 5.2| – | – | – | – | – | – | – ^ 13.6^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 4 132.6| 64.1| – | 8 043.5| – | 9 426.4| 2 725.2| 211 310.4| – ^ 235 702.3^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]| 11 757.6| 143.8| 518.7| – | 565.5| – | – | 104 831.9| 4 312.8^ 122 130.4^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| – | – | – | – | – | – | – | 2 313.4| – ^ 2 313.4^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]| 7 626.6| 402.2| 558.0| 18 398.8| 560.8| 12 727.0| – | 34 589.4| – ^ 74 862.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sl|sl]]| 4 611.2| 6.1| 22.4| 18 510.4| – | 12 249.8| – | 83 057.1| – ^ 118 457.1^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sq|sq]]| – | – | – | – | – | – | – | 9 171.4| – ^ 9 171.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sr|sr]]| 12 556.0| 29.3| 119.3| – | – | – | – | 152 425.6| – ^ 165 130.2^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sv|sv]]| 18 011.7| 454.8| 1 273.0| 19 443.0| 637.9| 13 777.6| – | 81 490.5| – ^ 135 088.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ta|ta]]| – | – | – | – | – | – | – | 104.0| – ^ 104.0^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=te|te]]| – | – | – | – | – | – | – | 96.0| – ^ 96.0^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=th|th]]| – | – | – | – | – | – | – | 5 626.0| – ^ 5 626.0^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tl|tl]]| – | – | – | – | – | – | – | 37.0| – ^ 37.0^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tr|tr]]| – | – | – | – | – | – | – | 147 635.3| – ^ 147 635.3^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=uk|uk]]| 14 478.3| 38.9| 333.0| – | 596.1| – | – | 3 779.0| – ^ 19 225.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ur|ur]]| – | – | – | – | – | – | – | 155.7| – ^ 155.7^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=vi|vi]]| – | – | – | – | – | – | – | 19 281.4| – ^ 19 281.4^ ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=zh|zh]]| 215.4| – | – | – | – | – | – | 70 963.9| 675.9^ 71 855.3^ ^TOTAL^ 473 
208.2^ 7 852.9^ 29 450.5^ 424 874.2^ 12 050.1^ 276 542.6^ 26 964.4^ 3 970 272.9^ 35 385.2^ 5 256 601.0^

==== Detailed statistics ====

In addition to the corpus size data, the table also includes measures of syntactic complexity and lexical diversity. For languages without linguistic annotation, the table shows only the wordform-based measure of lexical diversity (lexDivWord).

^ [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]] ^ Collection ^ Number of ^^ Thousands of ^^^ [[en:pojmy:lexikalni_bohatost|Lexical diversity]] ^^ [[en:pojmy:syntakticka_komplexita|Syntactic complexity]] (average) ^^^^^^
^::: ^::: ^ docs ^ texts ^ sentences ^ words ^ tokens ^ lexDivWord ^ lexDivLemma ^ sLength ^ subRatio ^ maxTreeDepth ^ maxNPLength ^ maxNPDepth ^ mdd ^
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=af|af]]|Subtitles| 1| 24| 23.0| 134.6| 161.7| 406.4| 347.2| 5.887| 1.093| 0.095| 2.377| 0.811| 2.251|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ar|ar]]|Core-fiction| 2| 2| 2.1| 28.8| 35.6| 620.3| 576.6| 13.830| 2.712| 1.310| 5.293| 2.016| 2.817|
^:::|Core-misc| 1| 1| 1.3| 5.5| 7.4| 451.4| 421.4| 4.150| 1.330| 0.290| 1.870| 0.840| 2.010|
^:::|Subtitles| 1| 34 193| 28 726.4| 126 195.5| 157 188.9| 592.8| 557.3| 4.421| 1.338| 0.336| 2.216| 0.986| 1.678|
^:::|Syndicate| 3| 433| 19.0| 384.5| 439.0| 622.7| 560.3| 20.513| 2.485| 1.312| 11.036| 3.940| 2.405|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=be|be]]|Core-fiction| 104| 104| 625.1| 7 068.7| 8 978.9| 615.4| 492.7| 11.583| 1.865| 0.804| 4.122| 1.436| 2.316|
^:::|Core-misc| 4| 4| 7.6| 57.7| 76.0| 556.2| 425.6| 7.608| 1.672| 0.605| 2.870| 1.002| 2.254|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bg|bg]]|Core-fiction| 87| 87| 559.6| 7 067.3| 8 597.7| 548.3| 439.5| 13.125| 1.728| 0.732| 4.255| 1.532| 2.497|
^:::|Acquis| 1| 10 846| 862.3| 13 582.3| 16 991.2| 392.4| 306.3| 18.073| 1.801| 0.514| 9.389| 2.805| 3.265|
^:::|Europarl| 1| 45 271| 408.3| 9 082.0| 10 379.8| 498.4| 386.3| 23.014| 2.538| 1.263| 10.961| 3.402| 2.581|
^:::|Subtitles| 1| 40 986| 32 591.1| 164 644.1| 214 988.4| 518.2| 384.6| 5.089| 1.336| 0.322| 1.861| 0.706| 1.931|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bn|bn]]|Subtitles| 1| 252| 363.8| 1 517.7| 2 072.1| 419.4| – | – | – | – | – | – | – |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=br|br]]|Subtitles| 1| 27| 19.7| 97.4| 145.2| 363.5| – | – | – | – | – | – | – |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=bs|bs]]|Subtitles| 1| 14 208| 12 165.3| 56 465.9| 75 945.3| 450.2| – | – | – | – | – | – | – |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ca|ca]]|Core-fiction| 91| 91| 678.0| 9 951.3| 11 363.4| 471.6| 375.2| 15.579| 2.140| 0.962| 6.099| 1.920| 2.551|
^:::|Core-misc| 1| 1| 0.7| 9.7| 11.2| 463.7| 362.5| 14.300| 2.040| 0.930| 5.850| 1.880| 2.520|
^:::|Bible| 2| 66| 50.3| 728.2| 839.4| 405.3| 308.0| 15.729| 2.056| 0.912| 6.460| 2.103| 2.602|
^:::|Subtitles| 1| 670| 472.8| 2 692.1| 3 403.2| 487.0| 346.8| 5.726| 1.379| 0.352| 2.617| 0.926| 2.028|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=cs|cs]]|Core-fiction| 1 629| 1 629| 9 979.9| 113 632.3| 141 075.8| 629.8| 484.2| 11.722| 1.702| 0.723| 4.078| 1.459| 2.486|
^:::|Core-nonfict| 113| 113| 488.9| 8 412.5| 10 107.3| 649.3| 501.8| 18.099| 2.107| 1.004| 8.159| 2.685| 2.607|
^:::|Core-misc| 70| 70| 222.5| 2 637.1| 3 208.3| 639.0| 492.3| 12.264| 1.721| 0.704| 5.105| 1.778| 2.412|
^:::|Acquis| 1| 19 269| 1 351.5| 19 188.9| 25 140.4| 472.1| 346.5| 16.575| 1.745| 0.536| 9.788| 2.858| 3.025|
^:::|Bible| 2| 66| 51.0| 562.5| 692.9| 537.1| 372.0| 11.907| 1.603| 0.635| 4.125| 1.590| 2.451|
^:::|Europarl| 1| 69 482| 685.3| 12 918.7| 15 030.4| 600.9| 435.0| 19.380| 2.428| 1.256| 9.361| 3.180| 2.527|
^:::|PressEurop| 7| 7 060| 170.0| 2 313.3| 2 786.6| 669.3| 522.4|
14.002| 1.895| 0.810| 7.023| 2.498| 2.457| ^:::|Subtitles| 1| 60 619| 48 207.7| 232 969.1| 313 262.9| 589.7| 406.3| 4.866| 1.307| 0.319| 1.862| 0.694| 1.971| ^:::|Syndicate| 21| 6 117| 264.0| 4 718.6| 5 496.6| 655.9| 506.1| 18.410| 2.162| 1.059| 8.528| 2.975| 2.552| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=da|da]]|Core-fiction| 90| 90| 685.3| 9 460.8| 11 273.9| 464.6| 388.7| 14.334| 1.712| 0.694| 4.949| 1.649| 2.514| ^:::|Core-nonfict| 1| 1| 2.7| 56.0| 64.2| 447.6| 364.4| 21.690| 2.250| 1.070| 9.140| 2.900| 2.670| ^:::|Core-misc| 2| 2| 0.8| 11.9| 14.2| 441.6| 363.4| 14.515| 1.714| 0.728| 5.350| 1.836| 2.466| ^:::|Acquis| 1| 18 263| 1 566.7| 20 014.9| 25 402.6| 395.0| 333.1| 14.462| 1.647| 0.485| 8.314| 2.491| 2.762| ^:::|Bible| 2| 66| 46.1| 655.2| 782.3| 389.8| 318.7| 18.349| 1.970| 0.843| 5.542| 1.828| 2.811| ^:::|Europarl| 1| 67 202| 721.6| 13 800.4| 15 775.5| 448.2| 376.6| 19.372| 2.025| 0.910| 9.165| 2.947| 2.597| ^:::|Subtitles| 1| 15 985| 13 559.9| 71 590.8| 92 880.6| 438.1| 346.4| 5.338| 1.184| 0.190| 1.985| 0.701| 1.925| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=de|de]]|Core-fiction| 412| 412| 2 603.0| 35 653.3| 43 380.5| 515.3| 421.1| 14.176| 1.775| 0.702| 4.819| 1.458| 3.095| ^:::|Core-nonfict| 43| 43| 205.6| 4 037.3| 4 754.4| 525.3| 434.8| 20.302| 2.015| 0.862| 8.836| 2.456| 3.384| ^:::|Core-misc| 16| 16| 63.4| 1 066.1| 1 255.8| 515.6| 425.5| 17.694| 1.942| 0.817| 7.345| 2.138| 3.219| ^:::|Acquis| 1| 18 782| 1 451.4| 20 716.9| 26 206.7| 407.9| 343.3| 16.124| 1.506| 0.388| 9.197| 2.496| 3.519| ^:::|Bible| 2| 66| 49.2| 725.0| 854.0| 395.2| 302.4| 15.637| 1.648| 0.657| 5.263| 1.737| 2.998| ^:::|Europarl| 1| 62 391| 661.2| 13 156.2| 15 169.0| 487.1| 396.8| 20.448| 2.074| 0.914| 9.361| 2.646| 3.473| ^:::|PressEurop| 7| 6 909| 175.9| 2 506.5| 3 013.6| 545.0| 456.2| 14.623| 1.702| 0.621| 6.859| 2.124| 3.123| ^:::|Subtitles| 1| 21 322| 18 354.4| 98 808.9| 129 234.7| 489.6| 380.9| 5.414| 1.240| 
0.231| 2.119| 0.712| 2.271| ^:::|Syndicate| 21| 5 814| 263.6| 5 103.7| 5 905.3| 541.1| 453.0| 19.817| 2.000| 0.867| 8.766| 2.590| 3.380| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=el|el]]|Acquis| 1| 18 904| 1 432.0| 23 684.5| 28 955.7| 409.0| 313.2| 17.722| 1.884| 0.707| 10.688| 2.957| 2.690| ^:::|Europarl| 1| 68 069| 623.6| 15 381.7| 17 233.2| 488.5| 366.9| 25.498| 2.664| 1.379| 12.485| 3.413| 2.682| ^:::|Subtitles| 1| 38 711| 31 118.9| 161 856.7| 208 587.8| 516.7| 376.8| 6.335| 1.613| 0.519| 2.566| 0.881| 2.083| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=en|en]]|Core-fiction| 366| 366| 2 701.0| 36 519.3| 43 557.4| 466.2| 403.2| 14.159| 2.107| 0.945| 5.371| 1.689| 2.576| ^:::|Core-nonfict| 39| 39| 216.2| 4 618.7| 5 302.9| 466.7| 412.4| 22.976| 2.623| 1.292| 10.373| 2.893| 2.793| ^:::|Core-misc| 17| 17| 53.4| 778.3| 905.9| 455.8| 393.7| 15.091| 2.160| 0.967| 6.561| 1.987| 2.557| ^:::|Acquis| 1| 18 930| 1 327.2| 23 062.9| 28 075.3| 346.1| 307.3| 20.073| 2.193| 0.806| 11.086| 2.912| 3.176| ^:::|Bible| 2| 66| 47.5| 727.6| 843.4| 354.0| 296.2| 17.458| 2.166| 1.051| 6.271| 2.125| 2.608| ^:::|Europarl| 1| 69 283| 680.9| 15 593.0| 17 455.0| 411.9| 362.9| 23.743| 2.692| 1.402| 11.274| 3.135| 2.736| ^:::|PressEurop| 7| 7 019| 152.5| 2 663.8| 3 107.7| 485.4| 431.4| 18.016| 2.286| 1.033| 8.828| 2.614| 2.689| ^:::|Subtitles| 1| 55 657| 49 130.9| 267 843.8| 344 553.0| 445.1| 362.4| 5.491| 1.401| 0.372| 2.273| 0.811| 2.067| ^:::|Syndicate| 21| 6 113| 263.1| 5 272.8| 6 090.3| 494.2| 438.7| 20.792| 2.447| 1.186| 9.516| 2.843| 2.733| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eo|eo]]|Subtitles| 1| 46| 48.4| 221.0| 305.4| 384.4| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=es|es]]|Core-fiction| 338| 338| 1 981.3| 29 664.1| 34 294.8| 495.7| 400.3| 15.586| 2.176| 0.974| 6.243| 1.919| 2.574| ^:::|Core-nonfict| 10| 10| 29.6| 830.9| 
932.4| 446.5| 361.9| 29.055| 2.939| 1.456| 13.399| 3.468| 2.797| ^:::|Core-misc| 7| 7| 15.0| 165.1| 198.8| 475.1| 370.6| 11.674| 1.781| 0.662| 4.887| 1.575| 2.382| ^:::|Acquis| 1| 19 056| 1 333.1| 26 269.3| 31 277.0| 348.0| 290.7| 22.339| 1.851| 0.588| 12.954| 3.099| 3.098| ^:::|Europarl| 1| 67 754| 660.7| 16 248.5| 18 032.0| 437.6| 353.4| 25.496| 2.614| 1.350| 12.798| 3.348| 2.618| ^:::|PressEurop| 7| 6 891| 154.6| 2 857.8| 3 268.7| 478.0| 399.9| 18.995| 2.144| 0.940| 9.483| 2.729| 2.567| ^:::|Subtitles| 1| 50 705| 40 849.5| 223 006.0| 293 901.1| 498.6| 355.5| 5.499| 1.404| 0.373| 2.378| 0.862| 1.972| ^:::|Syndicate| 21| 6 037| 256.4| 6 070.2| 6 759.4| 462.1| 384.1| 24.411| 2.437| 1.189| 11.558| 3.194| 2.675| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=et|et]]|Core-fiction| 1| 1| 6.7| 78.8| 96.1| 626.8| 478.0| 11.790| 2.020| 0.920| 4.200| 1.540| 2.530| ^:::|Acquis| 1| 18 727| 1 349.8| 14 884.2| 19 414.5| 543.8| 404.0| 13.084| 2.744| 0.961| 6.654| 2.304| 2.972| ^:::|Europarl| 1| 68 478| 704.3| 10 898.7| 12 761.7| 635.2| 463.0| 15.935| 2.687| 1.347| 7.271| 2.669| 2.517| ^:::|Subtitles| 1| 13 503| 11 843.3| 54 487.7| 72 454.4| 575.2| 386.4| 4.625| 1.284| 0.281| 1.616| 0.600| 1.967| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=eu|eu]]|Subtitles| 1| 652| 732.9| 2 999.9| 4 039.0| 600.9| 401.1| 4.112| 1.280| 0.265| 1.371| 0.522| 1.745| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]|Subtitles| 1| 6 556| 6 594.8| 32 635.9| 38 097.3| 520.5| 472.5| 4.973| 1.368| 0.338| 2.363| 0.974| 2.301| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|Core-fiction| 106| 106| 661.7| 6 714.9| 8 221.3| 683.9| 507.1| 10.287| 1.844| 0.806| 3.437| 1.295| 2.279| ^:::|Core-nonfict| 4| 4| 14.4| 200.5| 237.0| 685.3| 489.0| 14.336| 2.401| 1.208| 5.977| 2.378| 2.435| ^:::|Core-misc| 2| 2| 3.5| 44.4| 52.2| 733.0| 532.9| 12.820| 2.148| 1.051| 4.791| 1.821| 2.385| 
^:::|Acquis| 1| 18 563| 1 310.5| 15 264.2| 19 702.1| 556.9| 380.4| 13.209| 2.369| 0.886| 6.990| 2.588| 2.647| ^:::|Bible| 2| 66| 48.0| 542.6| 675.3| 529.0| 351.4| 13.324| 1.911| 0.871| 4.231| 1.534| 2.511| ^:::|Europarl| 1| 67 019| 675.6| 10 109.3| 11 838.6| 670.8| 462.7| 15.260| 2.483| 1.242| 6.924| 2.670| 2.395| ^:::|Subtitles| 1| 30 900| 23 262.2| 90 481.8| 124 969.7| 666.5| 444.7| 3.909| 1.244| 0.242| 1.404| 0.513| 1.689| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]|Core-fiction| 230| 230| 1 277.5| 20 454.4| 23 802.5| 471.0| 377.5| 16.762| 2.156| 0.998| 6.617| 1.994| 2.685| ^:::|Core-nonfict| 37| 37| 152.5| 3 687.5| 4 206.8| 456.2| 373.9| 26.628| 2.938| 1.451| 12.424| 3.202| 2.807| ^:::|Core-misc| 10| 10| 20.0| 194.3| 229.5| 443.7| 336.9| 9.973| 1.703| 0.614| 4.205| 1.321| 2.427| ^:::|Acquis| 1| 19 057| 1 338.5| 26 298.4| 31 764.2| 353.5| 289.2| 22.521| 2.416| 0.946| 13.347| 3.212| 3.144| ^:::|Bible| 2| 66| 50.6| 762.6| 886.3| 384.9| 285.9| 17.822| 2.060| 0.893| 6.743| 2.171| 2.758| ^:::|Europarl| 1| 68 220| 677.7| 17 186.4| 18 984.0| 425.6| 338.2| 26.070| 2.866| 1.565| 13.013| 3.423| 2.638| ^:::|PressEurop| 7| 7 025| 163.8| 3 044.3| 3 510.4| 476.4| 396.4| 19.097| 2.279| 1.036| 9.836| 2.826| 2.606| ^:::|Subtitles| 1| 38 341| 30 038.8| 181 033.4| 225 399.3| 453.5| 325.6| 6.061| 1.405| 0.394| 2.563| 0.926| 2.031| ^:::|Syndicate| 21| 5 585| 238.3| 5 893.7| 6 542.1| 457.8| 379.9| 25.332| 2.742| 1.410| 12.251| 3.308| 2.698| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]|Subtitles| 1| 146| 121.7| 622.1| 797.9| 529.5| 411.1| 5.144| 1.339| 0.323| 2.602| 0.940| 1.958| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]|Subtitles| 1| 33 935| 27 608.8| 129 458.6| 172 973.7| 549.8| 479.7| 4.747| 1.370| 0.346| 2.637| 1.064| 1.918|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]|Core-fiction| 7| 7| 35.6| 402.8| 462.1| 449.6| 348.6| 11.386| 1.586| 0.524| 4.610| 1.625| 2.692| ^:::|Subtitles| 1| 54| 81.0| 429.9| 526.0| 401.1| 324.0| 5.336| 1.156| 0.146| 2.358| 0.838| 2.190| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]|Core-fiction| 292| 292| 1 822.5| 22 763.6| 27 339.1| 591.4| 460.5| 12.720| 1.902| 0.837| 4.247| 1.472| 2.623| ^:::|Core-nonfict| 22| 22| 72.0| 1 523.4| 1 742.7| 600.4| 451.3| 21.425| 2.682| 1.368| 9.323| 2.963| 2.718| ^:::|Core-misc| 10| 10| 19.6| 242.6| 298.0| 570.4| 431.2| 12.616| 2.062| 0.909| 4.645| 1.536| 2.570| ^:::|Bible| 2| 66| 48.1| 569.9| 686.1| 519.1| 381.3| 12.989| 1.855| 0.773| 4.359| 1.599| 2.504| ^:::|Subtitles| 1| 35 057| 28 796.3| 137 844.3| 178 347.5| 566.6| 421.2| 4.795| 1.392| 0.373| 1.814| 0.681| 1.929| ^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]|Core-fiction| 8| 8| 36.2| 405.3| 512.1| 503.3| – | – | – | – | – | – | – | ^:::|Core-nonfict| 1| 1| 1.9| 24.4| 29.5| 571.5| – | – | – | – | – | – | – | ^:::|Core-misc| 4| 4| 3.5| 36.6| 44.7| 513.6| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]|Core-fiction| 87| 87| 573.2| 6 890.1| 8 657.7| 603.6| 499.1| 12.888| 1.709| 0.706| 3.698| 1.367| 2.759| ^:::|Core-misc| 2| 2| 6.1| 28.9| 39.5| 568.2| 457.3| 4.817| 1.269| 0.254| 1.817| 0.650| 2.100| ^:::|Acquis| 1| 18 539| 1 290.2| 17 851.3| 22 815.8| 485.6| 385.2| 16.126| 1.825| 0.515| 7.743| 2.832| 3.421| ^:::|Europarl| 1| 66 229| 677.3| 12 187.9| 14 266.5| 591.1| 469.4| 18.625| 2.202| 1.013| 7.465| 2.741| 2.799| ^:::|Subtitles| 1| 41 067| 31 962.7| 141 559.0| 194 622.6| 586.7| 466.0| 4.609| 1.261| 0.268| 1.644| 0.627| 1.859| ^:::|Syndicate| 3| 9| 0.5| 8.4| 9.8| 598.4| 481.5| 16.869| 2.080| 0.933| 6.351| 2.436| 2.685| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]|Subtitles| 1| 7| 3.9| 
23.5| 30.6| 601.7| 445.9| 6.057| 1.375| 0.382| 2.179| 0.860| 2.075| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]|Subtitles| 1| 8 350| 8 112.7| 37 824.9| 49 694.7| 475.7| 401.9| 4.699| 1.344| 0.317| 2.343| 0.911| 1.742| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]|Subtitles| 1| 1 135| 1 497.9| 7 374.2| 9 299.9| 503.5| 369.4| 4.951| 1.233| 0.233| 1.913| 0.699| 1.841| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]|Core-fiction| 164| 164| 1 205.7| 17 435.8| 20 566.1| 529.6| 414.7| 15.157| 2.092| 0.973| 6.471| 1.970| 2.578| ^:::|Core-nonfict| 5| 5| 22.4| 647.8| 738.9| 486.6| 389.2| 31.080| 3.082| 1.564| 16.597| 3.877| 2.931| ^:::|Core-misc| 2| 2| 4.0| 50.6| 61.7| 505.7| 378.9| 14.351| 2.299| 1.040| 5.722| 1.817| 2.633| ^:::|Acquis| 1| 18 893| 1 345.7| 23 892.0| 29 413.1| 390.7| 306.5| 20.391| 2.112| 0.766| 13.152| 3.242| 3.156| ^:::|Bible| 2| 65| 47.3| 685.2| 806.6| 421.8| 317.0| 16.561| 1.969| 0.881| 6.739| 2.168| 2.723| ^:::|Europarl| 1| 69 139| 650.3| 15 511.4| 17 235.8| 486.8| 381.6| 24.916| 2.686| 1.409| 13.989| 3.644| 2.603| ^:::|PressEurop| 7| 7 024| 156.3| 2 750.7| 3 155.3| 524.2| 421.2| 18.041| 2.121| 0.943| 9.814| 2.803| 2.553| ^:::|Subtitles| 1| 37 721| 29 870.5| 163 859.9| 212 801.7| 532.8| 384.1| 5.518| 1.325| 0.319| 2.535| 0.903| 2.008| ^:::|Syndicate| 11| 1 388| 58.9| 1 391.5| 1 564.2| 504.3| 403.4| 24.516| 2.535| 1.261| 12.837| 3.463| 2.682| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]|Core-fiction| 33| 33| 201.2| 3 766.7| 4 262.0| 365.4| 336.3| 18.928| 3.094| 1.432| 8.630| 2.666| 2.697| ^:::|Core-nonfict| 1| 1| 7.0| 163.1| 184.2| 361.2| 334.5| 23.420| 3.540| 1.720| 11.650| 3.490| 2.810| ^:::|Core-misc| 1| 1| 2.1| 64.9| 75.9| 280.9| 257.8| 31.520| 4.300| 1.990| 16.990| 4.490| 3.160| ^:::|Subtitles| 1| 2 326| 2 086.3| 12 141.5| 13 495.4| 381.9| 348.5| 6.212| 1.417| 0.375| 3.221| 1.312| 1.909| ^:::|Syndicate| 
1| 2| 0.1| 2.5| 2.9| 385.4| 372.0| 38.705| 4.330| 2.015| 20.881| 4.923| 3.215| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]|Subtitles| 1| 204| 198.4| 871.1| 1 179.0| 380.8| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]|Subtitles| 1| 4| 4.1| 13.9| 19.2| 657.7| 607.3| 3.389| 1.243| 0.247| 1.761| 0.892| 1.603| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]|Subtitles| 1| 1 605| 1 641.1| 5 964.3| 7 294.3| 690.6| 686.3| 3.682| 1.529| 0.457| 1.146| 0.440| 1.785| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]|Core-fiction| 20| 20| 61.4| 669.1| 842.6| 685.2| 545.5| 11.223| 1.901| 0.813| 3.957| 1.479| 2.487| ^:::|Core-nonfict| 1| 1| 1.3| 17.4| 23.1| 657.2| 492.4| 14.180| 2.190| 0.930| 6.670| 2.230| 2.600| ^:::|Core-misc| 2| 2| 1.2| 7.2| 9.0| 764.9| 628.4| 6.184| 1.430| 0.409| 2.921| 1.136| 1.887| ^:::|Acquis| 1| 18 809| 1 477.8| 17 175.1| 22 835.1| 515.0| 346.3| 13.456| 2.504| 0.938| 6.985| 2.531| 2.867| ^:::|Bible| 2| 66| 46.1| 471.2| 596.3| 550.9| 439.8| 10.822| 1.668| 0.706| 3.866| 1.500| 2.281| ^:::|Europarl| 1| 67 719| 688.5| 11 198.5| 13 475.2| 627.2| 441.4| 16.816| 3.016| 1.607| 7.683| 2.906| 2.469| ^:::|Subtitles| 1| 1 025| 1 345.9| 5 247.7| 7 353.0| 624.6| 461.7| 3.923| 1.278| 0.286| 1.552| 0.569| 1.760| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]|Core-fiction| 65| 65| 291.9| 3 207.6| 4 032.0| 639.2| 494.8| 11.339| 1.758| 0.756| 3.605| 1.343| 2.563| ^:::|Core-nonfict| 1| 1| 3.3| 66.9| 89.0| 680.0| 541.3| 21.810| 2.310| 1.070| 9.480| 2.800| 2.910| ^:::|Core-misc| 7| 7| 30.0| 362.1| 440.5| 688.1| 543.4| 12.147| 1.759| 0.776| 4.397| 1.668| 2.337| ^:::|Acquis| 1| 18 348| 1 486.3| 17 519.4| 23 361.6| 490.0| 340.4| 13.790| 2.296| 0.831| 7.109| 2.492| 2.865| ^:::|Bible| 2| 66| 40.1| 536.7| 671.7| 495.5| 343.1| 13.645| 1.663| 0.754| 4.180| 1.602| 2.658| 
^:::|Europarl| 1| 67 482| 683.7| 11 682.0| 13 896.8| 590.6| 416.3| 17.627| 2.434| 1.255| 7.884| 2.853| 2.497| ^:::|Subtitles| 1| 387| 488.4| 2 050.4| 2 801.9| 592.2| 425.9| 4.227| 1.269| 0.264| 1.568| 0.592| 1.811| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]|Core-fiction| 104| 104| 694.6| 8 794.5| 10 571.7| 464.3| – | – | – | – | – | – | – | ^:::|Core-misc| 4| 4| 12.1| 86.5| 109.3| 422.0| – | – | – | – | – | – | – | ^:::|Subtitles| 1| 3 433| 3 201.0| 15 112.0| 20 217.5| 412.3| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]|Subtitles| 1| 285| 365.3| 1 258.4| 1 793.5| 489.8| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]|Subtitles| 1| 1 496| 1 712.1| 7 828.0| 10 573.3| 371.2| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]|Acquis| 1| 8 963| 784.8| 13 805.0| 16 643.6| 373.4| 1.0| 20.381| 2.683| 1.141| 11.437| 3.347| 2.933| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]|Core-fiction| 194| 194| 1 152.0| 17 229.8| 19 889.7| 466.9| 403.0| 15.424| 2.149| 0.959| 5.255| 1.558| 3.176| ^:::|Core-nonfict| 12| 12| 50.6| 1 193.5| 1 336.2| 449.2| 391.5| 25.698| 2.909| 1.375| 10.658| 2.784| 3.453| ^:::|Core-misc| 9| 9| 27.2| 356.4| 413.4| 463.6| 395.9| 13.450| 1.993| 0.860| 5.102| 1.550| 2.981| ^:::|Acquis| 1| 18 975| 1 483.9| 23 401.1| 28 140.1| 356.2| 317.3| 18.005| 2.217| 0.766| 9.491| 2.375| 3.553| ^:::|Bible| 2| 66| 45.8| 716.8| 821.3| 386.8| 326.1| 17.940| 2.264| 1.067| 5.942| 1.936| 3.042| ^:::|Europarl| 1| 67 139| 693.8| 15 555.9| 17 074.8| 425.7| 371.8| 22.952| 2.500| 1.217| 10.132| 2.744| 3.274| ^:::|PressEurop| 7| 7 009| 175.4| 2 952.8| 3 337.6| 483.7| 429.2| 17.267| 2.172| 0.967| 7.879| 2.300| 3.107| ^:::|Subtitles| 1| 38 546| 29 399.1| 170 892.9| 212 492.5| 444.4| 354.8| 5.847| 1.485| 0.439| 2.180| 0.728| 
2.291| ^:::|Syndicate| 5| 841| 37.6| 812.1| 897.2| 477.2| 422.9| 22.671| 2.570| 1.225| 10.005| 2.810| 3.282| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]|Core-fiction| 91| 91| 558.7| 7 690.7| 9 028.3| 461.6| 383.6| 14.346| 1.842| 0.849| 4.850| 1.579| 2.599| ^:::|Core-nonfict| 5| 5| 17.0| 392.0| 439.5| 467.7| 381.2| 24.705| 2.716| 1.366| 11.077| 3.100| 2.842| ^:::|Core-misc| 6| 6| 10.7| 138.1| 163.5| 450.0| 372.8| 14.035| 1.835| 0.794| 5.029| 1.509| 2.619| ^:::|Bible| 2| 66| 55.3| 723.9| 831.4| 364.6| 294.7| 13.099| 1.573| 0.620| 4.645| 1.713| 2.447| ^:::|Subtitles| 1| 8 995| 7 702.8| 39 805.6| 50 657.6| 448.4| 353.0| 5.188| 1.299| 0.298| 1.960| 0.697| 1.917| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]|Core-fiction| 328| 328| 2 400.0| 27 056.2| 33 548.9| 632.2| 499.6| 11.498| 1.896| 0.833| 4.162| 1.523| 2.355| ^:::|Core-nonfict| 11| 11| 36.6| 754.2| 897.7| 613.7| 460.1| 20.825| 2.743| 1.407| 9.385| 3.192| 2.509| ^:::|Core-misc| 9| 9| 24.4| 283.2| 345.5| 622.9| 471.6| 12.263| 1.981| 0.881| 4.978| 1.892| 2.266| ^:::|Acquis| 1| 19 024| 1 657.3| 19 482.9| 24 945.6| 481.4| 350.6| 13.373| 2.035| 0.714| 7.737| 2.681| 2.622| ^:::|Bible| 2| 66| 48.2| 576.1| 712.9| 537.0| 387.8| 12.695| 1.724| 0.727| 4.479| 1.725| 2.397| ^:::|Europarl| 1| 67 443| 713.3| 12 662.8| 14 667.8| 607.5| 447.2| 18.340| 2.643| 1.309| 9.387| 3.283| 2.322| ^:::|PressEurop| 7| 6 999| 166.6| 2 367.5| 2 879.1| 659.8| 520.6| 14.632| 2.143| 0.957| 7.092| 2.645| 2.334| ^:::|Subtitles| 1| 46 175| 36 236.0| 164 059.8| 222 210.4| 602.1| 441.5| 4.556| 1.324| 0.319| 1.855| 0.717| 1.832| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]|Core-fiction| 82| 82| 519.8| 7 204.0| 8 608.5| 511.2| 408.1| 14.436| 2.299| 1.142| 6.372| 2.041| 2.497| ^:::|Core-misc| 5| 5| 6.9| 81.3| 96.0| 495.3| 388.9| 12.461| 2.159| 0.977| 6.238| 1.780| 2.486| ^:::|Acquis| 1| 18 934| 1 356.4| 24 385.0| 29 549.7| 377.3| 305.5| 20.372| 
2.488| 0.967| 12.971| 3.327| 3.020| ^:::|Bible| 2| 66| 54.3| 706.2| 840.4| 380.3| 293.5| 19.149| 2.385| 1.111| 7.620| 2.305| 2.957| ^:::|Europarl| 1| 65 92| 648.7| 15 188.4| 17 127.0| 467.5| 379.1| 24.202| 3.093| 1.726| 13.821| 3.724| 2.591| ^:::|PressEurop| 7| 6 967| 160.9| 2 782.5| 3 286.5| 507.4| 422.0| 17.848| 2.388| 1.150| 10.138| 2.951| 2.536| ^:::|Subtitles| 1| 54 342| 43 730.9| 229 480.2| 294 774.7| 495.5| 360.1| 5.278| 1.449| 0.432| 2.528| 0.955| 1.939| ^:::|Syndicate| 8| 747| 32.4| 738.5| 839.0| 489.9| 405.1| 23.875| 2.980| 1.575| 12.669| 3.544| 2.646| ^[[https://en.wikipedia.org/wiki/Romani_language|rn]]|Core-fiction| 1| 1| 1.1| 8.4| 11.1| 424.3| – | – | – | – | – | – | – | ^:::|Core-misc| 1| 1| 0.7| 5.2| 6.6| 416.4| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]|Core-fiction| 44| 44| 233.3| 4 132.6| 4 833.2| 534.2| 406.3| 18.106| 2.262| 1.146| 6.360| 2.019| 2.604| ^:::|Core-misc| 1| 1| 2.7| 64.1| 74.2| 539.5| 414.1| 23.970| 2.690| 1.500| 10.330| 2.910| 2.680| ^:::|Acquis| 1| 6 318| 650.0| 8 043.5| 9 884.4| 405.3| 301.4| 14.150| 2.221| 0.770| 7.930| 2.544| 2.900| ^:::|Europarl| 1| 44 143| 406.6| 9 426.4| 10 585.4| 499.1| 368.7| 23.966| 2.798| 1.517| 11.591| 3.558| 2.484| ^:::|PressEurop| 7| 6 991| 160.6| 2 725.2| 3 192.6| 546.7| 429.5| 17.486| 2.219| 1.017| 8.508| 2.772| 2.492| ^:::|Subtitles| 1| 45 407| 38 108.1| 211 310.4| 266 731.5| 509.0| 351.2| 5.572| 1.388| 0.383| 2.129| 0.795| 1.954| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]|Core-nonfict| 10| 10| 30.6| 518.7| 625.2| 645.0| 495.9| 17.765| 2.613| 1.223| 8.126| 2.801| 2.603| ^:::|Core-fiction| 144| 144| 1 043.5| 11 757.6| 14 913.7| 633.0| 501.9| 11.643| 1.959| 0.865| 4.203| 1.557| 2.386| ^:::|Core-misc| 6| 6| 12.8| 143.8| 180.7| 633.2| 484.5| 11.439| 1.947| 0.870| 4.378| 1.718| 2.265| ^:::|Bible| 2| 66| 39.0| 565.5| 703.9| 486.6| 346.2| 20.730| 2.746| 1.302| 6.198| 2.121| 2.828| ^:::|Subtitles| 1| 27 195| 21 625.8| 104 831.9| 141 586.8| 574.9| 428.1| 4.878| 1.423| 0.401|
1.930| 0.744| 1.887| ^:::|Syndicate| 21| 5 418| 233.5| 4 312.8| 5 110.5| 637.5| 487.3| 19.037| 2.653| 1.288| 9.232| 3.298| 2.424| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]|Subtitles| 1| 499| 522.5| 2 313.4| 3 021.8| 443.6| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]|Core-fiction| 142| 142| 706.0| 7 626.6| 9 513.5| 617.0| 480.8| 10.845| 1.562| 0.612| 3.503| 1.284| 2.620| ^:::|Core-nonfict| 10| 10| 39.1| 558.0| 687.3| 650.0| 517.1| 14.785| 1.516| 0.547| 6.760| 2.344| 2.518| ^:::|Core-misc| 13| 13| 32.4| 402.2| 496.9| 652.5| 515.7| 12.636| 1.564| 0.555| 5.338| 1.707| 2.493| ^:::|Acquis| 1| 18 302| 1 363.0| 18 398.8| 23 542.1| 482.7| 353.1| 15.458| 1.732| 0.516| 8.677| 2.746| 3.029| ^:::|Bible| 2| 65| 46.9| 560.8| 690.8| 520.0| 373.4| 12.716| 1.615| 0.662| 4.178| 1.576| 2.567| ^:::|Europarl| 1| 67 731| 677.8| 12 727.0| 14 735.3| 595.1| 433.8| 19.150| 2.344| 1.172| 9.020| 3.065| 2.538| ^:::|Subtitles| 1| 8 322| 7 214.8| 34 589.4| 46 215.1| 575.9| 411.5| 4.821| 1.293| 0.295| 1.835| 0.674| 1.975| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sl|sl]]|Core-fiction| 71| 71| 370.5| 4 611.2| 5 686.2| 556.5| 428.7| 12.704| 2.096| 0.857| 4.122| 1.374| 2.641| ^:::|Core-nonfict| 1| 1| 1.1| 22.4| 24.9| 656.4| 528.9| 21.090| 1.980| 0.830| 8.840| 2.930| 2.890| ^:::|Core-misc| 1| 1| 0.7| 6.1| 7.4| 682.1| 585.6| 8.950| 1.720| 0.650| 4.410| 1.720| 2.210| ^:::|Acquis| 1| 17 414| 1 399.2| 18 510.4| 24 069.9| 466.2| 335.6| 15.345| 1.810| 0.580| 8.359| 2.683| 2.841| ^:::|Europarl| 1| 65 366| 649.6| 12 249.8| 14 263.6| 564.3| 405.6| 19.433| 2.551| 1.254| 9.220| 3.066| 2.539| ^:::|Subtitles| 1| 21 607| 18 080.2| 83 057.1| 111 736.8| 568.0| 399.0| 4.620| 1.333| 0.309| 1.726| 0.625| 1.899| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sq|sq]]|Subtitles| 1| 1 575| 1 769.0| 9 171.4| 12 098.4| 395.5| – | – | – | – | – | – | – | 
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sr|sr]]|Core-fiction| 143| 143| 931.6| 12 556.0| 15 029.8| 584.7| 462.0| 13.767| 1.956| 0.898| 4.690| 1.601| 2.638| ^:::|Core-nonfict| 2| 2| 5.9| 119.3| 138.9| 565.0| 417.2| 20.654| 2.876| 1.518| 8.918| 2.889| 2.655| ^:::|Core-misc| 3| 3| 5.0| 29.3| 38.9| 538.0| 411.7| 5.882| 1.394| 0.371| 2.405| 0.906| 2.215| ^:::|Subtitles| 1| 38 029| 31 175.3| 152 425.6| 196 520.1| 561.3| 445.3| 4.901| 1.338| 0.333| 1.905| 0.722| 1.938| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sv|sv]]|Core-fiction| 208| 208| 1 398.8| 18 011.7| 20 456.7| 490.6| 403.5| 13.175| 1.944| 0.848| 4.403| 1.445| 2.501| ^:::|Core-nonfict| 16| 16| 64.9| 1 273.0| 1 403.1| 508.2| 415.4| 19.801| 2.541| 1.288| 7.980| 2.435| 2.683| ^:::|Core-misc| 8| 8| 28.5| 454.8| 512.3| 490.1| 404.4| 16.027| 2.123| 1.026| 5.575| 1.790| 2.561| ^:::|Acquis| 1| 17 133| 1 285.5| 19 443.0| 23 283.7| 402.1| 327.7| 16.286| 1.913| 0.705| 8.700| 2.448| 2.784| ^:::|Bible| 2| 66| 43.9| 637.9| 731.7| 414.2| 323.2| 14.907| 1.947| 0.895| 4.760| 1.703| 2.542| ^:::|Europarl| 1| 67 898| 720.6| 13 777.6| 15 146.8| 461.9| 374.1| 19.313| 2.381| 1.183| 8.221| 2.554| 2.640| ^:::|Subtitles| 1| 19 41| 15 571.7| 81 490.5| 103 181.3| 455.7| 352.1| 5.256| 1.319| 0.303| 1.921| 0.684| 1.921| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ta|ta]]|Subtitles| 1| 20| 29.4| 104.0| 141.8| 511.8| 434.1| 3.562| 1.196| 0.171| 1.673| 0.639| 1.807| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=te|te]]|Subtitles| 1| 18| 26.0| 96.0| 127.1| 496.5| 1.0| 3.806| 1.324| 0.284| 1.746| 0.658| 2.086| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=th|th]]|Subtitles| 1| 3 932| 3 457.0| 5 626.0| 7 288.3| 658.1| – | – | – | – | – | – | – | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tl|tl]]|Subtitles| 1| 5| 8.0| 37.0| 52.7| 344.9| – | – | – | – | – | – | – | 
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=tr|tr]]|Subtitles| 1| 44 015| 35 975.7| 147 635.3| 199 108.2| 670.1| 424.8| 4.133| 1.259| 0.257| 1.929| 0.853| 1.815| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=uk|uk]]|Core-fiction| 192| 192| 1 260.0| 14 478.3| 18 490.6| 626.6| 506.4| 11.923| 2.047| 0.892| 4.187| 1.507| 2.377| ^:::|Core-nonfict| 5| 5| 19.1| 333.0| 416.1| 621.1| 469.6| 19.193| 2.945| 1.432| 8.468| 2.909| 2.517| ^:::|Core-misc| 2| 2| 4.0| 38.9| 50.3| 614.9| 484.8| 9.801| 1.851| 0.774| 3.366| 1.282| 2.254| ^:::|Bible| 2| 66| 41.5| 596.1| 738.1| 475.7| 352.8| 14.784| 1.804| 0.777| 4.921| 1.751| 2.585| ^:::|Subtitles| 1| 1 006| 813.4| 3 779.0| 5 123.2| 571.4| 461.9| 4.684| 1.360| 0.334| 1.853| 0.710| 1.897| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ur|ur]]|Subtitles| 1| 19| 27.0| 155.7| 180.8| 397.6| 344.1| 5.885| 1.204| 0.178| 2.777| 1.098| 2.260| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=vi|vi]]|Subtitles| 1| 3 468| 3 304.5| 19 281.4| 23 984.0| 446.3| 403.8| 5.931| 1.508| 0.458| 2.351| 0.945| 1.849| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=zh|zh]]|Core-fiction| 3| 3| 11.7| 215.4| 253.9| 382.0| 376.8| 18.467| 4.655| 1.684| 4.099| 1.594| 3.435| ^:::|Subtitles| 1| 11 378| 11 952.3| 70 963.9| 79 539.4| 448.9| 439.5| 6.046| 1.689| 0.548| 2.081| 0.791| 2.289| ^:::|Syndicate| 5| 654| 29.7| 675.9| 766.7| 493.8| 489.5| 23.166| 4.110| 1.795| 7.026| 2.391| 3.366| ===== Metadata ===== Metadata such as the text's title, author, or source language are available for most texts as attributes of structural elements such as text or sentence. 
To view the list of such attributes and to select those that should be displayed in the KonText query results, choose the relevant InterCorp 16ud language in the KonText corpus search tool, and then go to ''Structures'' or ''References'' in the ''Corpus-specific settings'' menu.

===== Acknowledgements =====

We are grateful for the possibility to use the following texts and software:

==== Texts: ====

  * The latest (13th corrected) edition of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš.
  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen
  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
  * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server
  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus
  * Proceedings of the European Parliament from the [[http://www.statmt.org/europarl/|EuroParl]] corpus
  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]
  * Short stories in a number of languages [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]] from [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]
  * A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in several languages – with special thanks to Patrick Corness
  * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus
  * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus
  * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]]
  * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]]

==== Pre-processing ====

  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
  * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
  * Sentence splitter for Czech by Pavel Květoň
  * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]]

==== Linguistic annotation ====

  * [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] (thanks to Jana Straková, Milan Straka, Dan Zeman and Martin Popel)

===== How to cite =====

If you publish results based on InterCorp, we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper:

Čermák, F., Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//. Vol. 17, no. 3, p. 411–427 ([[http://utkl.ff.cuni.cz/~rosen/public/mybib_bib.html#cermak:rosen:10|bibtex]], [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at ingentaConnect]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).

For more references see the [[https://www.korpus.cz/biblio|repository of bibliographical items based on the CNC]]. All references to work based on InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.

When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:

Nádvorníková, O., Rosen, A., Šimík, B., Vavřín, M., Zasina, A. J. (2024). //The InterCorp Corpus – Czech((Insert languages actually used.)), version 16ud of ??