cnk:intercorp:verze16ud [2024/10/11 10:29] – [Corpus size in thousands of words by language and collection] alexandrrosen | cnk:intercorp:verze16ud [2024/10/18 20:33] (current) – [References – on the InterCorp corpus with UD annotation] alexandrrosen |
---|
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]| 1| 6 556| 6 594,8| 32 635,9| 38 097,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fa|fa]]| 1| 6 556| 6 594,8| 32 635,9| 38 097,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]| 117| 116 660| 25 976,1| 123 357,7| 165 696,1| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]| 117| 116 660| 25 976,1| 123 357,7| 165 696,1| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]| 310| 138 571| 33 957,7| 258 555,1| 315 325,2| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]| 310| 138 571| 33 957,7| 258 555,1| 315 325,2| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fr|fr]]| 1| 146| 121,7| 622,1| 797,9| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]| 1| 146| 121,7| 622,1| 797,9| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=gl|gl]]| 1| 33 935| 27 608,8| 129 458,6| 172 973,7| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]| 1| 33 935| 27 608,8| 129 458,6| 172 973,7| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=he|he]]| 8| 61| 116,6| 832,7| 988,1| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]| 8| 61| 116,6| 832,7| 988,1| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hi|hi]]| 327| 35 447| 30 758,6| 162 943,8| 208 413,5| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]| 327| 35 447| 30 758,6| 162 943,8| 208 413,5| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hr|hr]]| 13| 13| 41,6| 466,3| 586,3| | ^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]| 13| 13| 41,6| 466,3| 586,3| |
^[[https://en.wikipedia.org/wiki/Upper_Sorbian_language|hs]]| 95| 125 933| 34 510,0| 178 525,6| 240 411,9| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]| 95| 125 933| 34 510,0| 178 525,6| 240 411,9| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hu|hu]]| 1| 7| 3,9| 23,5| 30,6| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]| 1| 7| 3,9| 23,5| 30,6| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=hy|hy]]| 1| 8 350| 8 112,7| 37 824,9| 49 694,7| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]| 1| 8 350| 8 112,7| 37 824,9| 49 694,7| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=id|id]]| 1| 1 135| 1 497,9| 7 374,2| 9 299,9| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]| 1| 1 135| 1 497,9| 7 374,2| 9 299,9| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=is|is]]| 194| 134 401| 33 361,2| 226 224,9| 286 343,4| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]| 194| 134 401| 33 361,2| 226 224,9| 286 343,4| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=it|it]]| 37| 2 363| 2 296,7| 16 138,6| 18 020,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]| 37| 2 363| 2 296,7| 16 138,6| 18 020,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ja|ja]]| 1| 204| 198,4| 871,1| 1 179,0| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]| 1| 204| 198,4| 871,1| 1 179,0| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ka|ka]]| 1| 4| 4,1| 13,9| 19,2| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]| 1| 4| 4,1| 13,9| 19,2| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=kk|kk]]| 1| 1 605| 1 641,1| 5 964,3| 7 294,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]| 1| 1 605| 1 641,1| 5 964,3| 7 294,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ko|ko]]| 28| 87 642| 3 622,1| 34 786,3| 45 134,4| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]| 28| 87 642| 3 622,1| 34 786,3| 45 134,4| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lt|lt]]| 78| 86 356| 3 023,6| 35 425,1| 45 293,5| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]| 78| 86 356| 3 023,6| 35 425,1| 45 293,5| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=lv|lv]]| 109| 3 541| 3 907,8| 23 993,1| 30 898,6| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]| 109| 3 541| 3 907,8| 23 993,1| 30 898,6| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mk|mk]]| 1| 285| 365,3| 1 258,4| 1 793,5| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]| 1| 285| 365,3| 1 258,4| 1 793,5| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ml|ml]]| 1| 1 496| 1 712,1| 7 828,0| 10 573,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]| 1| 1 496| 1 712,1| 7 828,0| 10 573,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ms|ms]]| 1| 8 963| 784,8| 13 805,0| 16 643,6| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]| 1| 8 963| 784,8| 13 805,0| 16 643,6| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=mt|mt]]| 232| 132 791| 33 065,4| 233 111,3| 284 402,6| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]| 232| 132 791| 33 065,4| 233 111,3| 284 402,6| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=nl|nl]]| 105| 9 163| 8 344,6| 48 750,2| 61 120,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]| 105| 9 163| 8 344,6| 48 750,2| 61 120,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no|no]]| 360| 140 055| 41 282,4| 227 242,6| 300 207,8| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]| 360| 140 055| 41 282,4| 227 242,6| 300 207,8| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pl|pl]]| 107| 147 063| 46 510,1| 280 566,2| 355 121,8| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 107| 147 063| 46 510,1| 280 566,2| 355 121,8| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 2| 2| 1,7| 13,6| 17,7| | ^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 2| 2| 1,7| 13,6| 17,7| |
^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 55| 102 904| 39 561,2| 235 702,3| 295 301,3| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ru]]| 55| 102 904| 39 561,2| 235 702,3| 295 301,3| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 184| 32 839| 22 985,2| 122 130,4| 163 120,7| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 184| 32 839| 22 985,2| 122 130,4| 163 120,7| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| 1| 499| 522,5| 2 313,4| 3 021,8| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| 1| 499| 522,5| 2 313,4| 3 021,8| |
==== Corpus size in thousands of words by language and collection ==== | ==== Corpus size in thousands of words by language and collection ==== |
| |
^ [[https://en.wikipedia.org/wiki/ISO_639-1|Lang]] ^ Core-fiction ^ Core-misc ^ Core-nonfiction ^ Acquis ^ Bible ^ Europarl ^ PressEurop ^ Subtitles ^ Syndicate ^ TOTAL ^ | ^ [[https://en.wikipedia.org/wiki/ISO_639-1|Language]] ^ Core-fiction ^ Core-misc ^ Core-nonfiction ^ Acquis ^ Bible ^ Europarl ^ PressEurop ^ Subtitles ^ Syndicate ^ TOTAL ^ |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=af|af]]| – | – | – | – | – | – | – | 134,6| – ^ 134,6^ | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=af|af]]| – | – | – | – | – | – | – | 134,6| – ^ 134,6^ |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ar|ar]]| 28,8| 5,5| – | – | – | – | – | 126 195,5| 384,5^ 126 614,3^ | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ar|ar]]| 28,8| 5,5| – | – | – | – | – | 126 195,5| 384,5^ 126 614,3^ |
^:::|Core-misc| 2| 2| 3,5| 44,4| 52,2| 733,0| 532,9| 12,820| 2,148| 1,051| 4,791| 1,821| 2,385| | ^:::|Core-misc| 2| 2| 3,5| 44,4| 52,2| 733,0| 532,9| 12,820| 2,148| 1,051| 4,791| 1,821| 2,385| |
^:::|Acquis| 1| 18 563| 1 310,5| 15 264,2| 19 702,1| 556,9| 380,4| 13,209| 2,369| 0,886| 6,990| 2,588| 2,647| | ^:::|Acquis| 1| 18 563| 1 310,5| 15 264,2| 19 702,1| 556,9| 380,4| 13,209| 2,369| 0,886| 6,990| 2,588| 2,647| |
^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=fi|fi]]|Bible| 2| 66| 48,0| 542,6| 675,3| 529,0| 351,4| 13,324| 1,911| 0,871| 4,231| 1,534| 2,511| | ^:::|Bible| 2| 66| 48,0| 542,6| 675,3| 529,0| 351,4| 13,324| 1,911| 0,871| 4,231| 1,534| 2,511| |
^:::|Europarl| 1| 67 019| 675,6| 10 109,3| 11 838,6| 670,8| 462,7| 15,260| 2,483| 1,242| 6,924| 2,670| 2,395| | ^:::|Europarl| 1| 67 019| 675,6| 10 109,3| 11 838,6| 670,8| 462,7| 15,260| 2,483| 1,242| 6,924| 2,670| 2,395| |
^:::|Subtitles| 1| 30 900| 23 262,2| 90 481,8| 124 969,7| 666,5| 444,7| 3,909| 1,244| 0,242| 1,404| 0,513| 1,689| | ^:::|Subtitles| 1| 30 900| 23 262,2| 90 481,8| 124 969,7| 666,5| 444,7| 3,909| 1,244| 0,242| 1,404| 0,513| 1,689| |
^:::|PressEurop| 7| 6 991| 160,6| 2 725,2| 3 192,6| 546,7| 429,5| 17,486| 2,219| 1,017| 8,508| 2,772| 2,492| | ^:::|PressEurop| 7| 6 991| 160,6| 2 725,2| 3 192,6| 546,7| 429,5| 17,486| 2,219| 1,017| 8,508| 2,772| 2,492| |
^:::|Subtitles| 1| 45 407| 38 108,1| 211 310,4| 266 731,5| 509,0| 351,2| 5,572| 1,388| 0,383| 2,129| 0,795| 1,954| | ^:::|Subtitles| 1| 45 407| 38 108,1| 211 310,4| 266 731,5| 509,0| 351,2| 5,572| 1,388| 0,383| 2,129| 0,795| 1,954| |
^:::|Core-nonfict| 10| 10| 30,6| 518,7| 625,2| 645,0| 495,9| 17,765| 2,613| 1,223| 8,126| 2,801| 2,603| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]|Core-nonfict| 10| 10| 30,6| 518,7| 625,2| 645,0| 495,9| 17,765| 2,613| 1,223| 8,126| 2,801| 2,603| |
^:::|Core-fiction| 144| 144| 1 043,5| 11 757,6| 14 913,7| 633,0| 501,9| 11,643| 1,959| 0,865| 4,203| 1,557| 2,386| | ^:::|Core-fiction| 144| 144| 1 043,5| 11 757,6| 14 913,7| 633,0| 501,9| 11,643| 1,959| 0,865| 4,203| 1,557| 2,386| |
^:::|Core-misc| 6| 6| 12,8| 143,8| 180,7| 633,2| 484,5| 11,439| 1,947| 0,870| 4,378| 1,718| 2,265| | ^:::|Core-misc| 6| 6| 12,8| 143,8| 180,7| 633,2| 484,5| 11,439| 1,947| 0,870| 4,378| 1,718| 2,265| |
| |
Olga Nádvorníková and Alexandr Rosen (2024): Vyhledávání v paralelním korpusu za použití anotace Universal Dependencies. [[https://www.youtube.com/watch?v=5l5Vbb1eQDw&t=190s|Workshop recording]] from 17 September 2024, an accompanying event of the [[https://bcl2024.ff.cuni.cz|Bienále české lingvistiky 2024]]; see also the [[https://jakobson.korpus.cz/~rosen/BCL2024/P18_SLIDES/Prezentace_Bienale2024_WorkShop.pdf|slides]]. | Olga Nádvorníková and Alexandr Rosen (2024): Vyhledávání v paralelním korpusu za použití anotace Universal Dependencies. [[https://www.youtube.com/watch?v=5l5Vbb1eQDw&t=190s|Workshop recording]] from 17 September 2024, an accompanying event of the [[https://bcl2024.ff.cuni.cz|Bienále české lingvistiky 2024]]; see also the [[https://jakobson.korpus.cz/~rosen/BCL2024/P18_SLIDES/Prezentace_Bienale2024_WorkShop.pdf|slides]]. |
| |
| Alexandr Rosen (2024): Lexical and syntactic variability of languages and text genres – a corpus-based study. [[https://www.youtube.com/watch?v=E2ujmqt7Q2E|Lecture recording]] from 14 October 2024, [[https://zil.ipipan.waw.pl/seminarium|Seminarium „Przetwarzanie języka naturalnego”]] [[https://zil.ipipan.waw.pl|Zespołu Inżynierii Lingwistycznej]] w [[https://ipipan.waw.pl|Instytucie Podstaw Informatyki]] [[https://pan.pl|Polskiej Akademii Nauk]]; see also the [[https://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2024-10-14.pdf|slides]]. |
| |
Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024. [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/2024_UDCM_Wwa.pdf|Slides]] | Alexandr Rosen (2024): Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Instytut Slawistyki Zachodniej i Południowej, Uniwersytet Warszawski. Warszawa, 10/06/2024. [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/2024_UDCM_Wwa.pdf|Slides]] |