~~NOTOC~~

====== InterCorp: Release 4 ======

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

After registration [[http://korpus.cz/english/prohlaseni-aj.php|here]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

  * InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].
  * InterCorp is unique in yet another aspect. Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: its size and the number of languages keep growing.

===== Texts in the corpus =====

The bulk of InterCorp consists of semi-automatically aligned fiction in Czech and other languages. It also includes a selection of political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.presseurop.eu/|Presseurop]]. These commentaries have been aligned automatically, so search results from them may include more misaligned segments.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of InterCorp as of September 2011 (InterCorp v. 4, see [[en:cnk:intercorp:historie|Version history]]) is 92 290 000 words in aligned foreign-language texts.
This number already includes Project Syndicate (approximately 2.3 to 3 million words in cs, de, en, es, fr, ru) and Presseurop (approximately 0.8 million words in cs, de, en, es, fr, it, nl, pl, pt, ro).

The corpus composition can be seen in the following chart, where the common title "beletrie" ("fiction") denotes all manually aligned texts, mostly (but not exclusively) containing fiction. The bars show the size in millions of words.

[{{:en:cnk:intercorp:intercorp_wordcounts_v4.png|Composition of the parallel corpora}}]

The following table presents the sizes of the parallel corpora in the individual languages. The numbers in each row show the number of words in the given language (in thousands) which are also available in the language indicated by the column header. For instance, the virtual Bulgarian-Croatian corpus contains 187 thousand words in Bulgarian (1st row - "bg", 9th column - "hr") and 189 thousand words in Croatian (9th row - "hr", 1st column - "bg"). The second (highlighted) column shows the number of words aligned with Czech and thus also the overall size of the monolingual corpus of the language given on the corresponding row.
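
Because the table is not symmetric, each language pair corresponds to two cells: one for each direction. The following minimal Python sketch illustrates that reading. It contains only the two Bulgarian-Croatian figures quoted above; the names ''WORDS_K'' and ''pair_size'' are hypothetical and are not part of any ICNC tool.

```python
# Word counts in thousands, keyed by (row language, column language).
# Only the two figures quoted in the text are included; the dictionary
# name and the helper below are illustrative assumptions.
WORDS_K = {
    ("bg", "hr"): 187,  # Bulgarian words in texts that also exist in Croatian
    ("hr", "bg"): 189,  # Croatian words in texts that also exist in Bulgarian
}


def pair_size(lang_a, lang_b, table=WORDS_K):
    """Return (words in lang_a, words in lang_b), in thousands,
    for the virtual lang_a-lang_b corpus."""
    return table[(lang_a, lang_b)], table[(lang_b, lang_a)]


bg_words, hr_words = pair_size("bg", "hr")
print(f"bg-hr: {bg_words} thousand Bulgarian words, {hr_words} thousand Croatian words")
```

Swapping the arguments returns the same two figures in the opposite order, which mirrors reading the "hr" row instead of the "bg" row.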
^    ^ bg ^ cs ^ da ^ de ^ en ^ es ^ fi ^ fr ^ hr ^ hu ^ it ^ lt ^ lv ^ nl ^ no ^ pl ^ pt ^ ro ^ ru ^ sl ^ sk ^ sr ^ sv ^
^ bg | 1135 | 1135 | 0 | 82 | 74 | 82 | 74 | 0 | 187 | 141 | 156 | 0 | 0 | 74 | 0 | 156 | 74 | 0 | 0 | 0 | 0 | 0 | 156 |
^ cs | 1139 | 46196 | 149 | 10544 | 6287 | 12177 | 1678 | 4075 | 6415 | 1162 | 3502 | 418 | 1128 | 4175 | 1815 | 6217 | 2109 | 1416 | 3563 | 893 | 7072 | 2521 | 4633 |
^ da | 0 | 190 | 190 | 87 | 130 | 0 | 0 | 0 | 87 | 0 | 0 | 87 | 0 | 0 | 130 | 136 | 0 | 0 | 130 | 87 | 0 | 87 | 87 |
^ de | 87 | 12167 | 83 | 12167 | 3802 | 4953 | 176 | 3717 | 1967 | 295 | 1654 | 259 | 22 | 1973 | 1020 | 1850 | 749 | 835 | 2934 | 428 | 431 | 552 | 989 |
^ en | 80 | 7297 | 135 | 3821 | 7297 | 3761 | 438 | 3448 | 519 | 104 | 1053 | 381 | 2 | 1092 | 397 | 1449 | 876 | 954 | 2836 | 286 | 0 | 383 | 343 |
^ es | 90 | 14237 | 0 | 5331 | 4141 | 14237 | 353 | 4072 | 2409 | 164 | 2924 | 169 | 0 | 2150 | 670 | 1834 | 1098 | 1128 | 2988 | 98 | 133 | 790 | 1375 |
^ fi | 62 | 1435 | 0 | 128 | 332 | 325 | 1435 | 107 | 234 | 73 | 62 | 73 | 0 | 109 | 107 | 242 | 62 | 73 | 81 | 73 | 0 | 98 | 164 |
^ fr | 0 | 5234 | 0 | 4228 | 3947 | 4207 | 155 | 5234 | 515 | 0 | 1181 | 0 | 0 | 948 | 155 | 1272 | 870 | 873 | 3003 | 68 | 0 | 78 | 414 |
^ hr | 189 | 6735 | 76 | 1736 | 461 | 2175 | 280 | 409 | 6735 | 83 | 1491 | 324 | 43 | 1084 | 870 | 1160 | 447 | 277 | 232 | 352 | 54 | 927 | 997 |
^ hu | 132 | 1123 | 0 | 256 | 81 | 135 | 81 | 0 | 79 | 1123 | 0 | 81 | 0 | 56 | 202 | 287 | 0 | 81 | 202 | 283 | 284 | 115 | 0 |
^ it | 174 | 4028 | 0 | 1678 | 1059 | 2815 | 84 | 1064 | 1607 | 0 | 4028 | 162 | 0 | 1308 | 844 | 1214 | 1384 | 798 | 62 | 72 | 0 | 732 | 849 |
^ lt | 0 | 358 | 58 | 185 | 259 | 115 | 71 | 0 | 253 | 71 | 113 | 358 | 16 | 196 | 173 | 297 | 43 | 71 | 101 | 129 | 13 | 171 | 58 |
^ lv | 0 | 1075 | 0 | 18 | 2 | 0 | 0 | 0 | 39 | 0 | 0 | 18 | 1075 | 2 | 2 | 36 | 0 | 0 | 0 | 19 | 233 | 0 | 0 |
^ nl | 80 | 5203 | 0 | 2202 | 1176 | 2273 | 149 | 968 | 1286 | 73 | 1433 | 281 | 3 | 5203 | 724 | 1632 | 1039 | 1047 | 64 | 78 | 0 | 482 | 574 |
^ no | 0 | 2158 | 135 | 965 | 394 | 693 | 144 | 144 | 990 | 164 | 891 | 259 | 3 | 706 | 2158 | 597 | 524 | 0 | 407 | 255 | 263 | 759 | 678 |
^ pl | 143 | 6173 | 111 | 1652 | 1256 | 1536 | 276 | 1052 | 1101 | 296 | 1063 | 346 | 37 | 1300 | 503 | 6173 | 829 | 900 | 237 | 283 | 178 | 220 | 553 |
^ pt | 82 | 2503 | 0 | 853 | 931 | 1105 | 82 | 854 | 486 | 0 | 1454 | 66 | 0 | 1003 | 519 | 1002 | 2503 | 855 | 66 | 0 | 0 | 519 | 263 |
^ ro | 0 | 1697 | 0 | 900 | 967 | 1107 | 106 | 817 | 327 | 106 | 814 | 106 | 0 | 968 | 0 | 1064 | 815 | 1697 | 0 | 106 | 0 | 578 | 85 |
^ ru | 0 | 3619 | 99 | 2636 | 2581 | 2444 | 92 | 2382 | 215 | 197 | 50 | 123 | 0 | 52 | 387 | 230 | 52 | 0 | 3619 | 268 | 197 | 71 | 163 |
^ sl | 0 | 992 | 81 | 407 | 257 | 106 | 91 | 60 | 377 | 308 | 78 | 172 | 21 | 78 | 297 | 317 | 0 | 91 | 297 | 992 | 237 | 243 | 189 |
^ sk | 0 | 6961 | 0 | 361 | 0 | 104 | 0 | 0 | 50 | 290 | 0 | 15 | 245 | 0 | 276 | 175 | 0 | 0 | 200 | 220 | 6961 | 84 | 117 |
^ sr | 0 | 2736 | 77 | 503 | 346 | 751 | 124 | 62 | 943 | 127 | 692 | 222 | 0 | 405 | 681 | 237 | 477 | 509 | 77 | 242 | 100 | 2736 | 271 |
^ sv | 178 | 5234 | 83 | 954 | 339 | 1366 | 214 | 371 | 1091 | 0 | 859 | 83 | 0 | 518 | 610 | 645 | 227 | 87 | 187 | 196 | 129 | 256 | 5234 |

===== Morphosyntactic annotation =====

Texts in the following languages have received some morphosyntactic annotation. See [[en:park:navod|Park Manual]] for advice on the use of tags in queries.

===== Please note: work in progress =====

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers.
Your error reports, reminders and suggestions are welcome at: martin.vavrin@ff.cuni.cz

===== Acknowledgements =====

We are grateful for the possibility to use the following software and data:

==== Pre-processing ====

  * Sentence splitter for Czech by Pavel Květoň
  * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]]
  * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]

==== Taggers/lemmatizers: ====

  * [[http://ufal.mff.cuni.cz/morce/|Morče]] for Czech
  * [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] for English, German, French, Italian, Dutch, Spanish, Bulgarian and Russian
  * [[http://sgjp.pl/morfeusz/|Morfeusz]] and [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] for Polish
  * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian
  * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]]
  * [[http://donelaitis.vdu.lt/~vidas/|Tagger for Lithuanian]]
  * [[http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger|Analyzer]] and [[http://omilia.uio.no/obt/|tagger]] for Norwegian

==== Corpus Query Engine: ====

  * [[http://www.textforge.cz/products|Manatee]]

==== Data: ====

  * Newspaper articles in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]
  * Short stories in a number of languages [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]] from [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]
  * A number of texts in the Czech-Lithuanian section of the corpus from Patrick Corness
  * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus
  * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus (in prep.)
  * Texts in a number of languages from the [[http://www-korpus.uni-r.de/ParaSol/|ParaSol]] corpus (in prep.)
  * Newspaper texts from the [[http://www.presseurop.eu|Presseurop]] server
  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus (in prep.)
  * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]]

{{:cnk:intercorp:projectsyndicate.png?direct&319}}

Last update: //5 October 2011//

===== See also =====

[[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze7|Release 7]] • [[en:cnk:intercorp:verze6|Release 6]] • [[en:cnk:intercorp:verze5|Release 5]] • [[en:cnk:intercorp:verze3|Release 3]] • [[en:cnk:intercorp:historie|Version history]]