~~NOTOC~~

====== InterCorp: Release 7 ======

^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
^ Positions ^ Number of tokens | 95 814 527 | 116 374 744 | 208 845 922 | 1 546 493 833 |
^ ::: ^ Number of word forms | 77 121 760 | 88 303 155 | 173 224 560 | 1 216 880 655 |
^ Structural attributes ^ Number of documents | 1 184 | 5 | 2 294 | 87 |
^ ::: ^ Number of div | 1 184 | 107 388 | 2 294 | 1 817 043 |
^ ::: ^ Number of sentences | 6 595 174 | 13 497 188 | 12 796 035 | 142 788 867 |
^ Further information ^ reference | YES ^^^^
^ ::: ^ representative | NO ^^^^
^ ::: ^ publication date | 2014 ^^^^
^ ::: ^ foreign languages | 38 ^^^^
^ ::: ^ tagged languages | 20 ^^^^
^ ::: ^ lemmatized languages | 17 ^^^^

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

===== Access to the texts =====

After [[http://korpus.cz/english/prohlaseni-aj.php|registration]], the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]], and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary is also available in English]].

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files consisting of shuffled pairs of sentences. Please contact us at the address below if you are interested.

A new release of InterCorp is published roughly once a year.
With each new release, the size of the corpus, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (in Park starting from release 5, in the other interfaces from release 6).

===== References =====

If you publish results based on InterCorp, we would appreciate a link to the project site [[http://www.korpus.cz/intercorp|www.korpus.cz/intercorp]]. You might also consider citing the following reference in your scientific publications:

Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//, 17(3):411–427 (bibtex((''@article{cermak:rosen:10, Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen}, Issn = {1384-6655}, Journal = {International Journal of Corpus Linguistics}, Number = {3}, Pages = {411--427}, Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus}, Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}, Volume = {17}, Year = {2012}}'')), [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at //ingentaConnect//]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).

For more references see [[en:cnk:diakorp:bibliografie|here]] or the [[https://www.korpus.cz/biblio.php|repository of bibliographical items based on the CNC]]. All references to work using InterCorp are welcome. See [[https://www.korpus.cz/biblio_appeal.php|here]] for details.

===== Texts in the corpus =====

The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, the so-called **collections**. The selection in the present release 7 includes:

  * political commentaries from Project Syndicate
  * newspaper texts from Presseurop/VoxEurop
  * legal texts of the EU from the JRC-Acquis corpus (Acquis Communautaire)
  * proceedings of the European Parliament from the Europarl corpus
  * film subtitles from the Open Subtitles database

These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. The omitted texts include those that have no Czech counterpart.
Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted – as a result, they may differ in form or size when compared with the original source. A similar selection was applied to the Open Subtitles database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items that are missing in the original resource but detectable from context or other sources have been added.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of the available part of InterCorp in release 7 is about 173,225,000 words in the aligned foreign-language texts in the core part and about 1,216,881,000 in the collections (see [[en:cnk:intercorp:historie|Version history]]).

The share of the core and the collections in the corpus can be seen in the following charts. The size is shown in millions of words.
[{{:en:cnk:intercorp:intercorp_wordcounts2_v7.png|Setup of the parallel corpus – the core}}]

[{{:en:cnk:intercorp:intercorp_wordcounts3_v7.png|Setup of the parallel corpus – collections}}]

==== Corpus size in thousands of words ====

^ Language Code ^ Language ^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^
| ar | Arabic | 34 | 0 | 0 | 0 | 0 | 0 | 34 |
| be | Belarusian | 1,751 | 0 | 0 | 0 | 0 | 0 | 1,751 |
| bg | Bulgarian | 4,923 | 0 | 0 | 13,816 | 9,083 | 0 | 27,823 |
| ca | Catalan | 4,498 | 0 | 0 | 0 | 0 | 0 | 4,498 |
| da | Danish | 1,311 | 0 | 0 | 21,680 | 13,916 | 14,430 | 51,336 |
| de | German | 26,315 | 3,050 | 1,715 | 21,724 | 13,089 | 8,367 | 74,260 |
| el | Greek | 0 | 0 | 0 | 25,070 | 15,404 | 23,715 | 64,188 |
| en | English | 12,641 | 3,083 | 1,863 | 24,208 | 15,580 | 52,101 | 109,476 |
| es | Spanish | 16,907 | 3,479 | 1,948 | 27,001 | 15,885 | 36,379 | 101,599 |
| et | Estonian | 0 | 0 | 0 | 15,963 | 10,900 | 10,296 | 37,158 |
| fi | Finnish | 3,054 | 0 | 0 | 16,455 | 10,175 | 15,098 | 44,782 |
| fr | French | 6,976 | 3,535 | 2,054 | 27,352 | 17,178 | 25,962 | 83,057 |
| he | Hebrew | 0 | 0 | 0 | 0 | 0 | 16,221 | 16,221 |
| hi | Hindi | 206 | 0 | 0 | 0 | 0 | 0 | 206 |
| hr | Croatian | 14,210 | 0 | 0 | 0 | 0 | 19,093 | 33,303 |
| hu | Hungarian | 4,014 | 0 | 0 | 19,177 | 12,307 | 21,240 | 56,737 |
| is | Icelandic | 0 | 0 | 0 | 0 | 0 | 1,585 | 1,585 |
| it | Italian | 6,313 | 247 | 1,893 | 24,849 | 15,489 | 14,654 | 63,446 |
| ja | Japanese | 0 | 0 | 0 | 0 | 0 | 113 | 113 |
| lt | Lithuanian | 358 | 0 | 0 | 18,393 | 11,213 | 558 | 30,522 |
| lv | Latvian | 1,337 | 0 | 0 | 18,745 | 11,689 | 280 | 32,051 |
| mk | Macedonian | 3,221 | 0 | 0 | 0 | 0 | 1,877 | 5,098 |
| ms | Malay | 0 | 0 | 0 | 0 | 0 | 3,521 | 3,521 |
| mt | Maltese | 0 | 0 | 0 | 14,133 | 0 | 0 | 14,133 |
| nl | Dutch | 9,370 | 0 | 2,082 | 24,746 | 15,563 | 29,363 | 81,125 |
| no | Norwegian | 4,103 | 0 | 0 | 0 | 0 | 0 | 4,103 |
| pl | Polish | 16,009 | 0 | 1,662 | 20,628 | 12,811 | 26,572 | 77,683 |
| pt | Portuguese | 2,393 | 0 | 2,103 | 28,603 | 16,485 | 43,392 | 92,976 |
| ro | Romanian | 3,156 | 0 | 1,917 | 8,200 | 9,446 | 34,129 | 56,847 |
| ru | Russian | 3,308 | 2,651 | 0 | 0 | 0 | 6,886 | 12,844 |
| sk | Slovak | 7,402 | 0 | 0 | 19,223 | 12,734 | 5,134 | 44,493 |
| sl | Slovene | 900 | 0 | 0 | 19,646 | 12,241 | 17,025 | 49,811 |
| sq | Albanian | 0 | 0 | 0 | 0 | 0 | 2,004 | 2,004 |
| sr | Serbian | 8,413 | 0 | 0 | 0 | 0 | 20,777 | 29,189 |
| sv | Swedish | 7,789 | 0 | 0 | 20,586 | 13,840 | 14,694 | 56,909 |
| tr | Turkish | 0 | 0 | 0 | 0 | 0 | 21,191 | 21,191 |
| uk | Ukrainian | 2,310 | 0 | 0 | 0 | 0 | 246 | 2,556 |
| vi | Vietnamese | 0 | 0 | 0 | 0 | 0 | 1,474 | 1,474 |
| **Subtotal** || 173,225 | 16,044 | 17,239 | 430,195 | 265,029 | 488,373 | 1,390,105 |
| cs | Czech | 77,122 | 2,749 | 1,640 | 20,303 | 12,923 | 50,688 | 165,425 |
| **Total** || 250,346 | 18,793 | 18,880 | 450,498 | 277,952 | 539,061 | 1,555,530 |

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

===== Morphosyntactic annotation =====

Texts in the following languages have received some morphosyntactic annotation.

^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tool ^
| Bulgarian | ✔ |  | [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|in English]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Czech | ✔ | ✔ | [[http://ucnk.ff.cuni.cz/bonito/znacky.php|in Czech]] [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[http://ufal.mff.cuni.cz/morce/|Morče]] |
| Dutch | ✔ |  | [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| English | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|+ additions]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|in Estonian and English]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Finnish | ✔ | ✔ | [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|in English]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].)) || [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] |
| French | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| German | ✔ | ✔ | [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]((Within a single tag, a colon is used instead of a period as the separator of individual morphological categories, e.g. ADJA:Pos:Nom:Sg:Fem.)) | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] |
| Hungarian | ✔ |  | [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]] || [[http://code.google.com/p/hunpos/|HunPos]] |
| Icelandic | ✔ | ✔ |  |  | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] |
| Italian | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/pub/corpora/italian-tagset.txt|in English]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Lithuanian | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/tags.html|in Czech and English]] | [[http://delivery.acm.org/10.1145/1570000/1567563/p94-daudaravicius.pdf?ip=62.245.92.111&acc=OPEN&key=1B55DF923F77674F55057ED4F3766CA0&CFID=216322351&CFTOKEN=30535677&__acm__=1368273161_6cdfd16427521446a21b56c60ab855ed|in English]] | Author: [[http://senas.vdu.lt/staff/informatics/CVPDF/CV_Daudaravicius_en.pdf|Vidas Daudaravičius]] |
| Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] [[http://tekstlab.uio.no/obt-ny/index.html|in Norwegian]] || [[http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger|analyzer]], [[http://omilia.uio.no/obt/|tagger]] |
| Polish | ✔ | ✔ | [[http://korpus.pl/en/cheatsheet/node2.html|in English]] [[http://korpus.pl/pl/cheatsheet/node2.html|in Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[http://sgjp.pl/morfeusz/|Morfeusz]], [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] |
| Portuguese | ✔ | ✔ | [[http://utkl.ff.cuni.cz/%7Erosen/public/ETIQUETAS_EAGLES_REDUCIDAS.webarchive|in Spanish]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]]((Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as "P-". All tags, as used in the corpus, are listed in the brief description.)) | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] | [[http://korpus.sk/files/tagset-www.pdf|in Slovak]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] |
| Slovene | ✔ | ✔ | [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|in English]] || [[http://nl2.ijs.si/analyze/|totale]] |
| Spanish | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-tagset.txt|in English]] || [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Swedish | ✔ | ✔ |  |  | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] |

See [[en:park:navod|Park Manual]] for advice on the use of tags in queries.

===== Problems, comments, suggestions =====

...
on the content of the corpus and on the search interfaces are welcome at martin.vavrin@ff.cuni.cz

===== Acknowledgements =====

We are grateful for the possibility to use the following texts and software:

==== Texts: ====

  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen
  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]] \\ {{:cnk:intercorp:projectsyndicate.png?direct&319}}
  * Newspaper texts in a number of languages from the [[http://www.voxeurop.eu|Presseurop/VoxEurop]] server
  * Legal texts in EU languages from the [[http://wt.jrc.it/lt/Acquis/|JRC-ACQUIS]] corpus
  * Proceedings of the European Parliament from the [[http://www.statmt.org/europarl/|EuroParl]] corpus
  * Slovak-Czech concordances from the [[http://korpus.juls.savba.sk/|Slovak National Corpus]]
  * Short stories in a number of languages [[http://www.goethe.de/ins/cz/prj/m89/csindex.htm|My 1989]] from [[http://www.goethe.de/ins/cz/pra/|Goethe Institut]]
  * A number of texts in the Czech-Lithuanian section of the corpus and Jiří Levý's The Art of Translation in more languages – with special thanks to Patrick Corness
  * George Orwell's novel //1984// in a number of languages from the [[http://nl.ijs.si/ME/|Multext-East]] corpus
  * Ukrainian and Polish texts from the [[http://www.domeczek.pl/~polukr/|PolUkr]] corpus
  * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]]
  * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]]

==== Pre-processing ====

  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
  * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
  * Sentence splitter for Czech by Pavel Květoň
  * Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  * Sentence splitter Punkt for all other languages from [[http://www.nltk.org/|Natural Language Toolkit]]

==== Taggers/lemmatizers: ====

  * [[http://ufal.mff.cuni.cz/morfflex|MorfFlex]], [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] and [[https://is.cuni.cz/webapps/zzp/download/140018093/?back_id=10|LanGr]] for Czech
  * [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish
  * [[http://sgjp.pl/morfeusz/|Morfeusz]] and [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] for Polish
  * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages
  * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík)
  * Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)
  * [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička)
  * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (thanks to Tomaž Erjavec)
  * [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] for German
  * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)
  * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)

==== Corpus Query Engine: ====

  * [[http://www.textforge.cz/products|Manatee]]
  * [[https://kontext.korpus.cz|KonText]]
  * [[http://nlp.fi.muni.cz/trac/noske|NoSketch Engine]]
  * [[http://www.korpus.cz/intercorp/?lang=cs|Park]]

Last update: //19 December 2014//

===== See also =====

[[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze6|Release 6]] • [[en:cnk:intercorp:verze5|Release 5]] • [[en:cnk:intercorp:verze4|Release 4]] • [[en:cnk:intercorp:verze3|Release 3]] • [[en:cnk:intercorp:historie|Version history]]