~~NOTOC~~

====== InterCorp: Release 6 ======

^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
^ Positions ^ Number of tokens | 76 861 107 | 46 880 365 | 167 141 155 | 890 129 077 |
^ ::: ^ Number of word forms | 61 962 499 | 37 584 764 | 138 762 949 | 728 507 959 |
^ Structural attributes ^ Number of documents | 996 | 4 | 1 939 | 56 |
^ ::: ^ Number of div | 996 | 96 988 | 1 939 | 1 728 492 |
^ ::: ^ Number of sentences | 5 254 361 | 2 392 808 | 10 283 732 | 44 113 753 |
^ Further information ^ reference | YES ^^^^
^ ::: ^ representative | NO ^^^^
^ ::: ^ publication date | 2013 ^^^^
^ ::: ^ foreign languages | 31 ^^^^
^ ::: ^ tagged languages | 17 ^^^^
^ ::: ^ lemmatized languages | 14 ^^^^

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

===== Access to the texts =====

After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

InterCorp can be accessed via a standard web browser in two ways:

  * From the integrated search interface of the Czech National Corpus, [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]], and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary]] is also available in English.
  * From [[http://www.korpus.cz/Park|Park]], a purpose-built interface. A brief user manual is available [[en:park:navod|here]].

Both search interfaces are based on the [[http://www.textforge.cz/products|Manatee]] corpus engine and access identical texts.
Park can also be used to search the previous version of the corpus.

After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. If you are interested, please contact us at the address below.

Unlike most other ICNC corpora, which are static (unchanging over time), InterCorp is incremental. With each new release, its size, the number of languages, and the extent and quality of annotation may grow.

===== References =====

If you publish results based on InterCorp, we would appreciate a link to the project site [[http://www.korpus.cz/intercorp|www.korpus.cz/intercorp]]. You might also consider adding the following reference in your scientific publications:

Čermák, F. and Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. //International Journal of Corpus Linguistics//, 17(3):411–427 (bibtex((''@article{cermak:rosen:10, Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen}, Issn = {1384-6655}, Journal = {International Journal of Corpus Linguistics}, Number = {3}, Pages = {411--427}, Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus}, Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}, Volume = {17}, Year = {2012}}'')), [[http://dx.doi.org/10.1075/ijcl.17.3.05cer|electronic edition at //ingentaConnect//]], [[http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf|preprint version]]).

For more references see [[en:cnk:intercorp:bibliografie|here]]. Additional references to work using InterCorp are welcome. Please let us know at the e-mail address below.

===== Texts in the corpus =====

The **core** of InterCorp consists mostly of fiction, manually aligned. InterCorp also offers a selection of fully automatically processed texts, so-called **collections**.
The current choice includes political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.presseurop.eu/|Presseurop]], a package of legal texts of the European Union from the [[http://langtech.jrc.it/JRC-Acquis.html|Acquis Communautaire]] corpus, and proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus. These texts have been aligned automatically, so search results may include a higher number of misaligned segments. Some texts from the Acquis Communautaire and Europarl corpora have been partially corrected or omitted; as a result, they may differ in form or size from the original source.

Moreover, even some core texts in the current release 6 are temporarily aligned only automatically, without manual checking. This concerns some of the texts acquired from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]]. Alignment of these texts will be checked and corrected before the next release.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions.

The total size of the available part of InterCorp in release 6 from April 2013 is 138,779,000 words in the aligned foreign-language texts in the core part and 728,508,000 in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The size is shown in millions of words.
[{{:en:cnk:intercorp:intercorp_wordcounts2_v6.png|Setup of the parallel corpus – the core}}] [{{:en:cnk:intercorp:intercorp_wordcounts3_v6.png|Setup of the parallel corpus – collections}}]

==== Corpus size in thousands of words ====

^ Language Code ^ Language ^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Total ^
| ar | Arabic | 29 | 0 | 0 | 0 | 0 | 29 |
| be | Belarusian | 1 308 | 0 | 0 | 0 | 0 | 1 308 |
| bg | Bulgarian | 3 979 | 0 | 0 | 13 816 | 9 083 | 26 879 |
| ca | Catalan | 1 758 | 0 | 0 | 0 | 0 | 1 758 |
| da | Danish | 190 | 0 | 0 | 21 680 | 13 916 | 35 785 |
| de | German | 17 256 | 3 050 | 1 715 | 21 724 | 13 089 | 56 835 |
| el | Greek | 210 | 0 | 0 | 25 070 | 15 404 | 40 683 |
| en | English | 10 019 | 3 083 | 1 863 | 24 208 | 15 580 | 54 753 |
| es | Spanish | 14 552 | 3 479 | 1 948 | 27 001 | 15 885 | 62 865 |
| et | Estonian | 0 | 0 | 0 | 15 963 | 10 900 | 26 862 |
| fi | Finnish | 2 131 | 0 | 0 | 16 667 | 10 241 | 29 040 |
| fr | French | 3 816 | 3 535 | 2 054 | 27 352 | 17 178 | 53 936 |
| hi | Hindi | 155 | 0 | 0 | 0 | 0 | 155 |
| hr | Croatian | 12 625 | 0 | 0 | 0 | 0 | 12 625 |
| hu | Hungarian | 2 511 | 0 | 0 | 19 168 | 12 307 | 33 985 |
| it | Italian | 4 081 | 247 | 1 893 | 24 850 | 15 489 | 46 560 |
| lt | Lithuanian | 358 | 0 | 0 | 18 433 | 11 020 | 29 811 |
| lv | Latvian | 1 337 | 0 | 0 | 18 745 | 11 689 | 31 770 |
| mk | Macedonian | 2 664 | 0 | 0 | 0 | 0 | 2 664 |
| mt | Maltese | 0 | 0 | 0 | 14 133 | 0 | 14 133 |
| nl | Dutch | 9 426 | 0 | 2 082 | 24 746 | 15 563 | 51 817 |
| no | Norwegian | 2 301 | 0 | 0 | 0 | 0 | 2 301 |
| pl | Polish | 12 710 | 0 | 1 660 | 20 464 | 12 805 | 47 640 |
| pt | Portuguese | 2 318 | 0 | 2 103 | 28 599 | 16 481 | 49 502 |
| ro | Romanian | 2 433 | 0 | 1 917 | 8 200 | 9 446 | 21 995 |
| ru | Russian | 4 937 | 2 651 | 0 | 0 | 0 | 7 588 |
| sk | Slovak | 8 152 | 0 | 0 | 19 222 | 12 734 | 40 108 |
| sl | Slovene | 1 855 | 0 | 0 | 19 646 | 12 241 | 33 741 |
| sr | Serbian | 6 972 | 0 | 0 | 0 | 0
| 6 972 |
| sv | Swedish | 7 205 | 0 | 0 | 20 615 | 13 874 | 41 694 |
| uk | Ukrainian | 1 493 | 0 | 0 | 0 | 0 | 1 493 |
| **Total** | | **138 779** | **16 044** | **17 237** | **430 300** | **264 926** | **867 287** |
| cs | Czech | 61 962 | 2 741 | 1 639 | 20 285 | 12 920 | 99 547 |

N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.

===== Morphosyntactic annotation =====

Texts in the following languages have received some morphosyntactic annotation.

^ Language ^ Tags ^ Lemmas ^ Brief description ^ Detailed description ^ Tool ^
| Bulgarian | ✔ | [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|in English]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Czech | ✔ | ✔ | [[http://korpus.cz/bonito/znacky.php|in Czech]] [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags.
Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[http://ufal.mff.cuni.cz/morce/|Morče]] |
| Dutch | ✔ | [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| English | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|+ additions]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus|Estonian and English]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| French | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| German | ✔ | ✔ | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Hungarian | ✔ | [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]] | [[http://code.google.com/p/hunpos/|HunPos]] |
| Italian | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/pub/corpora/italian-tagset.txt|in English]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Lithuanian | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/tags.html|in Czech and English]] |
[[http://delivery.acm.org/10.1145/1570000/1567563/p94-daudaravicius.pdf?ip=62.245.92.111&acc=OPEN&key=1B55DF923F77674F55057ED4F3766CA0&CFID=216322351&CFTOKEN=30535677&__acm__=1368273161_6cdfd16427521446a21b56c60ab855ed|in English]] | Author: [[http://senas.vdu.lt/staff/informatics/CVPDF/CV_Daudaravicius_en.pdf|Vidas Daudaravičius]] |
| Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] [[http://tekstlab.uio.no/obt-ny/index.html|in Norwegian]] | [[http://maximos.aksis.uib.no/Aksis-wiki/Oslo-Bergen_Tagger|analyzer]], [[http://omilia.uio.no/obt/|tagger]] |
| Polish | ✔ | ✔ | [[http://korpus.pl/en/cheatsheet/node2.html|in English]] [[http://korpus.pl/pl/cheatsheet/node2.html|in Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[http://sgjp.pl/morfeusz/|Morfeusz]], [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] |
| Portuguese | ✔ | ✔ | [[http://utkl.ff.cuni.cz/%7Erosen/public/ETIQUETAS_EAGLES_REDUCIDAS.webarchive|in Spanish]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Russian | ✔ | ✔ | [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]]((Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as "P-".
All tags, as used in the corpus, are listed in the brief description.)) | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |
| Slovak | ✔ | ✔ | [[http://korpus.sk/morpho.html/|in Slovak]] | [[http://korpus.sk/files/tagset-www.pdf|in Slovak]] | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] |
| Slovene | ✔ | ✔ | [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|in English]] | [[http://nl2.ijs.si/analyze/|totale]] |
| Spanish | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-tagset.txt|in English]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] |

See [[en:park:navod|Park Manual]] for advice on the use of tags in queries.

===== Problems, comments, suggestions =====

Problems, comments, and suggestions concerning the content of the corpus and the search interfaces are welcome at martin.vavrin@ff.cuni.cz.

===== Acknowledgements =====

We are grateful for the possibility to use the following texts and software:

==== Texts: ====

  * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]], with special thanks to Adrian Barentsen
  * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]