
InterCorp: Release 3

InterCorp is a large parallel synchronic corpus covering a number of languages. The corpus is compiled mostly by teachers and students of the Faculty of Arts, Charles University in Prague, and by other collaborators of the ICNC.

After registration, the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register again for the parallel corpus.

There are several aspects that make InterCorp special among the corpora published by ICNC:

  • InterCorp can be accessed with a standard web browser through KonText, the integrated search interface of the Czech National Corpus. A tutorial is available in Czech, and a brief summary is available in English.
  • Unlike most other ICNC corpora, which are static (unchanged over time), InterCorp is incremental: its size and the number of languages it covers keep growing.

Texts in the corpus

The bulk of InterCorp consists of fiction in Czech and other languages, semi-automatically aligned, and a selection of political commentaries published by Project Syndicate. The currently available Czech, English, French, German, Russian and Spanish issues, dated 2000-2008, will be followed by more recent texts in future releases of the corpus. The Project Syndicate texts have been aligned automatically, so search results from them may include more misaligned segments than results from the fiction texts.

Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The size of InterCorp for all languages is given in the following table. The Project Syndicate data are included in the counts and amount to approximately 1.5 to 2 million words per language (as of February 2011, in InterCorp v. 3; see Version history).

Language      Number of words (in thousands)   Number of texts
Bulgarian     1,135                            15
Croatian      6,735                            96
Danish        190                              5
Dutch         3,914                            58
English       5,695                            Syndicate + 49
Finnish       1,247                            19
French        3,141                            Syndicate + 21
German        8,846                            Syndicate + 100
Hungarian     1,123                            17
Italian       2,817                            28
Latvian       1,085                            33
Lithuanian    353                              17
Norwegian     2,158                            21
Polish        4,716                            80
Portuguese    1,312                            18
Romanian      671                              5
Russian       2,951                            Syndicate + 25
Serbian       1,724                            27
Slovak        6,899                            138
Slovene       992                              16
Spanish       10,905                           Syndicate + 108
Swedish       3,673                            47
TOTAL         72,280                           943

Each Czech text is counted only once, even though it may have more than one foreign counterpart.

Language      Number of words (in thousands)   Number of texts
Czech         41,340                           Syndicate + 652
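
The pivot arrangement and the counting rule described above can be illustrated with a minimal Python sketch. It is not the corpus's actual data model: the catalogue, the text identifiers and the language codes below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy catalogue (identifiers and language codes are invented):
# each key stands for one text with a single Czech version, and the value
# is the set of foreign-language versions aligned with that Czech version.
CATALOGUE = {
    "novel_1": {"en", "de", "pl"},
    "novel_2": {"en"},
    "novel_3": {"de", "es"},
}


def texts_per_language(catalogue):
    """Count texts per language, counting the Czech pivot once per text."""
    counts = Counter()
    for foreign_versions in catalogue.values():
        # One Czech version per text, however many counterparts it has.
        counts["cs"] += 1
        counts.update(foreign_versions)
    return counts


def linked_via_pivot(catalogue, lang_a, lang_b):
    """Return texts whose versions in two foreign languages share a Czech pivot."""
    return sorted(
        text_id
        for text_id, langs in catalogue.items()
        if lang_a in langs and lang_b in langs
    )


counts = texts_per_language(CATALOGUE)
print(counts["cs"], counts["en"], counts["de"])
print(linked_via_pivot(CATALOGUE, "en", "de"))
```

Because every foreign version is aligned with the same Czech text, two foreign versions of one text can be related to each other through that shared Czech version, which is what the second function looks up in this toy setting.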

Morphosyntactic annotation

Texts in the following languages have received some morphosyntactic annotation.

language tags lemmas brief description detailed description tool
English in English in English + additions TreeTagger
Bulgarian in English TreeTagger
Czech in Czech in English1) in English Morče
French in English TreeTagger
Italian in English TreeTagger
Lithuanian in Czech and English Vidas Daudaravičius
Hungarian in English HunPos
German in German in German TreeTagger
Dutch TreeTagger
Norwegian analyzer, tagger
Polish in English in Polish in English Morfeusz, TaKIPI
Russian in English in English2) TreeTagger
Slovak in Slovak in Slovak Radovan Garabík, Morče
Spanish in English TreeTagger

See the Park Manual for advice on the use of tags in queries.

Please note: work in progress

The search interface Park is under continuous development. It is therefore likely that you will encounter inconveniences of various sorts or will miss features available in monolingual concordancers. Your error reports, reminders and suggestions are welcome at:

martin.vavrin@ff.cuni.cz

Acknowledgements

We are grateful for the possibility to use the following software and data:

Pre-processing

  • Sentence splitter for Czech by Pavel Květoň
  • Sentence splitter for Norwegian by Jarle Ebeling and Pavel Vondřička
  • Sentence splitter Punkt for all other languages from Natural Language Toolkit
  • Aligner Hunalign

Taggers/lemmatizers:

Corpus Query Engine:

Data:

Last update: 24 February 2011

See also

1) A helper application is available to assist you with queries that include Czech morphological tags.
2) Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.