Corpus Jerome

Corpus JEROME is a monolingual comparable corpus designed specifically for analyzing translated Czech. It comprises more than 85 million tokens (including punctuation) and covers both fiction and professional literature. As a comparable corpus, it contains translated and non-translated Czech in equal amounts. Note that the non-translated texts are not the source texts of the translations; they are texts written originally in Czech, and they serve as a reference corpus.

Corpus JEROME is lemmatized and morphologically tagged in the same way as the SYN corpora. In addition, its annotation includes information that may be relevant to translation studies researchers: first edition (prvnivyd), sex of the author (autor_sex), and sex of the translator (preklad_sex).
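As an illustration of how these attributes might be used once document metadata has been exported, the following is a minimal Python sketch. Only the attribute names (prvnivyd, autor_sex, preklad_sex) come from the corpus description; the record layout, the document identifiers, the values such as "F" and "M", and the helper function select are hypothetical and do not reflect the actual corpus encoding or any query interface.

```python
from collections import Counter

# Hypothetical document-level metadata records. The attribute names come from
# the corpus description; the ids and values below are invented for illustration.
documents = [
    {"id": "doc1", "prvnivyd": "1998", "autor_sex": "M", "preklad_sex": "F"},
    {"id": "doc2", "prvnivyd": "2005", "autor_sex": "F", "preklad_sex": "F"},
    {"id": "doc3", "prvnivyd": "2001", "autor_sex": "M", "preklad_sex": "M"},
]


def select(docs, **criteria):
    """Return the documents whose metadata match every given attribute value."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Pick out texts translated by women (assuming "F" marks a female translator).
female_translated = select(documents, preklad_sex="F")
print([d["id"] for d in female_translated])

# Count the authors' sex within that selection.
print(Counter(d["autor_sex"] for d in female_translated))
```

A selection of this kind could serve as the starting point for a small-scale case study restricted to one group of translators.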

JEROME provides a unique source of data for translation studies scholars, linguists, and anyone else interested in what translated Czech looks like. It is well suited for quantitative analyses as well as small-scale qualitative case studies (e.g. a study of translations made by female translators).

Corpus JEROME also includes a subcorpus balanced according to source language (texts of almost equal length from each language). This subcorpus is inevitably smaller (5 million tokens), but it is well suited for verifying how general a finding is, e.g. when analyzing features referred to as translation universals.


Chlumská, L.: JEROME: jednojazyčný srovnatelný korpus pro výzkum překladové češtiny. Ústav Českého národního korpusu FF UK, Praha 2013. Available on-line:

Lucie Chlumská