RomCro 2.0 - Parallel corpus of Romance languages and Croatian

The project Parallel Corpus in Romance Languages and Croatian (RomCro) started in 2019 at the Chair of Romance Linguistics of the Department of Romance Studies of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus unites six Romance languages (French, Portuguese, Romanian, Italian, Spanish and, recently, Catalan) and, with the addition of Croatian, contributes to the existing linguistic resources for the Croatian language. It consists of literary texts from the 20th and 21st centuries, with each original-language text accompanied by translations into the other languages.

The RomCro corpus was created with the support of the Faculty of Humanities and Social Sciences, University of Zagreb, from 2019 to 2025. The new version was also developed as part of a project supported by the Croatian Science Foundation and funded by the European Union – NextGenerationEU (project number: MOBODL 2023 08 9511). The new version of the corpus includes three new titles in Portuguese and Croatian. Furthermore, the sixth Romance language, Catalan, has been added by integrating existing Catalan translations and incorporating three Catalan novels with translations into the other languages. Compared to the first version of the corpus (see Table 1), RomCro v.2.0 includes 54 new texts, 24,272 more translation units, and 3.7 million more words, for a total of 19.4 million words.

                              RomCro v.1.0   RomCro v.2.0   Difference
Languages                                6              7            1
Translation units                  142,470        166,742       24,272
Originals                               27             33            6
Texts total                            159            213           54
Size (in millions of words)           15.7           19.4          3.7

Table 1. Comparison between the two versions

RomCro was annotated with UDPipe according to the Universal Dependencies (UD) standard, which means that it is not only lemmatised and morphologically tagged but also syntactically annotated. RomCro is available through the KonText query interface, in a form that follows the UD versions of the InterCorp parallel corpus.
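To illustrate what the three annotation layers look like, the sketch below reads one sentence in CoNLL-U, the ten-column tab-separated format defined by Universal Dependencies and commonly produced by UDPipe. It assumes CoNLL-U as the input format; this page does not say in which format the RomCro data are stored or distributed. The Spanish sentence and its annotation are hand-written for illustration and are not taken from RomCro, and the names `Token`, `parse_feats` and `parse_sentence` are hypothetical helpers.

```python
from dataclasses import dataclass

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_COLUMNS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


@dataclass
class Token:
    id: int
    form: str
    lemma: str    # lemmatisation layer
    upos: str     # universal part-of-speech tag
    feats: dict   # morphological features
    head: int     # ID of the syntactic head (0 marks the sentence root)
    deprel: str   # dependency relation to the head


def parse_feats(raw):
    """Turn 'Gender=Masc|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_sentence(conllu):
    """Parse one CoNLL-U sentence into a list of Token objects."""
    tokens = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(CONLLU_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            parse_feats(cols[5]), int(cols[6]), cols[7]))
    return tokens


# Hand-written illustrative rows (NOT taken from RomCro).
ROWS = [
    ("1", "El", "el", "DET", "_",
     "Definite=Def|Gender=Masc|Number=Sing|PronType=Art", "2", "det", "_", "_"),
    ("2", "gato", "gato", "NOUN", "_",
     "Gender=Masc|Number=Sing", "3", "nsubj", "_", "_"),
    ("3", "duerme", "dormir", "VERB", "_",
     "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
     "0", "root", "_", "SpaceAfter=No"),
    ("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "# text = El gato duerme.\n" + "\n".join("\t".join(r) for r in ROWS)

tokens = parse_sentence(SAMPLE)
for tok in tokens:
    # Token IDs are 1-based and sequential here, so the head is at index head - 1.
    head = "ROOT" if tok.head == 0 else tokens[tok.head - 1].form
    print(f"{tok.form:<8}{tok.lemma:<8}{tok.upos:<6}{tok.deprel:<6} -> {head}")
```

Each token carries a lemma, a part-of-speech tag with morphological features, and a link to its syntactic head, which is the information that the UD annotation adds on top of the raw text.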

How to cite RomCro:

Bikić-Carić, G., Mikelenić, B. & Bezlaj, M. (2023). Construcción del RomCro, un corpus paralelo multilingüe. Procesamiento del Lenguaje Natural, 70. Sociedad Española para el Procesamiento del Lenguaje Natural, 99-110.

Mikelenić, B., Bikić-Carić, G., Bezlaj, M., Oliver, A. & Tadić, M. (2025). RomCro v.2.0 - Parallel corpus of Romance languages and Croatian, HR-CLARIN, http://hdl.handle.net/20.500.14615/2-16

* The 2023 paper describes the building of RomCro v.1.0, while the 2025 repository entry refers to RomCro v.2.0 in the HR-CLARIN repository. Please cite both sources when referring to the corpus.