Learner corpus of written academic English by advanced L2 English university students whose L1 is Czech.

The learner corpus VESPA_CZ was created as part of the international project VESPA (The Varieties of English for Specific Purposes dAtabase), organized by the Centre for English Corpus Linguistics, Université catholique de Louvain. The aim of the project, which was initiated in 2008, is to build a database of English academic writing by L2 English university students from various mother tongue backgrounds. The corpus will comprise a wide range of disciplines (e.g. linguistics, business, biology) and registers (e.g. essays, reports, MA dissertations). The first release of the corpus (over 2 million words, available at https://corpora.uclouvain.be/cecl/vespa/home) contains texts written by university students in the Netherlands, Belgium, Spain, Norway and Sweden. Apart from the subcorpus with L1 Czech background, subcorpora with L1 French, German, and Turkish are currently under development, as well as a comparable corpus of native English student writing.

VESPA comprises only texts collected in disciplinary content courses. The minimum length of a text is 500 words; the authors’ expertise in academic writing ranges from that of first-year BA students to that of PhD students. To be included in the corpus, the texts have to be entirely the students’ own; texts produced by more than one student and revised versions of texts are avoided. The students submit the texts in electronic format, together with the learner profile questionnaire, which includes the permission for the text to be used for research and teaching purposes.

The Czech subcorpus VESPA_CZ was compiled in 2019-2022. It comprises English academic texts from the domains of literature (essays), linguistics and economics (term papers) which were written by university students of the BA and MA programmes ‘English and American Studies’, ‘Anglophone Literatures’ and ‘English Language’ (Faculty of Arts, Charles University, Prague) and the BA programme ‘Arts Management’ (Faculty of Business Administration, University of Economics and Business, Prague). The texts have been tagged using the VESPA macros and Perl scripts (Ebeling & Heuboeck 2007; Heuboeck et al. 2008). Text divisions and sections have been tagged, as well as quotes (quoted passages, book titles, etc., <q>), block quotes (i.e. quotes separated from the main text by indentation or a new line character, <quote>), and ‘mentioned items’ (i.e. mostly linguistic examples and passages of texts analysed by students, <mentioned>).
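
The tagging described above makes it possible to separate the students' own wording from quoted and mentioned material, for instance before counting words or extracting phraseology. The following Python sketch illustrates one way this might be done. It assumes that the tags appear as paired, non-nested XML-like elements (<q>…</q>, <quote>…</quote>, <mentioned>…</mentioned>); the exact serialization in the distributed files may differ, and the function names and the sample sentence are hypothetical, not taken from the corpus.

```python
import re

# Tag names taken from the corpus description; the paired, non-nested
# <tag>...</tag> form is an assumption made for this illustration.
TAGS = ("q", "quote", "mentioned")


def extract_spans(text, tag):
    """Return the contents of every <tag>...</tag> span in text."""
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    return [span.strip() for span in pattern.findall(text)]


def strip_spans(text, tags=TAGS):
    """Remove tagged spans so that only the student's own wording remains."""
    for tag in tags:
        text = re.sub(rf"<{tag}>.*?</{tag}>", " ", text, flags=re.DOTALL)
    # Collapse the whitespace left behind by the removed spans.
    return re.sub(r"\s+", " ", text).strip()


# Invented sample text, not taken from VESPA_CZ.
sample = (
    "The author calls this <q>a turning point</q>. "
    "<quote>It was the best of times.</quote> "
    "The verb <mentioned>make</mentioned> is frequent."
)

own_text = strip_spans(sample)
mentioned_items = extract_spans(sample, "mentioned")
```

Removing the spans rather than only the tags keeps quoted passages and linguistic examples out of frequency counts, which is usually what a learner-language analysis needs; a study of quoting practices would use extract_spans instead.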


The compilation of the corpus was supported by the Czech Science Foundation grant 19-05180S ‘Phraseology in English academic texts written by Czech advanced learners: a comparative study of learner and native speaker discourse’. The compilation would not have been possible without the support of the Institute of the Czech National Corpus and the Université catholique de Louvain VESPA team. The texts were formatted and tagged by MA and PhD students of the ‘English Language’ and ‘Anglophone Literatures and Cultures’ programmes at the Faculty of Arts (Charles University, Prague).


How to cite VESPA_CZ

Malá, M. – Brůhová, G. – Vašků, K.: VESPA_CZ: Learner corpus of written academic English, version 1, 21 Dec 2022. Ústav Českého národního korpusu FF UK, Praha 2022. Available at: http://www.korpus.cz