Learner corpus of spontaneous spoken English produced by advanced speakers whose L1 is Czech.

History and present situation

The learner corpus LINDSEI_CZ was created as part of the international LINDSEI project, organized by the Centre for English Corpus Linguistics at the Université catholique de Louvain. LINDSEI supplements the written learner corpus, the International Corpus of Learner English (ICLE), with a corpus of advanced spoken learner English. Work on the LINDSEI corpus began in 1995 and data collection continues to this day. The purpose of the corpus is to capture the spontaneous spoken English of advanced students with different mother tongue backgrounds. Each mother tongue group forms one subcorpus of LINDSEI.

The first version of LINDSEI was published in 2010 (Gilquin et al. 2010) 1). It was distributed on a CD-ROM with a search utility and an accompanying booklet describing the creation of the corpus and giving an overview of basic data and metadata. At that time LINDSEI contained 11 subcorpora (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish and Swedish), comprising approximately 1 million words (of which approx. 800 000 were student utterances), 554 interviews and 130 hours of recording. Since then, several more subcorpora have been added: Finnish, Norwegian, Lithuanian, Turkish, Taiwanese and Czech. The Arabic, Basque and Brazilian subcorpora are currently being worked on. The second version of the corpus should therefore contain 20 national subcorpora, over 1000 interviews and 250 hours of recording. The corpus is available only as orthographic transcripts, and publication of the recordings is not being considered at the moment. The corpus is not systematically tagged, although some of the research teams have tagged their subcorpora for errors. A morphological tagging project has been in progress since spring 2016.

Composition of the subcorpora

Each subcorpus contains 50 three-part interviews. The first part is a monologue on a topic chosen by the student (an important life experience; an important film or play; an important travelling experience). The second part is a conversation on common topics, touching upon everyday student life, plans for the future and study experiences. In the third part the students tell a story based on four illustrations. Each interview lasts approximately 15 minutes.


The interviews are orthographically transcribed. The transcripts conform to the transcription manual issued for this purpose by the Louvain Centre for English Corpus Linguistics. The transcripts record pauses, hesitation sounds, lengthened syllables, unfinished words, repetitions, overlaps and other paralinguistic sounds (coughing, laughter etc.). Personal data are anonymized in the transcripts.


LINDSEI was designed as a corpus of advanced student English. The level of proficiency was determined on the basis of institutional affiliation: the speakers had to be university students of English philology in their 3rd or higher year of study. This results in a certain imbalance, and the level of proficiency in LINDSEI is a somewhat vague variable. In the French subcorpus, the speakers' individual proficiency levels were later evaluated by professional examiners. For the German subcorpus, the proficiency level was established by the certificate that students presented at their entrance exams as proof of their language level. For the Czech and Taiwanese subcorpora, the students' proficiency levels are currently being determined by trained evaluators and IELTS examiners.


The Czech subcorpus LINDSEI_CZ was created in the years 2012–2015. Like the other national subcorpora, it contains fifty 15-minute recordings. The majority of these were made in the recording studio of the Institute of Phonetics FF UK; some, however, were made only with a dictaphone. The speakers were students of the English Language in their 3rd or higher year at the Department of English Linguistics and ELT Methodology, Faculty of Arts, Charles University. The coordinator of the entire project was PhDr. Tomáš Gráf, Ph.D. from the same department. The speakers gave signed informed consent for the data to be used for research, and then completed a questionnaire.

Number of speakers/recordings 50
Number of women 43
Number of men 7
Average age 22.5 years (SD = 1.6)
Average length of English language learning prior to studying English Language at university 9.9 years (SD = 2.6)
Average length of time spent studying English Language at university 3.4 years (SD = 0.9)
Length of time spent in English-speaking countries (median) 1.2 months
Number of positions (including punctuation and special symbols) 135 366
Number of word forms (tokens) 2) 123 761
Number of word forms (tokens; students only) 95 904
Length of recordings (total) 12h 52min
Length of recordings (students only) 10h 38min
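A few derived measures can be read off the table with simple arithmetic. The following is a minimal sketch, not part of the corpus tooling: the variable names are illustrative, and it assumes that the "students only" recording length corresponds to student speaking time.

```python
# Figures copied from the table above; derived measures are illustrative only.
TOKENS_TOTAL = 123_761      # word forms, interviewers and students
TOKENS_STUDENTS = 95_904    # word forms, students only
RECORDINGS = 50


def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'Xh Ymin' duration to whole minutes."""
    return hours * 60 + minutes


total_minutes = to_minutes(12, 52)     # total length of recordings
student_minutes = to_minutes(10, 38)   # assumed to be student speaking time

student_share = TOKENS_STUDENTS / TOKENS_TOTAL
mean_recording_minutes = total_minutes / RECORDINGS
student_tokens_per_minute = TOKENS_STUDENTS / student_minutes

print(f"Student share of word forms: {student_share:.1%}")
print(f"Mean recording length: {mean_recording_minutes:.1f} min")
print(f"Student word forms per minute: {student_tokens_per_minute:.0f}")
```

Under these assumptions, students produce roughly three quarters of all word forms, and the mean recording length is close to the 15 minutes stated for the interviews.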

Metadata in the KonText interface

Abbreviation Description
doc.file recording ID
doc.introduction_topic topic chosen for the introductory monologue (Country, Film, Experience)
doc.length_A_and_B_turns total number of words (i.e. including the interviewer's utterances)
doc.length_B_turns number of word forms excluding those of the interviewer
doc.duration length (minutes:seconds)
doc.status description of the relationship between the interviewer and the student (i.e. how well they know each other) date of recording
task.type task type, S = spontaneous monologue, F = free conversation, P = description of pictures
sp.type interviewer or student (interviewee)
sp.age age
sp.gender gender country of origin
sp.language mother tongue (first language)
sp.homelang languages spoken in the household where the student lives
sp.schooleng number of years studying English before university
sp.unieng number of years studying English at university
sp.monthseng number of months spent in an English-speaking country/countries (in total)
sp.olang other foreign languages which the student uses
remark.type notes (e.g. information about experiences with other languages, clarification, specification)
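When metadata values are processed outside KonText, two of the attributes above need light decoding: doc.duration is given as minutes:seconds, and task.type uses one-letter codes. The following is a minimal sketch under the assumption that the values are available as plain strings; the helper names and the sample record are hypothetical and are not taken from the corpus.

```python
# Codes for task.type as listed in the metadata table above.
TASK_TYPES = {
    "S": "spontaneous monologue",
    "F": "free conversation",
    "P": "description of pictures",
}


def duration_to_seconds(value: str) -> int:
    """Convert a doc.duration value in 'minutes:seconds' form to seconds."""
    minutes, seconds = value.strip().split(":")
    return int(minutes) * 60 + int(seconds)


def describe_task(code: str) -> str:
    """Expand a task.type code; unknown codes are returned unchanged."""
    return TASK_TYPES.get(code, code)


# Hypothetical record, invented for illustration only.
record = {"doc.duration": "15:30", "task.type": "P"}
print(duration_to_seconds(record["doc.duration"]))
print(describe_task(record["task.type"]))
```

Returning unknown codes unchanged keeps the helper usable if the data contain values that the table does not list.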


The project coordinator would like to thank the Institute of the Czech National Corpus for its financial support of the project. Furthermore, he would like to thank all the students who were involved in the project. He would also like to thank the collaborators and advisors from the Université catholique de Louvain and Justus-Liebig-Universität Giessen, namely Gaëtanelle Gilquin, Sylviane Granger and Sandra Götz. Special thanks go to Sarah Gráfová for acquiring half of the recordings and to the Institute of Phonetics FF UK for allowing the use of the recording studio.

How to cite

Gráf, Tomáš (2017). LINDSEI_CZ: A Corpus of Spontaneous Spoken English of Advanced Speakers. Institute of the Czech National Corpus FF UK, Prague 2017. Available online at:

Tomáš Gráf

1) Gilquin, Gaëtanelle, Sylvie De Cock, and Sylviane Granger (2010). The Louvain International Database of Spoken English Interlanguage. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
2) Filler sounds and unfinished words have also been included; positions containing an apostrophe are counted as one token.