A learner corpus of spontaneous spoken English produced by advanced speakers whose L1 is Czech.
History and present situation
The learner corpus LINDSEI_CZ was created as part of the international LINDSEI project, organized by the Centre for English Corpus Linguistics at the Université catholique de Louvain. LINDSEI supplements the written learner corpus, the International Corpus of Learner English (ICLE), with a corpus of advanced spoken learner English. Work on the LINDSEI corpus began in 1995, and data collection continues to this day. The purpose of the corpus is to capture the spontaneous spoken English of advanced students with different mother tongue backgrounds. Each mother tongue group forms one subcorpus of LINDSEI.
The first version of LINDSEI was published in 2010 (Gilquin et al. 2010). It was distributed on a CD-ROM with a search utility and an accompanying booklet describing the creation of the corpus and giving an overview of the basic data and metadata. At that time LINDSEI contained 11 subcorpora (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish and Swedish). It comprised approximately 1 million words (of which approx. 800,000 were student utterances), 554 interviews and 130 hours of recordings. Since then, several more subcorpora have been added: Finnish, Norwegian, Lithuanian, Turkish, Taiwanese and Czech. The Arabic, Basque and Brazilian subcorpora are currently being worked on. The second version of the corpus should therefore contain 20 national subcorpora, over 1,000 interviews and 250 hours of recordings. The corpus is available only as orthographic transcripts, and publication of the recordings is not being considered at the moment. The corpus is not systematically tagged, although some of the research teams have tagged their subcorpora for errors. A morphological tagging project has been in progress since spring 2016.
Composition of the subcorpora
Each subcorpus contains 50 three-part interviews. The first part is a monologue dealing with a topic chosen by the student (important life experience; important film or play; important travelling experience). The second part is a conversation dealing with common topics, touching upon everyday student life, plans for the future and study experiences. In the third part the students tell a story based on 4 illustrations. Every interview lasts for approximately 15 minutes.
The interviews are orthographically transcribed. The transcripts conform to the transcription manual issued for this purpose by the Louvain Centre for English Corpus Linguistics. They record pauses, hesitation sounds, lengthened syllables, unfinished words, repetitions, overlaps and other paralinguistic sounds (coughing, laughter, etc.). Personal data are anonymized in the transcripts.
LINDSEI was designed as a corpus of advanced student English. The level of proficiency was determined on the basis of institutional affiliation: the speakers had to be university students of English philology in their 3rd or higher year of study. This results in a certain imbalance, and the level of proficiency in LINDSEI is a somewhat vague variable. In the French subcorpus, the speakers' individual proficiency levels were later evaluated by professional examiners. For the German subcorpus, the proficiency level was established by the certificate that students presented at their entrance exams as proof of their language level. For the Czech and Taiwanese subcorpora, the students' proficiency levels are currently being determined by trained evaluators and IELTS examiners.
The Czech subcorpus LINDSEI_CZ was created in the years 2012–2015. Like the other national subcorpora, it contains fifty 15-minute recordings. The majority of these were made in the recording studio of the Institute of Phonetics FF UK; some, however, were made only with a dictaphone. The speakers were students of the English Language in their 3rd or higher year at the Department of English Linguistics and ELT Methodology, Faculty of Arts, Charles University. The coordinator for the entire project was PhDr. Tomáš Gráf, Ph.D. from the same institute. The speakers signed an informed consent form agreeing that the data could be used for research, and then completed a questionnaire.
|Number of speakers/recordings||50|
|Number of women||43|
|Number of men||7|
|Average age||22.5 years (SD = 1.6)|
|Average length of English learning prior to studying the English Language at university||9.9 years (SD = 2.6)|
|Average length of time spent studying the English Language at university||3.4 years (SD = 0.9)|
|Length of time spent in English-speaking countries||1.2 months (median)|
Metadata in the KonText interface
|doc.introduction_topic||topic chosen for the introductory monologue (Country, Film, Experience)|
|doc.length_A_and_B_turns||total number of words (i.e. including the interviewer's utterances)|
|doc.length_B_turns||number of word forms excluding those of the interviewer|
|doc.status||description of the relationship between the interviewer and the student (i.e. how well they know each other)|
|doc.date||date of recording|
|task.type||task type, S = spontaneous monologue, F = free conversation, P = description of pictures|
|sp.type||interviewer or student (interviewee)|
|sp.country||country of origin|
|sp.language||mother tongue (first language)|
|sp.homelang||languages spoken in the household where the student lives|
|sp.schooleng||number of years studying English before university|
|sp.unieng||number of years studying English at university|
|sp.monthseng||total number of months spent in an English-speaking country or countries|
|sp.olang||other foreign languages which the student uses|
|remark.type||notes (e.g. information about experiences with other languages, clarification, specification)|
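As a rough illustration of how these attributes relate to one another, the following Python sketch works on a few invented interview records. It is only a sketch under stated assumptions: the dictionary layout, the numeric values and the helper names `student_share` and `by_topic` are hypothetical and are not part of KonText or of the corpus distribution. Only the attribute names are taken from the table above.

```python
from statistics import mean

# Hypothetical records: the keys mirror the attribute names listed above,
# but every value is invented for illustration (assumption, not corpus data).
interviews = [
    {"doc.introduction_topic": "Film", "doc.length_A_and_B_turns": 2400,
     "doc.length_B_turns": 1800, "sp.schooleng": 10, "sp.unieng": 3},
    {"doc.introduction_topic": "Country", "doc.length_A_and_B_turns": 2000,
     "doc.length_B_turns": 1500, "sp.schooleng": 8, "sp.unieng": 4},
    {"doc.introduction_topic": "Film", "doc.length_A_and_B_turns": 2500,
     "doc.length_B_turns": 1600, "sp.schooleng": 12, "sp.unieng": 3},
]


def student_share(record):
    """Fraction of an interview's words produced by the student (B turns)."""
    return record["doc.length_B_turns"] / record["doc.length_A_and_B_turns"]


def by_topic(records, topic):
    """Keep only interviews whose introductory monologue has the given topic."""
    return [r for r in records if r["doc.introduction_topic"] == topic]


film_interviews = by_topic(interviews, "Film")
print(len(film_interviews), "interviews with the topic Film")
print("mean student share:", mean(student_share(r) for r in film_interviews))
print("mean years of English before university:",
      mean(r["sp.schooleng"] for r in interviews))
```

Dividing `doc.length_B_turns` by `doc.length_A_and_B_turns` gives an approximate measure of how much of each interview is learner speech, which can be useful when comparing frequencies across interviews of different lengths.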
The project coordinator would like to thank the Institute of the Czech National Corpus for its financial support of the project. Furthermore, he would like to thank all the students who were involved in the project. He would also like to thank the collaborators and advisors from the Université catholique de Louvain and Justus-Liebig-Universität Giessen, namely Gaëtanelle Gilquin, Sylviane Granger and Sandra Götz. Special thanks go to Sarah Gráfová for acquiring half of the recordings and to the Institute of Phonetics FF UK for allowing the use of its recording studio.
How to cite
Gráf, Tomáš (2017). LINDSEI_CZ: A Corpus of Spontaneous Spoken English of Advanced Speakers. Institute of the Czech National Corpus FF UK, Prague 2017. Available online at: http://www.korpus.cz
— Tomáš Gráf