Differences

This shows you the differences between two versions of the page.

en:cnk:lindsei_cz [2017/01/27 10:59]
Michal Křen [History and present situation]
en:cnk:lindsei_cz [2017/04/27 15:26] (current)
Michal Křen [History and present situation]
Line 5:
 ===== History and present situation =====
  
-The learner corpus LINDSEI_CZ was created as part of the international [[https://www.uclouvain.be/en-cecl-lindsei.html|LINDSEI]] project, organized by the [[https://www.uclouvain.be/en-cecl.html|Centre for English Corpus Linguistics]] at [[https://www.uclouvain.be/en-index.html|Université catholique de Louvain]]. LINDSEI supplements the written learner corpus, the International Corpus of Learner English ([[http://www.uclouvain.be/en-cecl-icle.html|ICLE]]), with a corpus of advanced spoken learner English. Work on the LINDSEI corpus began in 1995 and the collection of data continues to this day. The purpose of the corpus is to capture the spontaneous spoken English of advanced students with different mother tongue backgrounds. These groups then form the individual subcorpora of LINDSEI.
+The learner corpus LINDSEI_CZ was created as part of the international [[https://uclouvain.be/en/research-institutes/ilc/cecl/lindsei.html|LINDSEI]] project, organized by the [[https://www.uclouvain.be/en-cecl.html|Centre for English Corpus Linguistics]] at [[https://www.uclouvain.be/en-index.html|Université catholique de Louvain]]. LINDSEI supplements the written learner corpus, the International Corpus of Learner English ([[http://www.uclouvain.be/en-cecl-icle.html|ICLE]]), with a corpus of advanced spoken learner English. Work on the LINDSEI corpus began in 1995 and the collection of data continues to this day. The purpose of the corpus is to capture the spontaneous spoken English of advanced students with different mother tongue backgrounds. These groups then form the individual subcorpora of LINDSEI.
  
 The first version of LINDSEI was published in 2010 (Gilquin et al. 2010)((Gilquin, Gaëtanelle, Sylvie De Cock, and Sylviane Granger (2010). //The Louvain International Database of Spoken English Interlanguage//. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.)). It was distributed on a CD-ROM with a search utility and an accompanying booklet describing the creation of the corpus and giving an overview of basic data and metadata. At that time LINDSEI contained 11 subcorpora (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish and Swedish). It contained approximately 1 million words (of which approx. 800 000 were student utterances), 554 interviews and 130 hours of recording. Since then, several more subcorpora have been added: Finnish, Norwegian, Lithuanian, Turkish, Taiwanese and Czech. Currently, the Arabic, Basque and Brazilian subcorpora are being worked on. The [[https://www.uclouvain.be/en-307845.html|second corpus version]] should therefore contain 20 national subcorpora, over 1000 interviews and 250 hours of recording. The corpus is available only as orthographic transcripts, and the publication of recordings is not being considered at the moment. The corpus is not systematically tagged, although some of the research teams have tagged their corpora for errors. A morphological tagging project has been in progress since spring 2016.