Next revision | Previous revision | ||
en:cnk:czesl-plain [2015/10/07 12:00] – vytvořeno alexandrrosen | en:cnk:czesl-plain [2018/08/07 12:52] (current) – alexandrrosen | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | [[http://ucnk.ff.cuni.cz/ | + | ~~NOTOC~~ |
+ | ====== The CzeSL-plain corpus ====== | ||
+ | |||
+ | The learner corpus **CzeSL-plain** (**Cz**ech as a **S**econd **L**anguage, | ||
+ | |||
+ | The institutions involved in the creation of the corpus include: //Technical University of Liberec// as the beneficiary of the support, //Charles University in Prague// and //The Association of Teachers of Czech as a Foreign Language// as partners, as well as a number of elementary schools and high schools, civic associations, | ||
+ | |||
+ | The corpus contains a total of about 2.3 million tokens (revised version 2; for more details see the table below). The revision was prompted by the presence of several dozen Czech pupils' | ||
+ | |||
+ | * **ciz** – transcripts of essays written by non-native speakers in language teaching classes of various types and levels; | ||
+ | * **kval** – academic texts obtained from non-native speakers of Czech studying at Czech universities in Master's or doctoral programmes; | ||
+ | * **rom** – transcripts of texts written at school by pupils and students with a Romani background in communities endangered by social exclusion. | ||
+ | |||
+ | The corpus does not include any other data about the author or about the text itself. | ||
+ | |||
+ | The texts in all three groups were produced by speakers who have not (yet) acquired the Czech linguistic skills of an adult native speaker. As an acquisition corpus, the collection may serve both research on learning and teaching and practical educational purposes. The first two datasets concern Czech as a second/ | ||
+ | |||
+ | The texts were collected in 2009–2012 mostly in schools, i.e. in a formal environment, | ||
+ | |||
+ | The essays and handwritten school exams were collected as manuscripts, | ||
+ | |||
+ | |||
+ | Texts by non-native speakers (the **ciz** part), extended by some newer texts, are available as the [[cnk: | ||
+ | |||
+ | The number of characters in the corpus is somewhat higher than in the original texts because of codes used in the transcription of the manuscripts and the encoding of some foreign and non-standard characters. For example, the string … represents omission (...), //& | ||
+ | |||
+ | |||
+ | ^ Text Type ^ Number of texts (version 2 / version 1) ^ Number of tokens (words + punctuation; | ||
+ | | ciz – essays by foreigners | 8 109 / 8 863 | 1 160 701 / 1 314 901 | | ||
+ | | kval – academic qualification texts | 174 / 176 | 731 816 / 731 816 | | ||
+ | | rom – texts written by Roma students | 4 105 / 4 420 | 428 161 / 428 161 | | ||
+ | | TOTAL | 12 388 / 13 459 | 2 320 678 / 2 474 878 | | ||
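
The TOTAL row can be cross-checked against the three text types. The following is a minimal sketch, assuming the counts are copied by hand from the table above; the names `texts`, `tokens`, and `totals` are illustrative and not part of any corpus tooling.

```python
# Counts copied from the table above as (version 2, version 1) pairs.
# The dictionary and function names are hypothetical, chosen for this sketch.
texts = {"ciz": (8109, 8863), "kval": (174, 176), "rom": (4105, 4420)}
tokens = {
    "ciz": (1160701, 1314901),
    "kval": (731816, 731816),
    "rom": (428161, 428161),
}


def totals(counts):
    """Sum the version 2 and version 1 columns separately."""
    v2 = sum(pair[0] for pair in counts.values())
    v1 = sum(pair[1] for pair in counts.values())
    return v2, v1


text_totals = totals(texts)
token_totals = totals(tokens)

# Compare the computed sums with the TOTAL row of the table.
print("texts  (v2, v1):", text_totals)
print("tokens (v2, v1):", token_totals)
```

Both sums agree with the TOTAL row, so the per-type figures and the totals are internally consistent.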
+ | |||
+ | |||
+ | ===== Citing | ||
+ | |||
+ | <WRAP round tip 70%> | ||
+ | Šebesta, K. – Bedřichová, | ||
+ | </WRAP> | ||
+ | |||
+ | |||
+ | ===== See also ===== | ||
+ | |||
+ | <WRAP round box 49%> | ||
+ | [[en: | ||
+ | </WRAP> | ||