Differences
This shows you the differences between two versions of the page.
en:cnk:czesl-plain [2015/10/24 11:27] – [Citing CzeSL] version in English vaclavhorky | en:cnk:czesl-plain [2018/08/07 12:52] (current) – alexandrrosen | ||
---|---|---|---|
Line 11: | Line 11: | ||
* **kval** – academic texts obtained from non-native speakers of Czech studying at Czech universities in Master's or doctoral programmes; | * **kval** – academic texts obtained from non-native speakers of Czech studying at Czech universities in Master's or doctoral programmes; | ||
* **rom** – transcripts of texts written at school by pupils and students with Romani background in communities endangered by social exclusion. | * **rom** – transcripts of texts written at school by pupils and students with Romani background in communities endangered by social exclusion. | ||
+ | |||
+ | The corpus does not include any other data about the author or about the text itself. | ||
The texts in all three groups were produced by speakers who have not (yet) acquired the Czech linguistic skills of an adult native speaker. As an acquisition corpus, the texts may serve both for research in the field of learning and teaching and for practical educational purposes. The first two datasets concern Czech as a second/ | The texts in all three groups were produced by speakers who have not (yet) acquired the Czech linguistic skills of an adult native speaker. As an acquisition corpus, the texts may serve both for research in the field of learning and teaching and for practical educational purposes. The first two datasets concern Czech as a second/ | ||
Line 18: | Line 20: | ||
The essays and handwritten school exams were collected as manuscripts, | The essays and handwritten school exams were collected as manuscripts, | ||
- | Although the CzeSL-plain corpus does not contain any linguistic annotation at the moment, its next release will include more texts (the corpus | + | |
+ | Texts of non-native speakers (the **ciz** part), extended by some newer texts, are available as the [[cnk: | ||
The number of characters in the corpus is somewhat higher than in the original texts because of codes used in the transcription of the manuscripts and the encoding of some foreign and non-standard characters. For example, the string … represents omission (...), //& | The number of characters in the corpus is somewhat higher than in the original texts because of codes used in the transcription of the manuscripts and the encoding of some foreign and non-standard characters. For example, the string … represents omission (...), //& |