The CzeSL-plain corpus
The learner corpus CzeSL-plain (Czech as a Second Language, plain = without annotation) is one of the results of the project Innovation of Education in the Field of Czech as a Second Language, a part of the operational program Education for Competitiveness, funded by the EU Structural Funds (ESF) and the Czech government.
The institutions involved in creating the corpus include the Technical University of Liberec as the beneficiary of the support, Charles University in Prague and the Association of Teachers of Czech as a Foreign Language as partners, and a number of elementary and high schools, civic associations, NGOs, other institutions and individual collaborators.
The corpus contains about 2.3 million tokens in total (revised version 2; see the table below for details). The revision was needed because several dozen essays by Czech pupils had been included in the ciz part by mistake. The original version 1 can be made available upon request. The corpus includes three subcorpora, marked as three text types:
- ciz – transcripts of essays written by non-native speakers in language teaching classes of various types and levels;
- kval – academic texts obtained from non-native speakers of Czech studying at Czech universities in Master's or doctoral programmes;
- rom – transcripts of texts written at school by pupils and students with Romani background in communities endangered by social exclusion.
The corpus does not include any other data about the author or about the text itself.
The texts in all three groups were produced by speakers who have not (yet) acquired the Czech linguistic skills of an adult native speaker. As an acquisition corpus, the collection may serve both research on learning and teaching and practical educational purposes. The first two datasets concern Czech as a second/foreign language and constitute an L2 acquisition (learner) subcorpus, while the third dataset is a subcorpus focusing on L1 acquisition (Czech is not considered a foreign language for the students with Romani background monitored in this project). This is the very first publicly available corpus of this type for Czech.
The texts were collected in 2009–2012 mostly in schools, i.e. in a formal environment, and included in the corpus with the consent of the relevant institutions and individuals.
The essays and handwritten school exams were collected as manuscripts, scanned and transcribed into electronic form. Academic texts by non-native speakers were obtained from the authors already in electronic form. While these texts were not written in a class or with the aim of being included in a corpus, their final form may have been affected by an automatic spellchecker.
Texts of non-native speakers (the ciz part), extended by some newer texts, are available as the CzeSL-sgt corpus, together with metadata and automatically performed morphosyntactic and error annotation, including the identification of incorrect forms. The CzeSL-plain corpus is also available from the LINDAT-Clarin repository as AKCES 3 and AKCES 4. See also CzeSL – a Learner Corpus of Czech.
Although the CzeSL-plain corpus currently contains no linguistic annotation, its next release will include more texts (the corpus is thus non-reference) and provide automatic identification of incorrect forms and morphosyntactic tags. Some of the texts included in the CzeSL-plain corpus are annotated with correct forms, error labels, morphosyntactic tags and lemmas, and are due for release under a different, purpose-built search interface.
The number of characters in the corpus is somewhat higher than in the original texts because of the codes used in transcribing the manuscripts and in encoding some foreign and non-standard characters. For example, the string … represents omission (…), &priv; indicates anonymization (of a proper name), &img; marks the place where there was a picture in the manuscript, &unclear; stands for an unrecognized word or passage, &rdot; represents the character r with a dot above, etc.
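Because these codes inflate character counts, it can be useful to count or remove them before measuring text length. The following is a minimal Python sketch, not part of the corpus tooling. It assumes the codes follow the `&name;` pattern seen in the examples above and handles only the four codes named there (`&priv;`, `&img;`, `&unclear;`, `&rdot;`). The function names, the labels in `KNOWN_CODES` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Codes named in the description above. The labels are informal glosses,
# not official CzeSL terminology; the full code inventory is an unknown here.
KNOWN_CODES = {
    "priv": "anonymized proper name",
    "img": "picture in the manuscript",
    "unclear": "unrecognized word or passage",
    "rdot": "r with a dot above",
}

# Assumption: every transcription code has the entity-like shape &name;
CODE_PATTERN = re.compile(r"&([A-Za-z]+);")


def count_codes(text):
    """Count every entity-like code in the text, keyed by its name."""
    return Counter(match.group(1) for match in CODE_PATTERN.finditer(text))


def strip_known_codes(text):
    """Delete the known codes; leave any other &name; string untouched."""
    def replace(match):
        return "" if match.group(1) in KNOWN_CODES else match.group(0)
    return CODE_PATTERN.sub(replace, text)


# Invented sample sentence, not taken from the corpus.
sample = "Jmenuji se &priv; a bydlím v &unclear; ulici &img;"
code_counts = count_codes(sample)
cleaned = strip_known_codes(sample)
```

Deleting `&rdot;` outright loses a character of the original text, so a real cleanup step would more likely map it to a suitable Unicode character; which one is appropriate depends on the transcription guidelines.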
Text type | Number of texts (version 2 / version 1) | Number of tokens (words + punctuation; version 2 / version 1) |
---|---|---|
ciz – essays by foreigners | 8 109 / 8 863 | 1 160 701 / 1 314 901 |
kval – academic qualification texts | 174 / 176 | 731 816 / 731 816 |
rom – texts written by Roma students | 4 105 / 4 420 | 428 161 / 428 161 |
TOTAL | 12 388 / 13 459 | 2 320 678 / 2 474 878 |
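
The TOTAL row can be checked against the three subcorpus rows with simple arithmetic. The short Python sketch below copies the figures from the table and verifies the sums; the variable names are illustrative only.

```python
# Each row: (texts v2, texts v1, tokens v2, tokens v1), copied from the table.
subcorpora = {
    "ciz": (8109, 8863, 1160701, 1314901),
    "kval": (174, 176, 731816, 731816),
    "rom": (4105, 4420, 428161, 428161),
}
reported_total = (12388, 13459, 2320678, 2474878)

# Sum each column across the three subcorpora.
computed_total = tuple(sum(column) for column in zip(*subcorpora.values()))
totals_match = computed_total == reported_total

# Share of each subcorpus in the version 2 token count.
token_shares_v2 = {
    name: row[2] / reported_total[2] for name, row in subcorpora.items()
}
```

With these figures, the column sums agree with the TOTAL row, and the ciz part accounts for roughly half of the version 2 tokens.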
Citing CzeSL
Šebesta, K. – Bedřichová, Z. – Hana, J. – Hlaváčková, E. – Hnátková, M. – Hrdlička, M. – Janeš, P. – Jelínek, T. – Křen, M. – Lábus, V. – Lundáková, K. – Petkevič, V. – Pierscieniak, P. – Procházka, P. – Rosen, A. – Skoumalová, H. – Škodová, S. – Šormová, K. – Štindlová, B.: CZESL-PLAIN: akviziční korpus psané češtiny, zvl. přepisů písemných projevů nerodilých mluvčích, version 2 from 22 Jan 2014. Ústav Českého národního korpusu FF UK, Praha 2012. Available on-line: http://www.korpus.cz.