School Lessons Corpus SCHOLA2010

Name SCHOLA2010
Number of positions (tokens) 1 046 600
Number of positions (tokens) without punctuation and other marks 828 038 or 792 764 1)
Number of word forms (words) 64 329
Number of recordings of dialogues 204
Number of speaker turns 61 285
Number of speakers 2410
Length of recordings in mins. 8605 2)

The SCHOLA2010 corpus is part of the research project MSM 0021620825 (Language as a human activity, its product and factor) conducted by the Institute of Czech Language and Theory of Communication, Faculty of Arts, Charles University in Prague. The corpus is sociologically and didactically unique: it originates in the school environment and records the spoken language used during lessons (mostly standard lessons of about 45 minutes). Users can thus explore the language of teachers and pupils in class. So far, it is the only publicly available corpus of this type. It also differs from the other spoken corpora published by the Institute of the Czech National Corpus (ICNC) in that it contains the language of children and teenagers.

The SCHOLA2010 corpus is compiled from 204 transcriptions of lessons recorded in 2005–2008. The recordings come from different dialect areas of the Czech Republic: 131 samples were recorded in Central Bohemia, 57 in East Moravia (according to Bělič's division in the Outline of Czech Dialectology, 1972, and the division presented in the Czech Language Atlas, 1992–2005). The corpus therefore provides users with territorially diverse language material. Even though the recordings were made in a formal environment, features of common spoken language can also be found in the corpus. In addition to standard Czech, the transcriptions contain common Czech as well as some regional features. Both teachers and pupils knew that they were being recorded, and they (or, in the case of pupils, their parents) consented to the recording and to the use of the recordings in research and for the purposes of the Czech National Corpus. The corpus records 2410 unique speakers. The total length of the recordings is 143 hours and 25 minutes. The corpus contains 1 046 600 positions, that is, 792 764 words (excluding punctuation and commentaries in brackets).

The SCHOLA2010 corpus could not have been built without the significant help and valued cooperation of teachers. In addition, students of the Faculty of Arts and the Faculty of Education, Charles University in Prague, and other collaborators from the Institute of Czech Language and Theory of Communication and the Institute of the Czech National Corpus took part in transcribing the recorded lessons, in editing, and in other specific tasks of the project. We would like to express our thanks to the whole team.

Karel Šebesta (project head) and Hana Goláňová (main coordinator)

Citing SCHOLA2010

Šebesta, K. – Goláňová, H. – Křen, M. – Procházka, P.: SCHOLA2010: korpus mluvené češtiny ve škole – přepisy nahrávek vyučovacích hodin na českých základních a středních školách. Ústav Českého národního korpusu FF UK, Praha 2010. Available on-line: http://www.korpus.cz

2) 143.5 hours