Corpus of Spoken Czech ORAL2008

Name: ORAL2008
Number of positions (tokens): 1 349 536
Number of positions (tokens) without punctuation and other marks: 1 000 097
Number of word forms (words): 65 778
Number of recordings of dialogues: 297
Number of utterances: 106 941
Number of speakers: 995
Length of recordings in minutes: 6 883

ORAL2008 is another spoken corpus available within the framework of the Czech National Corpus project. Its aim is an appropriate representation of authentic spoken language. The corpus is built from material recorded throughout Bohemia in 2002–2007 and draws on the same repository of recordings and transcriptions as its predecessor, the ORAL2006 corpus. Transcriptions already included in ORAL2006 are not re-used in ORAL2008, so the two corpora do not overlap. Moreover, ORAL2008 is fully balanced according to the main sociolinguistic categories of the participating speakers: gender – male (M) / female (Z), age – under 35 (I) / over 35 (V), education – elementary or secondary (B) / university (A), and region of childhood residence (according to Bělič's division) – Central Bohemia / Northeast Bohemia / Southwest Bohemia / Czech borderland (these four regions are marginally supplemented by the Bohemian-Moravian transitional region).

ORAL2008 is compiled from transcriptions of 297 recordings. All of the recordings were made in informal situations, which means the speakers knew each other and had friendly relationships. The total length of recordings is 6 883 minutes, that is almost 115 hours, and they contain a total of 1 000 097 words uttered by 995 speakers.
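The totals quoted above can be cross-checked with some simple arithmetic. The following is only an illustrative sketch: the constants are copied from the corpus summary, while the function name `derived_stats` and the derived measures (averages per recording, speaker and utterance) are assumptions introduced here and are not figures published with the corpus.

```python
# Figures copied from the ORAL2008 summary; everything derived from them
# below is illustrative and not an official corpus statistic.
TOKENS_TOTAL = 1_349_536      # positions including punctuation and other marks
TOKENS_NO_PUNCT = 1_000_097   # positions without punctuation and other marks
RECORDINGS = 297
UTTERANCES = 106_941
SPEAKERS = 995
MINUTES = 6_883


def derived_stats():
    """Return a few averages computed from the published totals (hypothetical helper)."""
    return {
        "hours": MINUTES / 60,
        "avg_minutes_per_recording": MINUTES / RECORDINGS,
        "avg_words_per_speaker": TOKENS_NO_PUNCT / SPEAKERS,
        "avg_words_per_utterance": TOKENS_NO_PUNCT / UTTERANCES,
        "share_of_non_word_tokens": 1 - TOKENS_NO_PUNCT / TOKENS_TOTAL,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.2f}")
```

The first value confirms the "almost 115 hours" stated in the text; the remaining values are rough averages only and say nothing about how evenly the material is spread across speakers or recordings.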

The recordings were made and transcribed by students of Prague and regional universities, as well as by other collaborators of the Institute of the Czech National Corpus (ICNC).

Martina Waclawičová (main coordinator)

Citing ORAL2008

Waclawičová, M. – Kopřivová, M. – Křen, M. – Válková, L.: ORAL2008: sociolingvisticky vyvážený korpus neformální mluvené češtiny. Ústav Českého národního korpusu FF UK, Praha 2008. Available on-line:

Waclawičová, M. – Křen, M. – Válková, L. (2009): Balanced Corpus of Informal Spoken Czech: Compilation, Design and Findings. In Proceedings of the 10th Annual Conference of the International Speech Communication Association INTERSPEECH 2009, 1819–1822, Brighton.

See also