Czech Spoken Corpus ORAL2006

Name: ORAL2006
Number of positions (tokens): 1 312 282
Number of positions (tokens) without punctuation and other marks: 1 000 798
Number of word forms (words): 64 495
Number of recordings of dialogues: 221
Number of utterances: 97 112
Number of speakers: 754

ORAL2006 is the third spoken corpus available within the Czech National Corpus project. It captures spoken Czech from the entire area of Czech dialects in the narrow sense of the word. It consists of transcriptions of 221 recordings made in 2002–2006. All recordings were made in informal situations, meaning that the speakers knew each other and were on friendly terms. The recordings total 6 693 minutes, about 111 and a half hours, and contain 1 000 798 words spoken by 754 speakers.

The recordings were transcribed and marked up according to the rules of the earlier spoken corpora PMK and BMK. The sociolinguistic categories of the speakers therefore remain the same: gender – male (M) / female (Z), age – under 35 (I) / over 35 (V), education – elementary or secondary (B) / university (A). In addition, this corpus records each speaker's exact age, level of education (elementary school, secondary school or university) and the dialect area where they spent their childhood, that is, the period when the basis of an individual's language usage is formed. According to Bělič's division, the speakers belong to the following dialect areas: Central Bohemia, Northeast Bohemia, Southwest Bohemia, the Czech borderland and the Bohemian-Moravian transitional region.
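The category letters listed above can be turned into readable labels with a small lookup. The following is only a minimal sketch: the letters and their meanings come from the description above, but the combined three-letter string (for example "ZIA"), the field order gender, age, education, and the function name decode_speaker are assumptions made for illustration and may not match how the corpus actually stores these attributes.

```python
# Letter codes as given in the corpus description.
GENDER = {"M": "male", "Z": "female"}
AGE = {"I": "under 35", "V": "over 35"}
EDUCATION = {"B": "elementary or secondary", "A": "university"}


def decode_speaker(code: str) -> dict:
    """Decode a hypothetical three-letter speaker code such as 'ZIA'.

    Assumption: the letters appear in the order gender, age, education.
    The real corpus markup may store these categories differently.
    """
    if len(code) != 3:
        raise ValueError(f"expected three letters, got {code!r}")
    gender, age, education = code.upper()
    try:
        return {
            "gender": GENDER[gender],
            "age": AGE[age],
            "education": EDUCATION[education],
        }
    except KeyError as exc:
        raise ValueError(f"unknown category letter in {code!r}") from exc


if __name__ == "__main__":
    print(decode_speaker("ZIA"))
```

Keeping the three tables separate means a letter is only accepted in the position where it is defined, so a malformed code raises an error instead of producing a wrong label.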

The recordings were made and the transcriptions carried out primarily by students of Prague universities and other collaborators of the ICNC.

Marie Kopřivová and Martina Waclawičová (main coordinators)

Citing ORAL2006

Kopřivová, M. – Waclawičová, M.: ORAL2006: korpus neformální mluvené češtiny. Ústav Českého národního korpusu FF UK, Praha 2006. Available on-line:

