Corpus of informal spoken Czech ORAL2013

Name	ORAL2013
Number of positions (tokens)	3 285 508
Number of positions (tokens) without punctuation and other marks	2 785 189
Number of word forms (words)	131 246
Number of recordings of dialogues	835
Number of utterances	394 982
Number of unique (different) speakers	1 297
Length of recordings [hours:minutes]	291:11

ORAL2013 is another spoken corpus available within the framework of the Czech National Corpus project. It aims to provide an adequate representation of authentic spoken language used in informal situations. The design of ORAL2013 is based on its predecessors, the corpora of informal spoken Czech ORAL2006 and ORAL2008, but it also brings several significant improvements: transcription aligned with audio, pause-based punctuation, and regional coverage of the whole Czech Republic.

ORAL2013 comprises 835 recordings made in 2008–2011 that contain 2 785 189 words (3 285 508 tokens including punctuation) uttered by 2 544 speakers, of whom 1 297 are unique. The recordings were made in Bohemia, Moravia and Silesia; their total length is 17 471 minutes (291 hours 11 minutes).
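The figures quoted above can be cross-checked with simple arithmetic. The following is a minimal Python sketch, not part of the corpus tooling; the function name minutes_to_hhmm and the constant names are hypothetical, while the numbers are taken from the summary table.

```python
def minutes_to_hhmm(total_minutes: int) -> str:
    """Convert a length in minutes to the hours:minutes form used in the table."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Values copied from the ORAL2013 summary table.
TOKENS_ALL = 3_285_508      # positions including punctuation and other marks
TOKENS_WORDS = 2_785_189    # positions without punctuation and other marks
TOTAL_MINUTES = 17_471      # total length of the recordings

duration = minutes_to_hhmm(TOTAL_MINUTES)
non_word_tokens = TOKENS_ALL - TOKENS_WORDS  # punctuation and other marks

print(duration)         # 291:11
print(non_word_tokens)  # 500319
```

The converted duration agrees with the length given in the table, and the difference between the two token counts gives the number of positions occupied by punctuation and other marks.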

To ensure that the recordings capture prototypical spontaneous spoken language, the following requirements were met during recording:

physical presence of speakers,
dialogues (two or more speakers),
speakers who know each other,
unpreparedness and spontaneity and
non-public and unofficial situation.

In most cases, the speakers were not aware of being recorded. They gave permission to include the recording in the CNC only afterwards.

ORAL-series corpora: common features

Apart from a different punctuation model, all ORAL-series corpora share the same transcription guidelines (with a few insignificant differences), including anonymization of personal data. They also share the same basic metadata about the recordings and the individual speakers; in ORAL2013 these metadata have only been enriched.

ORAL-series corpora: enhancements of ORAL2013

The transcriptions are aligned with the sound, so that users can hear the actual realization of every expression (for technical reasons, this feature is not available in Bonito 1).
The whole area of the Czech Republic is covered.
The corpus is representative of informal spoken Czech and is approximately balanced. The balance is not perfect (e.g. the ratio of male to female speakers is 51:49 rather than exactly 50:50), because perfect balance would require excluding some valuable material from the corpus. Moreover, full balance is not strictly necessary, as the query interface allows working with relative (and thus comparable) frequencies.
Punctuation of ORAL2013 is pause-based, while its predecessors use traditional syntactic punctuation.
Overlaps are explicitly marked by the value of the structural attribute prekryv.
Speakers who appear in several recordings share the same “nickname”, indicated by the value of the structural attribute oznacenishody.
The metadata also include the type of communication situation.
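
The point about relative frequencies in the list above can be illustrated with a short Python sketch. It assumes the common instances-per-million (ipm) normalization; the source does not say which normalization the query interface uses. The function name ipm and all counts in the example are hypothetical and are not taken from ORAL2013.

```python
def ipm(hits: int, subcorpus_size: int) -> float:
    """Relative frequency as instances per million tokens (assumed convention)."""
    if subcorpus_size <= 0:
        raise ValueError("subcorpus_size must be positive")
    return hits * 1_000_000 / subcorpus_size


# Made-up counts for two subcorpora of unequal size (illustration only).
hits_a, size_a = 150, 1_000_000
hits_b, size_b = 300, 2_000_000

ipm_a = ipm(hits_a, size_a)
ipm_b = ipm(hits_b, size_b)

# The absolute counts differ, but the relative frequencies are the same.
print(ipm_a, ipm_b)
```

Because the counts are normalized by subcorpus size, results from subcorpora of different sizes (for example, male and female speakers) can be compared directly even though the corpus is not perfectly balanced.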

Acknowledgement

We would like to thank everyone who took part in the recording, transcription and subsequent verification of the data, mainly students of the Faculty of Arts, Charles University in Prague. Material was also collected by a number of students and their supervisors from the University of Hradec Králové, the University of West Bohemia in Pilsen, Masaryk University and Palacký University Olomouc. Special thanks for excellent cooperation go to Hana Voralová.

— Lucie Benešová and Martina Waclawičová (main coordinators)

Citing ORAL2013

Benešová, L. – Křen, M. – Waclawičová, M.: ORAL2013: reprezentativní korpus neformální mluvené češtiny. Ústav Českého národního korpusu FF UK, Praha 2013. Available on-line: http://www.korpus.cz

Benešová, L. – Křen, M. – Waclawičová, M. (2015): Korpus spontánní mluvené češtiny ORAL2013. In Časopis pro moderní filologii, 97(1), 42–50. ISSN 0008-7386.

Válková, L. – Waclawičová, M. – Křen, M. (2012): Balanced data repository of spontaneous spoken Czech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 3345–3349. Istanbul: ELRA. ISBN 978-2-9517408-7-7.