The ORAL corpus contains transcribed recordings of predominantly informal conversations between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were friends or family members) and were recorded in their natural environment. The recordings were made over ten years, between 2002 and 2011. The corpus is not balanced: the majority of the data originates from Bohemia (for details see the corpus structure page; Czech only). There is only one level of transcription, and wherever possible it was unified, along with tokenization, across all parts of the corpus. ORAL combines the corpora ORAL2006, ORAL2008 and ORAL2013 with the previously unpublished ORAL-Z recordings. The overall size of the corpus is 5 368 391 words, with a total recording time of 582 hours. Some of the transcripts (the data from ORAL2006 and ORAL2008) are not linked to the audio. The corpus is lemmatized and morphologically tagged, using the same type of morphological tagging as the contemporary written corpora.
Name | ORAL |
---|---|
Number of positions (tokens) | 6 361 707 |
Number of positions (tokens) without punctuation or comments | 5 368 392 |
Number of word forms (words) | 193 497 |
Number of recorded conversations | 1 546 |
Number of speaking turns | 696 918 |
Number of speakers | 2 807 |
Length of recordings for ORAL2013 + ORAL-Z [hh:mm:ss.ms] | 354:44:36.722 |
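
A few derived figures follow directly from the table above. The snippet below is only an illustrative sketch: the constant names and the `parse_duration` helper are made up for this example and are not part of any corpus tooling. It computes the share of positions taken up by punctuation and comments, rough averages per speaking turn and per conversation, and converts the `hh:mm:ss.ms` recording length to hours.

```python
from datetime import timedelta

# Figures copied from the table above.
TOTAL_POSITIONS = 6_361_707
WORD_POSITIONS = 5_368_392      # positions without punctuation or comments
CONVERSATIONS = 1_546
SPEAKING_TURNS = 696_918
LINKED_AUDIO = "354:44:36.722"  # ORAL2013 + ORAL-Z, hh:mm:ss.ms


def parse_duration(text: str) -> timedelta:
    """Parse an 'hh:mm:ss.ms' string whose hour field may exceed 24."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Positions that are punctuation or transcriber comments.
non_word_positions = TOTAL_POSITIONS - WORD_POSITIONS
non_word_share = non_word_positions / TOTAL_POSITIONS

# Rough averages; they ignore differences between the corpus sections.
words_per_turn = WORD_POSITIONS / SPEAKING_TURNS
turns_per_conversation = SPEAKING_TURNS / CONVERSATIONS

# Length of the audio-linked sections in hours.
linked_hours = parse_duration(LINKED_AUDIO).total_seconds() / 3600

print(f"punctuation/comment positions: {non_word_positions} ({non_word_share:.1%})")
print(f"words per speaking turn:       {words_per_turn:.2f}")
print(f"turns per conversation:        {turns_per_conversation:.1f}")
print(f"audio-linked recordings:       {linked_hours:.1f} h")
```

These averages are indicative only, since the sections differ in size and in how they were recorded.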
The corpus was created by merging and correcting data from the existing corpora ORAL2006, ORAL2008 and ORAL2013, and by adding the ORAL-Z section, which additionally contains several recordings of formal situations. These formal situations capture communication in which one of the speakers represents an institution (e.g. a job interview, or a conversation at an office or in a shop), or they are prepared speeches such as lectures. Each recording carries information about the original corpus it was taken from, which makes it possible to create a subcorpus identical to that original corpus, but with corrected data and with added lemmatization and morphological tagging.
Due to corrections and changes to tokenization, even previously published sections of the ORAL corpus have changed in size. To provide an overview and a comparison with the original corpora, we have included the size of all sections in the new corpus (number of positions without punctuation and comments / total number of positions):
The absolute values for the number of speakers according to place of birth, along with longitude and latitude coordinates, are available for download in .xlsx format.
(+); if the original topic was not brought up again, it is marked with a minus sign (-)
Transcription in the joint ORAL corpus retains most of the usual corpus transcription rules. However, in a number of cases the rules have been modified and unified. The transcripts in the ORAL-Z section essentially conform to the transcription rules of the ORAL2013 corpus. Differences between the transcriptions arise not only from errors and changed rules, but often also from the fact that written Czech itself allows double entries (variant forms) for some words.
Wherever possible, the transcription was unified in the following manner:
Sensitive personal information is encoded in the transcription according to the wishes of the recorded speakers. More detailed information and an overview of the transcription symbols can be found in the Transcription section (Czech only).
For spoken corpora we have implemented a new graphical interface for viewing dialogues. It clearly shows the alternation of speakers, captures their overlapping speech (for the ORAL2013 and ORAL-Z sections) and identifies each speaker by an alias.
Balhar, J. et al. (1992): Český jazykový atlas. Academia. Praha.
Hajič, J. – Hlaváčová, J. (2013): MorfFlex CZ. Univerzita Karlova v Praze, MFF, ÚFAL, Praha.
Straka, M. – Straková, J. – Hajič, J. (2014): Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland: Association for Computational Linguistics, 3–18.
Kopřivová, M. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Waclawičová, M. – Benešová, L. – Křen, M.: ORAL: korpus neformální mluvené češtiny, verze 1 z 2. 6. 2017. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz
Kopřivová, M. – Lukeš, D. – Komrsková, Z. – Poukarová, P. (2017): Korpus ORAL: sestavení, lemmatizace a morfologické značkování. In: Korpus – Gramatika – Axiologie 15, 47–67.
Lukeš, D. – Klimešová, P. – Komrsková, Z. – Kopřivová, M. (2015): Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger. In: TSD 2015, Ed. P. Král and V. Matoušek. Springer International Publishing, 342–350.