Corpus of monologues: ORATOR
The ORATOR corpus contains monologues by native Czech speakers. Typical situations include a lecture, instruction, guided tour, welcome address, sermon, etc. The corpus is not balanced in any way. The speech is usually prepared, and the speaker has to fit within a given time frame. To our knowledge, no other corpus with this kind of data is available for Czech.
The transcription rules, the linking to the corresponding audio track and most of the metadata follow the ORTOFON and ORAL corpora. The structural attributes used in ORATOR are described here (Czech only). The corpus is lemmatized and morphologically tagged in the same way as the ORAL and ORTOFON corpora.
Version 2 of the corpus, published in 2020, contains more than twice as much data and brings many small improvements in the consistency of the transcription and in the annotation of the corpus.
| Name | ORATOR•v1 | ORATOR•v2 | ORATOR•v3 |
|---|---|---|---|
| Number of positions (tokens) | 736 407 | 1 535 609 | 1 542 133 |
| Number of positions (tokens) without punctuation, hesitations and interjections | 578 398 | 1 207 255 | 1 212 729 |
| Number of word forms (word) | 60 952 | 97 816 | 97 680 |
| Number of conversations recorded | 318 | 489 | 489 |
| Number of utterances | 68 727 | 147 867 | 148 479 |
| Number of unique (different) speakers | 332 | 468 | 468 |
| Length of recordings [hh:mm:ss.ms] | 72:07:47.368 | 148:51:51.56 | 149:27:18.998 |
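The growth between versions can be checked directly from the figures in the table. The following is a minimal sketch, not part of the corpus tooling: the names `STATS`, `to_hours` and `growth` are hypothetical helpers introduced only for this illustration, and the numbers are copied from the table above.

```python
# Figures copied from the table above; the dictionary layout is an
# assumption made for this illustration only.
STATS = {
    "v1": {"tokens": 736_407, "length": "72:07:47.368"},
    "v2": {"tokens": 1_535_609, "length": "148:51:51.56"},
    "v3": {"tokens": 1_542_133, "length": "149:27:18.998"},
}


def to_hours(length: str) -> float:
    """Convert an hh:mm:ss.ms string to a number of hours."""
    hours, minutes, seconds = length.split(":")
    return int(hours) + int(minutes) / 60 + float(seconds) / 3600


def growth(old: str, new: str) -> float:
    """Ratio of token counts between two corpus versions."""
    return STATS[new]["tokens"] / STATS[old]["tokens"]


if __name__ == "__main__":
    for old, new in (("v1", "v2"), ("v2", "v3")):
        hour_ratio = to_hours(STATS[new]["length"]) / to_hours(STATS[old]["length"])
        print(f"{old} -> {new}: tokens x{growth(old, new):.2f}, hours x{hour_ratio:.2f}")
```

For v1 to v2 the token ratio comes out at about 2.09, which is consistent with the statement that version 2 contains more than twice as much data. The change from v2 to v3 is well under one percent.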
Corpus composition and data acquisition
The aim of the corpus is to present the different types of monologues that appear in spoken language and that we are able to capture. It therefore includes not only lectures, as is usual in this type of corpus, but also very short monologues such as introductions to various social events, toasts, welcoming guests, announcing the results of competitions, etc. The speaker often represents a particular institution, discipline or field of interest, or has a clearly defined social role. The collected material was divided into 12 types of situations (see the table "Attributes for the ORATOR corpus: recording data" under the attributes available in the CNK corpora).
Although the corpus consists of monologues, some recordings contain more than one speaker. These are mainly speeches by alternating speakers, with a moderator introducing each of them, or speeches following one another in close succession.
How to cite
Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J.: ORATOR v3: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2025. Retrieved from https://www.korpus.cz.
Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v2: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from https://www.korpus.cz.
Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v1: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2019. Retrieved from https://www.korpus.cz.