Corpus of monologues: ORATOR

The ORATOR corpus contains monologues by native Czech speakers. Typical situations include a lecture, instruction, guided tour, welcome address, sermon, etc. The speech is usually prepared, and the speaker has to fit within a given time frame. To our knowledge, no other corpus with this kind of data is available for Czech.

Transcription rules, linking to the corresponding audio track, and most metadata follow the ORTOFON and ORAL corpora; the structural attributes used in ORATOR are described here (Czech only). The corpus is lemmatized and morphologically tagged in the same way as the ORAL and ORTOFON corpora. The corpus is not balanced in any way.

Name	ORATOR•v1	ORATOR•v2	ORATOR•v3
Number of positions (tokens)	736 407	1 535 609	1 542 133
Number of positions (tokens) without punctuation, hesitations and interjections	578 398	1 207 255	1 212 729
Number of word forms (word)	60 952	97 816	97 680
Number of conversations recorded	318	489	489
Number of utterances	68 727	147 867	148 479
Number of unique (different) speakers	332	468	468
Length of recordings [hh:mm:ss.ms]	72:07:47.368	148:51:51.56	149:27:18.998
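
The figures in the table can be combined into derived measures such as the share of tokens that remain once punctuation, hesitations and interjections are excluded, or the average number of tokens per hour of recording. The following is a minimal sketch, not part of the corpus tooling: the dictionary layout and the function names are illustrative assumptions, and the fractional part of each recording length is read as a decimal fraction of a second.

```python
# Figures copied from the table above; the dictionary layout is illustrative.
ORATOR = {
    "v1": {"tokens": "736 407", "tokens_lexical": "578 398", "length": "72:07:47.368"},
    "v2": {"tokens": "1 535 609", "tokens_lexical": "1 207 255", "length": "148:51:51.56"},
    "v3": {"tokens": "1 542 133", "tokens_lexical": "1 212 729", "length": "149:27:18.998"},
}


def parse_count(text):
    """Parse a count written with spaces as thousands separators."""
    return int("".join(text.split()))


def parse_length(text):
    """Convert an 'hh:mm:ss.ms' string to seconds.

    Assumption: the part after the dot is a decimal fraction of a second.
    """
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def summarize(row):
    """Return the lexical-token share and the tokens per recorded hour."""
    tokens = parse_count(row["tokens"])
    lexical = parse_count(row["tokens_lexical"])
    hours = parse_length(row["length"]) / 3600
    return {
        "lexical_share": lexical / tokens,
        "tokens_per_hour": tokens / hours,
    }


if __name__ == "__main__":
    for version, row in ORATOR.items():
        stats = summarize(row)
        print(
            f"ORATOR {version}: "
            f"{stats['lexical_share']:.1%} lexical tokens, "
            f"{stats['tokens_per_hour']:.0f} tokens per hour"
        )
```

In all three versions, roughly four fifths of the positions remain after punctuation, hesitations and interjections are removed.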

Corpus composition and data acquisition

The aim of the corpus is to present the different types of monologues that appear in spoken language and that we are able to capture. It therefore contains not only lectures, as is usual in this type of corpus, but also very short monologues such as introductions to various social events, toasts, welcomes to guests, announcements of competition results, etc. The speaker often represents a particular institution, discipline or field of interest, or has a clearly defined social role. The collected material was divided into 12 types of situations.

Although these are monologues, there are also recordings with a larger number of speakers. These include mainly speeches by alternating speakers with input from a moderator introducing each speaker, or speeches following in close succession.

The original criteria for including a recording in the corpus were that the speech must not be read and that it must take place in the presence of an audience. This excludes speeches prepared for the web, as it is impossible to know whether they were recorded repeatedly and subsequently edited. Such recordings therefore cannot guarantee an authentic capture of a monologue under normal conditions, in which the speaker faces certain expectations from the audience and is influenced by its presence and by the form of the event. However, during data collection we repeatedly found that on certain occasions a (partially) read form is a usual part of monologues: for example, at ceremonies where the form has to be observed (e.g. graduation) or is even legally binding (wedding ceremony), when quotations form part of a lecture, or when the speech is interpreted. We have therefore decided to include a small number (18 recordings) of read or partially read speeches to give a more comprehensive picture of monologues. For the same reason, we have also selected a few recordings (9 in total) without an audience that were made available to the general public via the internet; these are lectures or New Year's speeches. Each of these two types accounts for about 3% of the whole corpus.

The recordings were made at various locations in the Czech Republic or were downloaded from the internet with the consent of the speaker. Except for the 9 cases mentioned above, the recordings always capture the communication situation in the presence of an audience and in an authentic environment. The corpus is not balanced by the gender of the speakers either; men predominate.

How to cite

Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J.: ORATOR v3: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2025. Retrieved from https://www.korpus.cz.

Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v2: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from https://www.korpus.cz.

Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v1: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2019. Retrieved from https://www.korpus.cz.
