Corpus of monologues: ORATOR

Name	ORATOR•v1	ORATOR•v2
Number of positions (tokens)	736 407	1 535 609
Number of positions (tokens) without punctuation, hesitations and interjections	578 398	1 207 255
Number of word forms (word)	60 952	97 816
Number of conversations recorded	318	489
Number of utterances	68 727	147 867
Number of unique (different) speakers	332	468
Length of recordings [hh:mm:ss.ms]	72:07:47.368	148:51:51.56

The ORATOR corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons. The corpus is not balanced in any way. The speech is usually prepared in advance, and the speaker has to fit within a given time frame. To our knowledge, no other corpus with this kind of data is available for Czech.

Transcription rules, the linking of transcripts to the corresponding audio track, and most metadata follow the ORTOFON and ORAL corpora; the structural attributes used in ORATOR are documented separately (Czech only). The corpus is lemmatized and morphologically tagged in the same way as the ORAL and ORTOFON corpora.

Version 2 of the corpus, published in 2020, contains more than twice as much data (1 535 609 tokens versus 736 407) and brings many small improvements in the consistency of the transcription and in the annotation of the corpus.
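
The size comparison between the two versions can be checked directly against the figures in the table above. The following is only a minimal sketch: the names `COUNTS`, `DURATIONS`, `to_seconds` and `growth_ratios` are made up for this illustration and are not part of any corpus tooling.

```python
# Figures copied from the statistics table above, as (v1, v2) pairs.
# All names in this sketch are hypothetical and serve illustration only.
COUNTS = {
    "tokens": (736_407, 1_535_609),
    "tokens without punctuation, hesitations and interjections": (578_398, 1_207_255),
    "utterances": (68_727, 147_867),
}

# Recording lengths in the table's hh:mm:ss.ms notation, as (v1, v2).
DURATIONS = ("72:07:47.368", "148:51:51.56")


def to_seconds(hms: str) -> float:
    """Convert an 'hh:mm:ss.ms' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def growth_ratios() -> dict[str, float]:
    """Return the v2 / v1 ratio for each measure listed above."""
    ratios = {name: v2 / v1 for name, (v1, v2) in COUNTS.items()}
    v1_seconds, v2_seconds = (to_seconds(d) for d in DURATIONS)
    ratios["recording length"] = v2_seconds / v1_seconds
    return ratios


if __name__ == "__main__":
    for name, ratio in growth_ratios().items():
        print(f"{name}: {ratio:.2f}x")
```

By these figures the "more than twice" statement holds for tokens, utterances and recording length. The number of recordings (318 to 489) and of unique speakers (332 to 468) grew by a smaller factor, roughly 1.5 and 1.4 respectively.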

How to cite

Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v2: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2020. Available from: https://www.korpus.cz.

Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: ORATOR v1: Korpus monologů. Ústav Českého národního korpusu FF UK, Praha 2019. Available from: https://www.korpus.cz.
