Corpus of monologues: ORATOR

The ORATOR corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons. The corpus is not balanced in any way. The speech is usually prepared, and the speaker has to fit within a given time frame. To our knowledge, no other corpus with this kind of data is available for Czech.

The transcription rules, the linking to the corresponding audio track and most of the metadata follow the ORTOFON and ORAL corpora. The structural attributes used in ORATOR are described here (Czech only). The corpus is lemmatized and morphologically tagged in the same way as the ORAL and ORTOFON corpora.

Name: ORATOR
Number of positions (tokens): 736 407
Number of positions (tokens) without punctuation, hesitations and interjections: 578 398
Number of word forms (word): 60 952
Number of conversations recorded: 318
Number of utterances: 68 727
Number of unique (different) speakers: 332
Length of recordings [hh:mm:ss.ms]: 72:07:47.368
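
The figures above can be combined into a few rough derived rates. The following is only an illustrative sketch in Python: the helper names `parse_count` and `parse_duration` are hypothetical, and treating the positions without punctuation, hesitations and interjections as "words" is an assumption made for this example, not a definition taken from the corpus documentation.

```python
def parse_count(text: str) -> int:
    """Convert a space-grouped number such as '736 407' to an int."""
    return int(text.replace(" ", ""))


def parse_duration(text: str) -> float:
    """Convert an hh:mm:ss.ms string to a number of seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the table above.
tokens = parse_count("736 407")
words = parse_count("578 398")  # assumption: tokens minus punctuation etc. ~ words
utterances = parse_count("68 727")
duration_s = parse_duration("72:07:47.368")

# Rough derived rates; these are averages over the whole corpus.
words_per_minute = words / (duration_s / 60)
words_per_utterance = words / utterances
non_word_share = 1 - words / tokens

print(f"words per minute:    {words_per_minute:.1f}")
print(f"words per utterance: {words_per_utterance:.2f}")
print(f"share of punctuation, hesitations and interjections: {non_word_share:.1%}")
```

These averages say nothing about individual recordings or speakers, since the corpus is not balanced.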