Differences

This shows you the differences between two versions of the page.

--- en:cnk:orator [2025/04/29 09:47] – [Corpus composition and data acquisition] martinawaclawicova
+++ en:cnk:orator [2026/01/23 11:48] (current) – [Morphological tagging of the ORATOR corpus] krivan
@@ Line 1: / Line 1: @@
 ====== Corpus of monologues: ORATOR ======
-The ORATOR corpus contains monologues by native Czech speakers. The typical situations include a lecture, instruction, guided tour, welcome address, sermon etc. The corpus is not balanced in any way. The speech is usually prepared and the speaker has to fit within the given time frame. To our knowledge, there is no corpus with this kind of data available for Czech.
+The ORATOR corpus contains monologues by native Czech speakers. The typical situations include a lecture, instruction, guided tour, welcome address, sermon etc. The speech is usually prepared and the speaker has to fit within the given time frame. To our knowledge, there is no corpus with this kind of data available for Czech.
-Transcription rules, linking to the corresponding audio track and most metadata follow the [[en:cnk:ortofon|ORTOFON]] and [[en:cnk:oral|ORAL]] corpora, structural attributes used in ORATOR are described [[pojmy:atributy_strukturni|here]] (Czech only). The corpus is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]] in the same way as the ORAL and ORTOFON corpora.
+Transcription rules, linking to the corresponding audio track and most metadata follow the [[en:cnk:ortofon|ORTOFON]] and [[en:cnk:oral|ORAL]] corpora; the structural attributes used in ORATOR are described [[pojmy:atributy_strukturni|here]] (Czech only). The corpus is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]] in the same way as the ORAL and ORTOFON corpora. The corpus is not balanced in any way.
-An updated version 2 of this corpus was published in 2020, with more than twice as much data and featuring many small improvements in the consistency of the transcription and in the annotation of the corpus.
 <WRAP 45%>
@@ Line 20: / Line 18: @@
 ====== Corpus composition and data acquisition ======
-The aim of the corpus is to present the different types of monologues that appear in spoken language and that we are able to capture. Thus, it is not only lectures, as is usual in this type of corpus, but also very short monologues such as introductions to various social events, toasts, welcoming guests, announcing the results of competitions, etc. The speaker often represents a particular institution, discipline or field of interest, or has a clearly defined social role. The collected material was divided into 12 types of situations.
+The aim of the corpus is to document and present a range of monologue types occurring in spoken language that can be reliably captured. Unlike other corpora of this type that focus predominantly on academic lectures, this corpus also includes brief monologues such as opening remarks at social events, toasts, welcome addresses, award announcements, and similar formats. In many of these situations, the speaker represents a specific institution, professional domain, or area of interest, and often assumes a clearly defined social role. The collected material has been categorized into 12 distinct situational types.
+Although the primary focus is on monologic speech, the corpus also contains recordings involving multiple speakers. These typically consist of sequences of speeches by different speakers introduced by a moderator or delivered in rapid succession during a single event.
+The original criteria for including a recording in the corpus were that it must not be a read speech and that it must take place in the presence of an audience. This means that it must not be a speech prepared for the web, as it is impossible to know whether it has been repeatedly recorded and additionally edited. It is therefore not possible to guarantee an authentic capture of a monologue under normal conditions, where the speaker is exposed to certain expectations from the audience, is influenced by their presence and by the form of the event in question. However, during the data collection we repeatedly came across the fact that on certain occasions the (partially) read form is a usual part of monologues, for example, because they are ceremonies where the form has to be observed (e.g. graduation) or even legally binding (wedding ceremony), quotations are usually part of the lecture, or the speech is interpreted. We have therefore decided to include a small number (18 recordings) of read or partially read speeches to complete the picture of monologues more comprehensively. For the same reasons, we have also selected a few (9 in total) recordings without an audience, made available to the general public via the internet; these are lectures or New Year's speeches. Each of these types accounts for about 3% of the scope of the whole corpus.
+The recordings were made at various locations in the Czech Republic or were downloaded from the internet with the consent of the speaker. Except for the 9 cases mentioned above, the recordings always capture the communication situation in the presence of the audience and in an authentic environment. The corpus is also not balanced by the gender of the speakers, with a predominance of men.
+===== Morphological tagging of the ORATOR corpus =====
+The ORATOR v3 corpus is automatically [[en:pojmy:tag|annotated]] with [[en:cnk:syn2020#morphological_tagging|a new morphological tag]] according to the [[en:cnk:anotacni_standard_cnk|unified CNC annotation scheme]]. It recognizes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|aggregates]] (e.g., //vidělas//, //zač//), uses [[en:cnk:syn2020|double-level lemmatization]], and has a verb tag ([[en:cnk:syn2020#verb_tagging_verbtag|verbtag]]).
+Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus (according to the ORTOFON corpus, see [[en:cnk:ortofon#morphological_tagging_of_the_ortofon_corpus|Morphological tagging of the ORTOFON corpus]]).
+The following specific tags are used in the first tag position (word type):
+^  Tag  ^  Meaning  ^
+|  E    | fragments (incomplete words) |
+|  H    | nonverbal sounds (e.g. hesitation) |
+|  M    | comments by transcribers (in round brackets) |
+|  W    | anonymised sections (mainly names) |
+Note: The type of each anonymised section is specified at the basic ''%%word%%'' level: NP – surname, NJ – first name, NN – nickname, NM – place name, NO – other proper names, NT – last two digits of the telephone number.
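The tag table and the note above can be read as two small lookup tables. The following Python sketch is only an illustration of how they might be combined when post-processing exported tokens. The function name `describe_token`, the assumption that the anonymisation code appears as the token's ''%%word%%'' value, and the placeholder tag strings in the usage note are hypothetical and are not part of the corpus documentation.

```python
# Lookup tables copied from the tag table and the note above.
SPOKEN_WORD_TYPES = {
    "E": "fragment (incomplete word)",
    "H": "nonverbal sound (e.g. hesitation)",
    "M": "transcriber comment (in round brackets)",
    "W": "anonymised section (mainly names)",
}

ANONYMISATION_CODES = {
    "NP": "surname",
    "NJ": "first name",
    "NN": "nickname",
    "NM": "place name",
    "NO": "other proper name",
    "NT": "last two digits of the telephone number",
}


def describe_token(word: str, tag: str) -> str:
    """Describe a token using only the first tag position (word type).

    Assumption: for anonymised sections, `word` holds one of the
    anonymisation codes (NP, NJ, ...). Any other word type is reported
    as a regular token, since this sketch covers only the four
    spoken-specific values listed in the table.
    """
    if not tag:
        raise ValueError("tag must not be empty")
    word_type = tag[0]
    if word_type not in SPOKEN_WORD_TYPES:
        return "regular token"
    description = SPOKEN_WORD_TYPES[word_type]
    if word_type == "W" and word in ANONYMISATION_CODES:
        return f"{description}: {ANONYMISATION_CODES[word]}"
    return description
```

For example, `describe_token("NJ", "W-")` reports an anonymised first name. The tag string `"W-"` is a made-up placeholder, because only the first position matters to this sketch.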
+The ORATOR v2 corpus is tagged with the prior morphological tagset used until 2020. Detailed information on the annotation of these previously published corpora can be found on a [[en:cnk:lemtag_mluv|separate page]].
+====== ORATOR v1 (2019) ======
+The ORATOR v1 corpus consists of 318 recordings of 332 speakers from 2005-2019. The length of the recordings ranges from 13 seconds to 49 minutes. Some long lectures are split into multiple parts for technical reasons.
+====== ORATOR v2 (2020) ======
-Although these are monologues, there are also recordings with a larger number of speakers. These include mainly speeches by alternating speakers with input from a moderator introducing each speaker, or speeches following in close succession.
+In 2020, the ORATOR corpus was expanded to more than double its size (more than 1.5 million tokens). The corpus consists of 489 recordings of 468 speakers from 2005-2019. In addition to the increase in corpus size, many minor improvements were made to transcription consistency and annotation. The ORATOR v2 corpus is annotated with the original morphological tag.
-The original criteria for including a recording in the corpus were that it must not be a read speech and that it must take place in the presence of an audience. This means that it must not be a speech prepared for the web, as it is impossible to know whether it has been repeatedly recorded and additionally edited. It is therefore not possible to guarantee an authentic capture of a monologue under normal conditions, where the speaker is exposed to certain expectations from the audience, is influenced by their presence and by the form of the event in question. However, during the data collection we repeatedly came across the fact that on certain occasions the (partially) read form is a usual part of monologues, for example, because they are ceremonies where the form has to be observed (e.g. graduation) or even legally binding (wedding ceremony), quotations are usually part of the lecture, or the speech is interpreted. We have therefore decided to include a small number (18 recordings) of read or partially read speeches to complete the picture of monologues more comprehensively. For the same reasons, we have also selected a few (9 in total) recordings without an audience, made available to the general public via the Internet; these are lectures or New Year's speeches. Each of these types accounts for about 3% of the scope of the whole corpus.
+====== ORATOR v3 (2025) ======
-The recordings were made at various locations in the Czech Republic or were downloaded from the internet with the consent of the spokesperson. Except for the 9 cases mentioned above, the recordings always capture the communication situation in the presence of the audience and in an authentic environment. The corpus is also not balanced by the gender of the speakers, with a predominance of men.
+The third version of the ORATOR corpus contains the same recordings and transcripts as the second version (i.e. over 1.5 million tokens), but it is annotated according to the new Unified CNC Annotation Scheme using a language model that was also trained on spoken data. The corpus also newly includes the ''genphone'' attribute, which gives the automatically generated phonetic form of a word. In addition, several transcription corrections have been made.
 ===== How to cite =====
 <WRAP round tip 70%>
-Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J.: //ORATOR v3: Korpus monologů//. Ústav Českého národního korpusu FF UK, Praha 2025 dostupný z: [[https://www.korpus.cz]].
+Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J.: //ORATOR: Corpus of monologues, version 3, 28. 5. 2025//. Ústav lingvistiky FF UK, Praha 2025. Retrieved from [[https://www.korpus.cz]].
-Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: //ORATOR v2: Korpus monologů//. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from [[https://www.korpus.cz]].
+Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: //ORATOR: Corpus of monologues, version 2, 18. 12. 2020//. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from [[https://www.korpus.cz]].
-Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: //ORATOR v1: Korpus monologů//. Ústav Českého národního korpusu FF UK, Praha 2019. Retrieved from [[https://www.korpus.cz]].
+Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P.: //ORATOR: Corpus of monologues, version 1, 19. 12. 2019//. Ústav Českého národního korpusu FF UK, Praha 2019. Retrieved from [[https://www.korpus.cz]].
 </WRAP>
