Differences
This shows you the differences between two versions of the page.
| Both sides previous revision · Previous revision · Next revision | Previous revision | ||
| en:cnk:orator [2025/05/27 12:23] – [Corpus composition and data acquisition] martinawaclawicova | en:cnk:orator [2025/06/06 13:40] (current) – [Morphological tagging of the ORATOR corpus] martinawaclawicova | ||
|---|---|---|---|
| Line 18: | Line 18: | ||
| ====== Corpus composition and data acquisition ====== | ====== Corpus composition and data acquisition ====== | ||
| - | The aim of the corpus is to document and present a range of monologue types occurring in spoken language that can be reliably captured. Unlike | + | The aim of the corpus is to document and present a range of monologue types occurring in spoken language that can be reliably captured. Unlike |
| Although the primary focus is on monologic speech, the corpus also contains recordings involving multiple speakers. These typically consist of sequences of speeches by different speakers introduced by a moderator or delivered in rapid succession during a single event. | Although the primary focus is on monologic speech, the corpus also contains recordings involving multiple speakers. These typically consist of sequences of speeches by different speakers introduced by a moderator or delivered in rapid succession during a single event. | ||
| - | The original criteria for including a recording in the corpus were that it must not be a read speech and that it must take place in the presence of an audience. This means that it must not be a speech prepared for the web, as it is impossible to know whether it has been repeatedly recorded and additionally edited. It is therefore not possible to guarantee an authentic capture of a monologue under normal conditions, where the speaker is exposed to certain expectations from the audience, is influenced by their presence and by the form of the event in question. However, during the data collection we repeatedly came across the fact that on certain occasions the (partially) read form is a usual part of monologues, for example, because they are ceremonies where the form has to be observed (e.g. graduation) or even legally binding (wedding ceremony), quotations are usually part of the lecture, or the speech is interpreted. We have therefore decided to include a small number (18 recordings) of read or partially read speeches to complete the picture of monologues more comprehensively. For the same reasons, we have also selected a few (9 in total) recordings without an audience, made available to the general public via the Internet; these are lectures or New Year's speeches. Each of these types accounts for about 3% of the scope of the whole corpus. | + | The original criteria for including a recording in the corpus were that it must not be a read speech and that it must take place in the presence of an audience. This means that it must not be a speech prepared for the web, as it is impossible to know whether it has been repeatedly recorded and additionally edited. It is therefore not possible to guarantee an authentic capture of a monologue under normal conditions, where the speaker is exposed to certain expectations from the audience, is influenced by their presence and by the form of the event in question.<br>However, during the data collection we repeatedly came across the fact that on certain occasions the (partially) read form is a usual part of monologues, for example, because they are ceremonies where the form has to be observed (e.g. graduation) or even legally binding (wedding ceremony), quotations are usually part of the lecture, or the speech is interpreted. We have therefore decided to include a small number (18 recordings) of read or partially read speeches to complete the picture of monologues more comprehensively. For the same reasons, we have also selected a few (9 in total) recordings without an audience, made available to the general public via the internet; these are lectures or New Year's speeches. Each of these types accounts for about 3% of the scope of the whole corpus. |
| - | The recordings were made at various locations in the Czech Republic or were downloaded from the internet with the consent of the spokesperson. Except for the 9 cases mentioned above, the recordings always capture the communication situation in the presence of the audience and in an authentic environment. The corpus is also not balanced by the gender of the speakers, with a predominance of men. | + | The recordings were made at various locations in the Czech Republic or were downloaded from the internet with the consent of the speaker. Except for the 9 cases mentioned above, the recordings always capture the communication situation in the presence of the audience and in an authentic environment. The corpus is also not balanced by the gender of the speakers, with a predominance of men. |
| + | |||
| + | ===== Morphological tagging of the ORATOR corpus ===== | ||
| + | |||
| + | The ORATOR v3 corpus is automatically [[en: | ||
| + | |||
| + | Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus (according to the ORTOFON corpus, see [[en: | ||
| + | |||
| + | The following specific tags are used in the first tag position (word type): | ||
| + | |||
| + | ^ Tag ^ Meaning | ||
| + | | E | fragments (incomplete words) | | ||
| + | | H | nonverbal sounds (e.g. hesitation) | | ||
| + | | M | comments by transcribers (in round brackets) | | ||
| + | | W | anonymised sections (mainly names) | | ||
| + | |||
| + | Note: The anonymised sections are specified on a basic level '' | ||
| + | |||
| + | The ORATOR v2 corpus is tagged with the prior morphological tagset used until 2020. Detailed information on the annotation of these previously published corpora can be found on a [[en: | ||
| ====== ORATOR v1 (2019) ====== | ====== ORATOR v1 (2019) ====== | ||
| Line 36: | Line 54: | ||
| ====== ORATOR v3 (2025) ====== | ====== ORATOR v3 (2025) ====== | ||
| - | The ORATOR corpus in its third version contains the same recordings and transcripts as the second version (i.e. over 1.5 million tokens), but they are newly annotated according to the SYN2020 standard. The genphone attribute is also newly included in the corpus, indicating the automatically generated phonetic form of a word. In addition, | + | The ORATOR corpus in its third version contains the same recordings and transcripts as the second version (i.e. over 1.5 million tokens), but they are newly annotated according to the SYN2020 standard. The genphone attribute is also newly included in the corpus, indicating the automatically generated phonetic form of a word. In addition, |
| ===== How to cite ===== | ===== How to cite ===== | ||
| <WRAP round tip 70%> | <WRAP round tip 70%> | ||
| - | Kopřivová, | + | Kopřivová, |
| - | Kopřivová, | + | Kopřivová, |
| - | Kopřivová, | + | Kopřivová, |
| </ | </ | ||