Differences
This shows you the differences between two versions of the page.
en:cnk:oral [2017/07/04 21:37] – [Zobrazení] veronikapojarova | en:cnk:oral [2023/11/20 12:35] (current) – [ORAL Corpus] michalkren | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== ORAL Corpus | ====== ORAL Corpus | ||
- | The ORAL corpus is a corpus containing the transcribed recordings of predominantly informal conversations taking place between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were either friends or family members) and they were recorded in their natural environment. The recordings were made over the course of ten years, between 2002 and 2011. The corpus is not balanced, with the majority of the data originating from the Bohemia region of the Czech Republic (for more visit the [[en:cnk: | + | The ORAL corpus is a corpus containing the transcribed recordings of predominantly informal conversations taking place between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were either friends or family members) and they were recorded in their natural environment. The recordings were made over the course of ten years, between 2002 and 2011. The corpus is not balanced, with the majority of the data originating from the Bohemia region of the Czech Republic (for more visit the [[cnk: |
The ORAL corpus unifies the corpora [[en: | The ORAL corpus unifies the corpora [[en: | ||
Line 10: | Line 10: | ||
^ Number of [[en: | ^ Number of [[en: | ||
^ Number of [[en: | ^ Number of [[en: | ||
- | ^ Number of unique (different) | + | ^ Number of speakers | |
^ Length of recordings for ORAL2013 + ORAL-Z [hh: | ^ Length of recordings for ORAL2013 + ORAL-Z [hh: | ||
</ | </ | ||
Line 34: | Line 34: | ||
* marking **identical speakers**: in the recordings made in the years 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), any cases of identical speakers were later connected, and in recordings from the years 2008–2011 (ORAL2013 corpus) this congruence had already been marked; identical speakers across both time periods were not marked | * marking **identical speakers**: in the recordings made in the years 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), any cases of identical speakers were later connected, and in recordings from the years 2008–2011 (ORAL2013 corpus) this congruence had already been marked; identical speakers across both time periods were not marked | ||
* adding an **alias** for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)((In the ORAL2013 corpus the alias was formed by a randomly generated string of letters ending with a vowel for women and a consonant for men.)) | * adding an **alias** for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)((In the ORAL2013 corpus the alias was formed by a randomly generated string of letters ending with a vowel for women and a consonant for men.)) | ||
- | * newly added **employment** for all speakers based on the classification of employment and **the percentage of the given speaker' | + | * newly added **employment** for all speakers based on the classification of employment and **the percentage of the given speaker' |
* the **binary categories** remain the same for | * the **binary categories** remain the same for | ||
Line 43: | Line 43: | ||
==== Modification of segmentation ==== | ==== Modification of segmentation ==== | ||
* the maximum **segment length** for recordings linked to audio from the ORAL2013 section of the corpus is 15 words, and 25 words for the ORAL-Z section (made longer in order for the given section to be heard better); transcripts without audio are segmented into speaker turns (one speaker' | * the maximum **segment length** for recordings linked to audio from the ORAL2013 section of the corpus is 15 words, and 25 words for the ORAL-Z section (made longer in order for the given section to be heard better); transcripts without audio are segmented into speaker turns (one speaker' | ||
- | * a **turn which was interrupted** by the second speaker, following which the original topic was **reastablished** is marked with a plus sign '' | + | * a **turn which was interrupted** by the second speaker, following which the original topic was **reestablished** is marked with a plus sign '' |
* **punctuation** in the ORAL2013 and ORAL-Z sections is pause-based; | * **punctuation** in the ORAL2013 and ORAL-Z sections is pause-based; | ||
==== Modification of transcription ==== | ==== Modification of transcription ==== | ||
Line 49: | Line 49: | ||
Wherever possible, the transcription was unified in the following manner: | Wherever possible, the transcription was unified in the following manner: | ||
- | * **written together: | + | * **written together: |
* **written separately**: | * **written separately**: | ||
* **written with a lower case first letter**: names of beverages (// | * **written with a lower case first letter**: names of beverages (// | ||
- | Sensitive personal information is [[en: | + | Sensitive personal information is encoded in the transcription according to the wishes of the recorded speakers. More detailed information and an overview of the transcription symbols can be found in the [[cnk: |
- | More detailed information and an overview of the transcription symbols can be found in the [[en:cnk: | + | |
===== View ===== | ===== View ===== | ||
- | For spoken corpora we have implemented a new, graphic interface for viewing dialogues, which clearly shows the alternating speakers, captures their concurrent speech (for the ORAL2013 and ORAL-Z sections) and which distinctly identifies the speaker with the help of the alias. | + | For spoken corpora we have implemented a new, graphic interface for viewing dialogues, which clearly shows the alternating speakers, captures their concurrent speech (for the ORAL2013 and ORAL-Z sections) and distinctly identifies the speaker with the help of the alias. |
Line 75: | Line 74: | ||
<WRAP round tip 80%> | <WRAP round tip 80%> | ||
- | Kopřivová, | + | Kopřivová, |
- | Kopřivová, | + | Kopřivová, |
Lukeš, D. - Klimešová, | Lukeš, D. - Klimešová, | ||
Line 85: | Line 84: | ||
<WRAP round box 72%> | <WRAP round box 72%> | ||
- | [[en: | + | [[en: |
</ | </ |