en:cnk:oral [2017/07/04 18:31] – [Úprava sociolingvistických údajů] veronikapojarova | en:cnk:oral [2023/11/20 12:35] (current) – [ORAL Corpus] michalkren |
---|
====== ORAL Corpus ====== | ====== ORAL Corpus ====== |
The ORAL corpus is a corpus containing the transcribed recordings of predominantly informal conversations taking place between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were either friends or family members) and they were recorded in their natural environment. The recordings were made over the course of ten years, between 2002 and 2011. The corpus is not balanced, with the majority of the data originating from the Bohemia region of the Czech Republic (for more visit the [[en:cnk:struktura_oral|corpus structure]]). There is only one level of transcription, and wherever it was possible, it was unified along with tokenization for all parts of the corpus. | The ORAL corpus is a corpus containing the transcribed recordings of predominantly informal conversations taking place between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were either friends or family members) and they were recorded in their natural environment. The recordings were made over the course of ten years, between 2002 and 2011. The corpus is not balanced, with the majority of the data originating from the Bohemia region of the Czech Republic (for more visit the [[cnk:struktura_oral|corpus structure]]; Czech only). There is only one level of transcription, and wherever it was possible, it was unified along with tokenization for all parts of the corpus. |
The ORAL corpus unifies the corpora [[en:cnk:oral2006|ORAL2006]], [[en:cnk:oral2008|ORAL2008]], [[en:cnk:oral2013|ORAL2013]] and the as yet unpublished recordings ORAL-Z. The overall size of the corpus is 5 368 391 words, with a total recording time of 582 hours. Some of the transcripts are not linked to the audio (data from the corpora ORAL2006 and ORAL2008). The corpus is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. It uses the same type of [[en:seznamy:tagy|morphological tagging]] as the contemporary written corpora. | The ORAL corpus unifies the corpora [[en:cnk:oral2006|ORAL2006]], [[en:cnk:oral2008|ORAL2008]], [[en:cnk:oral2013|ORAL2013]] and the as yet unpublished recordings ORAL-Z. The overall size of the corpus is 5 368 391 words, with a total recording time of 582 hours. Some of the transcripts are not linked to the audio (data from the corpora ORAL2006 and ORAL2008). The corpus is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. It uses the same type of [[en:seznamy:tagy|morphological tagging]] as the contemporary written corpora. |
| |
^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recorded conversations]] | 1 546 | | ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recorded conversations]] | 1 546 | |
^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|speaking turns]] | 696 918 | | ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|speaking turns]] | 696 918 | |
^ Number of unique (different) speakers | 1 297 | | ^ Number of speakers | 2 807 | |
^ Length of recordings for ORAL2013 + ORAL-Z [hh:mm:ss.ms] | 354:44:36.722 | | ^ Length of recordings for ORAL2013 + ORAL-Z [hh:mm:ss.ms] | 354:44:36.722 | |
</WRAP> | </WRAP> |
* marking **identical speakers**: in the recordings made in the years 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), any cases of identical speakers were later connected, and in recordings from the years 2008–2011 (ORAL2013 corpus) this congruence had already been marked; identical speakers across both time periods were not marked | * marking **identical speakers**: in the recordings made in the years 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), any cases of identical speakers were later connected, and in recordings from the years 2008–2011 (ORAL2013 corpus) this congruence had already been marked; identical speakers across both time periods were not marked |
* adding an **alias** for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)((In the ORAL2013 corpus the alias was formed by a randomly generated string of letters ending with a vowel for women and a consonant for men.)) | * adding an **alias** for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)((In the ORAL2013 corpus the alias was formed by a randomly generated string of letters ending with a vowel for women and a consonant for men.)) |
* newly added **employment** for all speakers based on the classification of employment and **the percentage of the given speaker's share** in the number of tokens (positions in the corpus) in the recording (see [[en:pojmy:atributy_strukturni#atributy_spolecne_vsem_korpusum_rady_oral|speaker details]]) | * newly added **employment** for all speakers based on the classification of employment and **the percentage of the given speaker's share** in the number of tokens (positions in the corpus) in the recording |
| |
* the **binary categories** remain the same for | * the **binary categories** remain the same for |
* education: lower (primary school, high school) and higher (university education - including unfinished) | * education: lower (primary school, high school) and higher (university education - including unfinished) |
| |
==== Úprava segmentace ==== | ==== Modification of segmentation ==== |
* maximální **délka segmentů** u nahrávek spojených se zvukem z části korpusu ORAL2013 je 15 slov, u části ORAL-Z 25 slov (prodlouženo pro lepší poslech příslušného úseku); transkripty bez zvuku jsou členěny na repliky (úsek řeči jednoho mluvčího, než je vystřídán komunikačním partnerem) | * the maximum **segment length** for recordings linked to audio is 15 words in the ORAL2013 section of the corpus and 25 words in the ORAL-Z section (lengthened to make the given stretch easier to listen to); transcripts without audio are segmented into speaker turns (a stretch of speech by one speaker before the communication partner takes over) |
* **přerušení repliky** druhým mluvčím, po kterém došlo k **navázání** na původní téma, se označuje znaménkem plus ''(+)''; pokud nedošlo k navázání na původní téma, znaménkem minus ''(-)'' | * a **turn interrupted** by the second speaker, after which the original topic was **resumed**, is marked with a plus sign ''(+)''; if the original topic was not resumed, it is marked with a minus sign ''(-)'' |
* **interpunkce** v částech ORAL2013 a ORAL-Z je pauzová; syntaktická interpunkce, užívaná pro korpusy ORAL2006 a ORAL2008, byla změněna následujícím způsobem: čárky byly smazány bez náhrady, tečky byly nahrazeny čárkami | * **punctuation** in the ORAL2013 and ORAL-Z sections is pause-based; syntactic punctuation, used in the ORAL2006 and ORAL2008 corpora, was altered in the following way: commas were deleted with no replacement, full stops were replaced by commas |
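The punctuation rule above (commas deleted without replacement, full stops replaced by commas) can be illustrated with a minimal Python sketch. The function name and the sample sentence are hypothetical and not taken from the corpus tooling. The sketch assumes that every comma and full stop in the input is a punctuation mark, so abbreviations, decimal numbers and ellipses are not handled.

```python
def convert_syntactic_punctuation(text: str) -> str:
    """Apply the two replacement rules to one transcript line.

    Hypothetical helper for illustration only; it assumes that every
    "," and "." in the input is a punctuation mark.
    """
    # Rule 1: delete the original (syntactic) commas first, so that they
    # cannot be confused with the commas introduced by rule 2.
    without_commas = text.replace(",", "")
    # Rule 2: replace full stops with commas.
    converted = without_commas.replace(".", ",")
    # Normalize whitespace in case a deleted comma left a double space.
    return " ".join(converted.split())


print(convert_syntactic_punctuation("no jo, to je pravda. tak jo."))
```

The order of the two steps matters: if full stops were replaced first, the newly created commas would be deleted together with the original ones.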
==== Úprava transkripce ==== | ==== Modification of transcription ==== |
Transkripce ve spojeném korpusu ORAL zachovává většinu transkripčních zásad platných pro korpusy, v některých případech však došlo k jejich úpravě a sjednocení ((Všechny již publikované korpusy zároveň zůstávají v referenční, neměnné podobě.)). Přepis dat z části ORAL-Z odpovídá v podstatě transkripčním zásadám korpusu {{:cnk:prepisovaci_pravidla_oral2013.pdf|ORAL2013}}. Rozdílnost transkripce je způsobena nejen chybami a změnou pravidel, ale často i možností dubletního zápisu v psaných textech. | Transcription in the merged ORAL corpus retains most of the transcription rules that applied to the corpora. However, in a number of cases they have been modified and unified ((At the same time, all previously published corpora remain in their unaltered reference form.)). The transcription of data from the ORAL-Z section essentially conforms to the transcription rules of the {{:cnk:prepisovaci_pravidla_oral2013.pdf|ORAL2013}} corpus. Differences in transcription are caused not only by errors and rule changes, but often also by the existence of variant spellings in written texts. |
| |
Tam, kde to bylo možné, byla transkripce sjednocována následujícím způsobem: | Wherever possible, the transcription was unified in the following manner: |
*** psaní dohromady:** slova cizího původu (//nonstop, secondhand//), citátová spojení (//apriori, defacto//), spřežky s možností dvojího zápisu (//bezesporu, načerno, vodmalička//), číslovky s komponentem krát (//čtyřikrát//), substantivizované číslovky (//dvacetdevítka//), spojky (//anebo, abysem//), citoslovce (//bubu, čičí, díkybohu//), | *** written together:** words of foreign origin (//nonstop, secondhand//), quoted phrases (//apriori, defacto//), adverbial compounds with two possible spellings (//bezesporu, načerno, vodmalička//), numerals with the component "krát" (//čtyřikrát//), substantivized numerals (//dvacetdevítka//), conjunctions (//anebo, abysem//), interjections (//bubu, čičí, díkybohu//), |
*** psaní zvlášť**: víceslovné kontaktové výrazy (//no no; prosim tě//), spojky (//i když//), číslovky (//čtyři sta, dvacet dva, dvacátýho devátýho//), víceslovná adverbia (//přece jenom, všude možně//), výrazy s komponentem //(ne)vím// (//nevim kam; nevím co, bůh ví, čert ví//) a spojení předložky a zájmena //na ňho//. | *** written separately**: multiword contact expressions (//no no; prosim tě//), conjunctions (//i když//), numerals (//čtyři sta, dvacet dva, dvacátýho devátýho//), multiword adverbials (//přece jenom, všude možně//), expressions with the component //(ne)vím// (//nevim kam; nevím co, bůh ví, čert ví//) and combinations of a preposition and a pronoun (//na ňho//). |
*** psaní s malým počátečním písmenem**: jména nápojů (//frankovka, mattonka, gambrinus//), značky vozidel (//fabia, fiat, zetor//), internetových vyhledávačů //google, youtube// | *** written with a lowercase initial letter**: names of beverages (//frankovka, mattonka, gambrinus//), vehicle brand names (//fabia, fiat, zetor//), names of internet search engines (//google, youtube//) |
| |
Citlivé osobní údaje jsou v přepisech [[cnk:oral:pravidla#anonymizacni_znacky|kódovány]] podle přání nahrávajících. | Sensitive personal information is encoded in the transcription according to the wishes of the recorded speakers. More detailed information and an overview of the transcription symbols can be found in the [[cnk:oral:pravidla|Transcription]] section (Czech only). |
Podrobnější údaje a přehled transkripčních značek se nachází v oddílu [[cnk:oral:pravidla|Transkripce]]. | |
| |
===== Zobrazení ===== | ===== View ===== |
Pro mluvené korpusy byl zároveň implementován nový, názorný způsob zobrazení dialogu, který přehledně ukazuje střídání mluvčích, zachycuje jejich souběžný hovor (pro části ORAL2013 a ORAL-Z) a pomocí přezdívky jednoznačně identifikuje mluvčí. | For spoken corpora, a new, illustrative way of displaying dialogue has also been implemented. It clearly shows the alternation of speakers, captures their overlapping speech (for the ORAL2013 and ORAL-Z sections) and unambiguously identifies each speaker by alias. |
| |
| |
[{{:cnk:oral5_promluvy_kocka.png | Zobrazení promluv a překryvu v dialogu. }}] | [{{:cnk:oral5_promluvy_kocka.png | The depiction of utterances and overlaps in dialogue. }}] |
| |
| |
| |
<WRAP round tip 80%> | <WRAP round tip 80%> |
Kopřivová, M. - Lukeš, D. - Komrsková, Z. - Poukarová, P. - Waclawičová, M. - Benešová, L. – Křen, M.: //ORAL: korpus neformální mluvené češtiny, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from : http://www.korpus.cz | Kopřivová, M. - Lukeš, D. - Komrsková, Z. - Poukarová, P. - Waclawičová, M. - Benešová, L. – Křen, M.: //ORAL: korpus neformální mluvené češtiny, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz |
| |
Kopřivová, M. - Lukeš, D. - Komrsková, Z. - Poukarová, P.: Korpus ORAL: sestavení, lemmatizace a morfologické značkování. In //Korpus - Gramatika - Axiologie// 2017 (in print). | Kopřivová, M. - Lukeš, D. - Komrsková, Z. - Poukarová, P. (2017): Korpus ORAL: sestavení, lemmatizace a morfologické značkování. In //Korpus - Gramatika - Axiologie// 15, 47-67. |
| |
Lukeš, D. - Klimešová, P. - Komrsková, Z. - Kopřivová, M. (2015): Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger. In: //TSD 2015//, Ed. P. Král and V. Matoušek. Springer International Publishing, 342-350. | Lukeš, D. - Klimešová, P. - Komrsková, Z. - Kopřivová, M. (2015): Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger. In: //TSD 2015//, Ed. P. Král and V. Matoušek. Springer International Publishing, 342-350. |
| |
<WRAP round box 72%> | <WRAP round box 72%> |
[[en:cnk:oral:pravidla|Transcription in the ORAL corpus]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:oral2006|ORAL2006]] • [[en:cnk:oral2008|ORAL2008]] • [[en:cnk:oral2013|ORAL2013]] • [[en:cnk:dialekt|Dialect]] • [[en:pojmy:mluveny|Spoken language corpus]] • [[en:pojmy:atributy_strukturni#strukturni_atributy_korpusu_rady_oral|ORAL corpus structure]] • [[en:kurz:hledani_v_mluvenych_korpusech|Searching in spoken corpora]] • [[en:kurz:hledani_ORTOFON|Searching in the ORTOFON corpus]] | [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:oral2006|ORAL2006]] • [[en:cnk:oral2008|ORAL2008]] • [[en:cnk:oral2013|ORAL2013]] • [[en:cnk:dialekt|Dialect]] |
</WRAP> | </WRAP> |