Differences

This shows you the differences between two versions of the page.

--- en:cnk:oral [2017/07/04 16:48] – [ORAL Corpus] veronikapojarova
+++ en:cnk:oral [2017/07/05 15:49] – [Modification of transcription] veronikapojarova
@@ Line 8: / Line 8: @@
 ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation or comments |  5 368 392 |
 ^ Number of [[en:pojmy:word| word forms (words)]] |  193 497 |
-^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|conversation recordings]] |  1 546 |
+^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recorded conversations]] |  1 546 |
 ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|speaking turns]] |  696 918 |
 ^ Number of unique (different) speakers |  1 297 |
@@ Line 14: / Line 14: @@
 </WRAP>
-===== Vytvoření korpusu ORAL =====
+===== Creating the ORAL corpus =====
-Korpus vznikl spojením a opravením dat z existujících korpusů ORAL2006, ORAL2008 a ORAL2013 a doplněním o část ORAL-Z, která obsahuje i několik nahrávek formálních situací. Tyto formální situace zachycují komunikaci, v níž jeden z mluvčích zastupuje nějakou instituci – např. pracovní rozhovor, rozhovor na úřadě, při nakupování apod., nebo jde o připravený mluvený projev, přednášku. Informace o **původním korpusu**, ze kterého nahrávka pochází, umožňuje vytvořit si stejný subkorpus s opravenými daty, doplněný lemmatizací a morfologickým značkováním.
+The corpus was created by merging and correcting data from the existing corpora ORAL2006, ORAL2008 and ORAL2013, and by adding the ORAL-Z section, which also contains several recordings of formal situations. These formal situations capture communication in which one of the speakers represents an institution (e.g. a job interview or a conversation at an office or in a shop), or they consist of prepared speech, such as a lecture. Information about the **original corpus** from which a recording was taken makes it possible to create the same subcorpus with corrected data, supplemented with lemmatization and morphological tagging.
-Kvůli [[:cnk:oral#Úprava transkripce|opravám a změnám tokenizace]] doznaly i dříve zveřejněné složky korpusu ORAL změn ve velikosti. Pro přehled a případné srovnání s původními korpusy zde uvádíme velikosti všech částí nového korpusu (počet pozic bez interpunkce a komentářů / počet pozic celkem):
+Due to [[en:cnk:oral#Úprava transkripce|corrections and changes to tokenization]], even previously published sections of the ORAL corpus have changed in size. To provide an overview and a comparison with the original corpora, we have included the size of all sections in the new corpus (number of positions without punctuation and comments / total number of positions):
   * ORAL2006: 999 380 / 1 149 678
@@ Line 25: / Line 25: @@
   * ORAL-Z: 623 688 / 763 532
-[{{ :cnk:oral:map.png?600 | Relativní zastoupení mluvčích z různých míst ČR (počty mluvčích podle místa narození).}}]
+[{{ :cnk:oral:map.png?600 | Relative representation of speakers from various regions of the Czech Republic (number of speakers according to place of birth).}}]
-Absolutní počty mluvčích podle místa narození i s údaji o zeměpisné šířce a délce jsou k dispozici {{:cnk:oral:geocounts.xlsx|ke stažení ve formátu .xlsx}}.
+The absolute numbers of speakers by place of birth, together with latitude and longitude coordinates, are available {{:cnk:oral:geocounts.xlsx|for download in .xlsx format}}.
-==== Úprava sociolingvistických údajů ====
+==== Modification of sociolinguistic data ====
-  * **nářeční oblasti** (8 tradičních + české a moravské pohraničí) byly změněny podle členění v ČJA (Balhar, 1992) a jejich hranice upraveny podle novějších výzkumů (viz [[cnk:dialekt#mapa_narecnich_oblasti_cr|mapa nářečních oblastí]])
+  * **dialect regions** (8 traditional + Bohemian and Moravian border areas) were changed to follow the division used in the ČJA (Balhar, 1992) and their borders were adjusted according to more recent research (see [[cnk:dialekt#mapa_narecnich_oblasti_cr|the map of dialect regions]])
-  * identifikace **shodných mluvčích**: v rámci nahrávek pořízených během let 2002–2007 (korpusy ORAL2006, ORAL2008 a ORAL-Z) byli zpětně propojeni shodní mluvčí, v nahrávkách z let 2008–2011 (korpus ORAL2013) už tato shoda označena byla; shodní mluvčí mezi oběma časovými obdobími označováni nebyli
+  * identifying **identical speakers**: in recordings made in 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), recurring speakers were linked retrospectively; in recordings from 2008–2011 (ORAL2013 corpus) such matches had already been marked; identical speakers across the two periods were not linked
-  * doplnění **přezdívky** pro identifikaci totožného mluvčího: každý mluvčí je v korpusu ORAL označen náhodně vybraným českým křestním jménem odpovídajícího pohlaví + identifikačním číslem (např. Simona_450)((V korpusu ORAL2013 byla přezdívka tvořena náhodně vygenerovaným shlukem písmen, pro ženy zakončena vokálem, pro muže konsonantem.))
+  * adding an **alias** for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)((In the ORAL2013 corpus the alias was formed by a randomly generated string of letters ending with a vowel for women and a consonant for men.))
-  * nově doplněno pro všechny mluvčí **zaměstnání** podle klasifikace zaměstnání a **údaj o tom, kolika procenty se dotyčný mluvčí podílí** na počtu tokenů (korpusových pozic) v nahrávce (viz [[pojmy:atributy_strukturni#atributy_spolecne_vsem_korpusum_rady_oral|údaje o mluvčím]])
+  * newly added for all speakers: **occupation**, based on the classification of occupations, and **each speaker's percentage share** of the tokens (corpus positions) in the recording (see [[en:pojmy:atributy_strukturni#atributy_spolecne_vsem_korpusum_rady_oral|speaker details]])
-  * stejné zůstávají **binární kategorie** pro
+  * the **binary categories** remain the same for
-    * pohlaví: ženy, muži
+    * gender: female, male
-    * věk: 18–35 let, 35 let a více
+    * age: 18–35 years, 35 years and up
-    * vzdělání: nižší (ZŠ, SŠ) a vyšší (VŠ i započaté)
+    * education: lower (primary school, high school) and higher (university education - including unfinished)
-==== Úprava segmentace ====
+==== Modification of segmentation ====
-  * maximální **délka segmentů** u nahrávek spojených se zvukem z části korpusu ORAL2013 je 15 slov, u části ORAL-Z 25 slov (prodlouženo pro lepší poslech příslušného úseku); transkripty bez zvuku jsou členěny na repliky (úsek řeči jednoho mluvčího, než je vystřídán komunikačním partnerem)
+  * the maximum **segment length** for recordings linked to audio is 15 words in the ORAL2013 section of the corpus and 25 words in the ORAL-Z section (extended to make the given passage easier to listen to); transcripts without audio are segmented into speaking turns (a stretch of speech by one speaker before another participant takes over)
-  * **přerušení repliky** druhým mluvčím, po kterém došlo k **navázání** na původní téma, se označuje znaménkem plus ''(+)''; pokud nedošlo k navázání na původní téma, znaménkem minus ''(-)''
+  * a **turn interrupted** by another speaker, after which the original topic was **resumed**, is marked with a plus sign ''(+)''; if the original topic was not resumed, it is marked with a minus sign ''(-)''
-  * **interpunkce** v částech ORAL2013 a ORAL-Z je pauzová; syntaktická interpunkce, užívaná pro korpusy ORAL2006 a ORAL2008, byla změněna následujícím způsobem: čárky byly smazány bez náhrady, tečky byly nahrazeny čárkami
+  * **punctuation** in the ORAL2013 and ORAL-Z sections is pause-based; syntactic punctuation, used in the ORAL2006 and ORAL2008 corpora, was altered in the following way: commas were deleted with no replacement, full stops were replaced by commas
-==== Úprava transkripce ====
+==== Modification of transcription ====
-Transkripce ve spojeném korpusu ORAL zachovává většinu transkripčních zásad platných pro korpusy, v některých případech však došlo k jejich úpravě a sjednocení ((Všechny již publikované korpusy zároveň zůstávají v referenční, neměnné podobě.)). Přepis dat z části ORAL-Z odpovídá v podstatě transkripčním zásadám korpusu {{:cnk:prepisovaci_pravidla_oral2013.pdf|ORAL2013}}. Rozdílnost transkripce je způsobena nejen chybami a změnou pravidel, ale často i možností dubletního zápisu v psaných textech.
+Transcription in the merged ORAL corpus retains most of the transcription rules that applied to the original corpora. In some cases, however, the rules have been modified and unified ((All previously published corpora also remain available in their unaltered reference form.)). The transcription of the ORAL-Z section essentially follows the transcription rules of the {{:cnk:prepisovaci_pravidla_oral2013.pdf|ORAL2013}} corpus. Differences in transcription are caused not only by errors and changes to the rules, but often also by the existence of variant spellings in written texts.
-Tam, kde to bylo možné, byla transkripce sjednocována následujícím způsobem:
+Wherever possible, the transcription was unified in the following manner:
-   *** psaní dohromady:**  slova cizího původu (//nonstop, secondhand//), citátová spojení (//apriori, defacto//), spřežky s možností dvojího zápisu (//bezesporu, načerno, vodmalička//), číslovky s komponentem krát (//čtyřikrát//), substantivizované číslovky (//dvacetdevítka//), spojky (//anebo, abysem//), citoslovce (//bubu, čičí, díkybohu//),
+   *** written together:**  words of foreign origin (//nonstop, secondhand//), quoted phrases (//apriori, defacto//), compound expressions with two possible spellings (//bezesporu, načerno, vodmalička//), numerals with the component "krát" (//čtyřikrát//), substantivized numerals (//dvacetdevítka//), conjunctions (//anebo, abysem//), interjections (//bubu, čičí, díkybohu//),
-   *** psaní zvlášť**: víceslovné kontaktové výrazy (//no no; prosim tě//), spojky (//i když//), číslovky (//čtyři sta, dvacet dva, dvacátýho devátýho//), víceslovná adverbia (//přece jenom, všude možně//), výrazy s komponentem //(ne)vím// (//nevim kam; nevím co, bůh ví, čert ví//) a spojení předložky a zájmena //na ňho//.
+   *** written separately**: multiword contact expressions (//no no; prosim tě//), conjunctions (//i když//), numerals (//čtyři sta, dvacet dva, dvacátýho devátýho//), multiword adverbials (//přece jenom, všude možně//), expressions with the component //(ne)vím// (//nevim kam; nevím co, bůh ví, čert ví//) and the combination of a preposition and a pronoun //na ňho//.
-  *** psaní s malým počátečním písmenem**: jména nápojů (//frankovka, mattonka, gambrinus//), značky vozidel (//fabia, fiat, zetor//), internetových vyhledávačů //google, youtube//
+  *** written with a lower case first letter**: names of beverages (//frankovka, mattonka, gambrinus//), vehicle brand names (//fabia, fiat, zetor//), internet search engines //google, youtube//
-Citlivé osobní údaje jsou v přepisech [[cnk:oral:pravidla#anonymizacni_znacky|kódovány]] podle přání nahrávajících.
+Sensitive personal information is [[en:cnk:oral:pravidla#anonymizacni_znacky|encoded]] in the transcription according to the wishes of the recorded speakers.
-Podrobnější údaje a přehled transkripčních značek se nachází v oddílu [[cnk:oral:pravidla|Transkripce]].
+More detailed information and an overview of the transcription symbols can be found in the [[en:cnk:oral:pravidla|Transcription]] section.
-===== Zobrazení  =====
+===== View =====
-Pro mluvené korpusy byl zároveň implementován nový, názorný způsob zobrazení dialogu, který přehledně ukazuje střídání mluvčích, zachycuje jejich souběžný hovor (pro části ORAL2013 a ORAL-Z) a pomocí přezdívky jednoznačně identifikuje mluvčí.
+For spoken corpora we have also implemented a new, clearer way of displaying dialogue, which shows the alternation of speakers at a glance, captures overlapping speech (for the ORAL2013 and ORAL-Z sections) and unambiguously identifies each speaker by alias.
-[{{:cnk:oral5_promluvy_kocka.png | Zobrazení promluv a překryvu v dialogu. }}]
+[{{:cnk:oral5_promluvy_kocka.png | The depiction of utterances and overlaps in dialogue. }}]
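
The punctuation change described under "Modification of segmentation" (commas deleted with no replacement, full stops replaced by commas) can be illustrated with a minimal sketch. It assumes the transcript is already tokenized with punctuation marks as separate tokens; the function name `convert_syntactic_punctuation` is hypothetical and is not taken from the corpus tooling.

```python
from typing import List


def convert_syntactic_punctuation(tokens: List[str]) -> List[str]:
    """Apply the described rule to a tokenized transcript.

    Assumption: punctuation marks are separate tokens. Original commas
    are dropped first, so the only commas left in the output are the
    ones that replace full stops.
    """
    converted = []
    for token in tokens:
        if token == ",":
            # Original commas are deleted with no replacement.
            continue
        if token == ".":
            # Full stops are replaced by commas.
            converted.append(",")
        else:
            converted.append(token)
    return converted


sample = ["no", ",", "to", "je", "pravda", "."]
result = convert_syntactic_punctuation(sample)
print(result)
```

Handling both cases in a single pass over the original tokens matters here: replacing full stops first and deleting commas afterwards would also remove the newly introduced commas.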
