ORAL Corpus

The ORAL corpus contains transcribed recordings of predominantly informal conversations between native speakers of Czech from all regions of the Czech Republic. The speakers knew each other very well (they were friends or family members) and were recorded in their natural environment. The recordings were made over the course of ten years, between 2002 and 2011. The corpus is not balanced: the majority of the data originates from the Bohemia region of the Czech Republic (for details, see the corpus structure). There is only one level of transcription; wherever possible, transcription and tokenization were unified across all parts of the corpus. The ORAL corpus brings together the corpora ORAL2006, ORAL2008 and ORAL2013 and the previously unpublished recordings of ORAL-Z. The overall size of the corpus is 5 368 392 words, with a total recording time of 582 hours. Some of the transcripts (the data from ORAL2006 and ORAL2008) are not linked to audio. The corpus is lemmatized and morphologically tagged, using the same type of morphological tagging as the contemporary written corpora.

Name	ORAL
Number of positions (tokens)	6 361 707
Number of positions (tokens) without punctuation or comments	5 368 392
Number of word forms (words)	193 497
Number of recorded conversations	1 546
Number of speaking turns	696 918
Number of unique (different) speakers	1 297
Length of recordings for ORAL2013 + ORAL-Z [hh:mm:ss.ms]	354:44:36.722

Creating the ORAL corpus

The corpus was created by merging and correcting data from the existing corpora ORAL2006, ORAL2008 and ORAL2013, and by adding the ORAL-Z section, which additionally contains several recordings of formal situations. These formal situations capture communication in which one of the speakers represents an institution (e.g. a job interview or a conversation at an office or in a shop), or a prepared speech such as a lecture. Each recording is labelled with the original corpus it was taken from, which makes it possible to create a subcorpus identical in composition to the original one, but with corrected data and with added lemmatization and morphological tagging.

Due to corrections and changes to tokenization, even previously published sections of the ORAL corpus have changed in size. To provide an overview and a comparison with the original corpora, we have included the size of all sections in the new corpus (number of positions without punctuation and comments / total number of positions):

ORAL2006: 999 380 / 1 149 678
ORAL2008: 995 484 / 1 172 509
ORAL2013: 2 749 840 / 3 275 988
ORAL-Z: 623 688 / 763 532
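
The section sizes listed above can be checked against the corpus totals given in the overview table. The following is only a minimal sketch of that arithmetic; the names `SECTIONS` and `totals` are illustrative and not part of any corpus tool.

```python
# Section sizes as listed above:
# (positions without punctuation or comments, total positions)
SECTIONS = {
    "ORAL2006": (999_380, 1_149_678),
    "ORAL2008": (995_484, 1_172_509),
    "ORAL2013": (2_749_840, 3_275_988),
    "ORAL-Z": (623_688, 763_532),
}


def totals(sections):
    """Sum both size measures over all sections."""
    words = sum(w for w, _ in sections.values())
    positions = sum(p for _, p in sections.values())
    return words, positions


words, positions = totals(SECTIONS)
print(words, positions)  # prints: 5368392 6361707
```

Both sums agree with the table: 5 368 392 positions without punctuation or comments and 6 361 707 positions in total.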

Relative representations of speakers from various regions of the Czech Republic (number of speakers according to place of birth).

The absolute values for the number of speakers according to place of birth, along with longitude and latitude coordinates, are available for download in .xlsx format.

Modification of sociolinguistic data

dialect regions (8 traditional + Bohemian and Moravian border areas) were changed based on the categories used in the ČJA (Balhar, 1992) and their borders were modified based on the latest research (see the map of dialect regions)
marking identical speakers: in the recordings made in 2002–2007 (corpora ORAL2006, ORAL2008 and ORAL-Z), occurrences of the same speaker were linked retrospectively, whereas in the recordings from 2008–2011 (ORAL2013 corpus) this identity had already been marked; identical speakers across the two time periods were not linked
adding an alias for the identification of the same speaker: every single speaker in the ORAL corpus is labelled with a randomly chosen Czech first name of the corresponding gender + identification number (e.g. Simona_450)¹⁾
newly added for all speakers: occupation, based on the classification of occupations, and the speaker's percentage share of the tokens (positions in the corpus) in the recording (see speaker details)
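
A speaker alias of the form shown above (first name, underscore, identification number) can be split programmatically. This is a minimal sketch that assumes every alias follows exactly the pattern of the example Simona_450; the function name `parse_alias` is hypothetical and not part of any corpus tool.

```python
import re

# Assumption: an alias is one or more letters (Czech diacritics included),
# an underscore, and a numeric identifier, as in the example "Simona_450".
ALIAS_PATTERN = re.compile(r"([^\W\d_]+)_(\d+)")


def parse_alias(alias):
    """Return (first_name, speaker_id) or raise ValueError for other formats."""
    match = ALIAS_PATTERN.fullmatch(alias)
    if match is None:
        raise ValueError(f"unexpected alias format: {alias!r}")
    return match.group(1), int(match.group(2))


print(parse_alias("Simona_450"))  # prints: ('Simona', 450)
```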

the binary categories remain the same for
- gender: female, male
- age: 18–35 years, 35 years and up
- education: lower (primary school, high school) and higher (university education - including unfinished)

Modification of segmentation

the maximum segment length for recordings linked to audio is 15 words in the ORAL2013 section and 25 words in the ORAL-Z section (extended so that the relevant passage is easier to listen to); transcripts without audio are divided into speaking turns (a stretch of speech by one speaker before the communication partner takes over)
an interruption of a turn by another speaker after which the original topic was resumed is marked with a plus sign (+); if the original topic was not resumed, it is marked with a minus sign (-)
punctuation in the ORAL2013 and ORAL-Z sections marks pauses; the syntactic punctuation used in the ORAL2006 and ORAL2008 corpora was changed as follows: commas were deleted without replacement and full stops were replaced by commas
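
The conversion of syntactic punctuation in the ORAL2006 and ORAL2008 data (commas deleted without replacement, full stops replaced by commas) can be illustrated as follows. This is only a sketch under the assumption that a transcript is available as a list of tokens with punctuation marks as separate tokens; the function name is hypothetical and the corpus authors' actual procedure is not documented here.

```python
def convert_syntactic_punctuation(tokens):
    """Apply both rules in a single pass over the ORIGINAL tokens.

    Doing it in one pass matters: replacing full stops first and deleting
    commas afterwards would also delete the newly created commas.
    """
    converted = []
    for token in tokens:
        if token == ",":
            continue  # original commas are deleted without replacement
        converted.append("," if token == "." else token)  # full stop -> comma
    return converted


example = ["no", ",", "to", "je", "dobrý", "."]
print(convert_syntactic_punctuation(example))
# prints: ['no', 'to', 'je', 'dobrý', ',']
```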

Modification of transcription

Transcription in the merged ORAL corpus retains most of the transcription principles valid for the individual corpora, but in some cases these were modified and unified ²⁾. The transcription of the ORAL-Z data essentially follows the transcription principles of the ORAL2013 corpus. Differences in transcription are caused not only by errors and changes to the rules, but often also by the fact that written texts allow variant (doublet) spellings.

Wherever possible, transcription was unified as follows:

written as one word: words of foreign origin (nonstop, secondhand), quotational expressions (apriori, defacto), compound adverbs (spřežky) with two possible spellings (bezesporu, načerno, vodmalička), numerals with the component krát (čtyřikrát), nominalized numerals (dvacetdevítka), conjunctions (anebo, abysem), interjections (bubu, čičí, díkybohu)
written separately: multi-word contact expressions (no no; prosim tě), conjunctions (i když), numerals (čtyři sta, dvacet dva, dvacátýho devátýho), multi-word adverbs (přece jenom, všude možně), expressions with the component (ne)vím (nevim kam; nevím co, bůh ví, čert ví) and the combination of preposition and pronoun na ňho
written with a lowercase initial letter: names of beverages (frankovka, mattonka, gambrinus), vehicle makes (fabia, fiat, zetor), internet search engines (google, youtube)

Sensitive personal data are encoded in the transcripts according to the wishes of the people who made the recordings. More detailed information and an overview of the transcription symbols can be found in the Transcription section.

Display

A new, clear way of displaying dialogue was also implemented for spoken corpora: it shows speaker alternation at a glance, captures overlapping speech (for the ORAL2013 and ORAL-Z sections) and unambiguously identifies speakers by their alias.

Display of utterances and overlap in a dialogue.

Sources

Balhar, J. et al. (1992): Český jazykový atlas. Academia, Praha.

Hajič, J. – Hlaváčová, J. (2013): MorfFlex CZ. Univerzita Karlova v Praze, MFF, ÚFAL, Praha.

Straka, M. – Straková, J. – Hajič, J. (2014): Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland: Association for Computational Linguistics, 13–18.

How to cite ORAL

Kopřivová, M. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Waclawičová, M. – Benešová, L. – Křen, M.: ORAL: korpus neformální mluvené češtiny, verze 1 z 2. 6. 2017. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz

Kopřivová, M. – Lukeš, D. – Komrsková, Z. – Poukarová, P.: Korpus ORAL: sestavení, lemmatizace a morfologické značkování. In: Korpus - Gramatika - Axiologie 2017 (in print).

Lukeš, D. – Klimešová, P. – Komrsková, Z. – Kopřivová, M. (2015): Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger. In: TSD 2015, Eds. P. Král and V. Matoušek. Springer International Publishing, 342–350.