Corpus name	Dialekt_dial•v2	Dialekt_ort•v2
Number of positions (tokens)	310 200	298 539
Number of positions (tokens) without punctuation and other symbols	223 281	223 327
Number of word forms (words)	33 715	25 360
Number of recordings	972
Number of utterances	43 628
Number of speakers	291
Length of recordings (hh:mm:ss.ms)	27:43:21.423
Publication date	December 23rd, 2021

DIALEKT corpus

The DIALEKT corpus presents traditional regional dialects recorded across the entire Czech Republic. The dialect material was obtained by transcribing sound recordings from all dialect regions of the Czech Republic; in addition, several probes were recorded in Poland. The corpus is composed of two levels. The older level contains recordings made from the end of the 1950s until the 1980s. The newer level contains probes covering the period from the 1990s until the present. Both levels contain language data that capture archaic dialectal elements which do not generally occur in present-day usage.

The second version of the dialect corpus contains more than 220 000 words and will gradually expand. We expect it to serve not only specialists (dialectologists, other linguists and researchers from related fields) but also, for example, as a practical learning aid for high schools and universities. In the future, it should also be supplemented with interactive maps showing dialectal features of the individual regional dialects, excerpts from transcripts and recordings from selected locations, and other useful additions.

Composition of DIALEKT and data collection

The DIALEKT corpus represents all dialect regions in the Czech Republic (see Map of dialect regions in CR), which means that the language material is regionally varied. Probes from the Bohemian, Moravian and Silesian border areas have so far not been included in the data collection. Currently, our top priority is the collection of sufficient language data, and therefore we are not yet taking steps to balance the corpus.

Corpus name	Dialekt_dial•v1	Dialekt_ort•v1
Number of positions (tokens)	128 289	126 131
Number of positions (tokens) without punctuation and other symbols	99 552	99 581
Number of word forms (words)	19 189	15 061
Number of recordings	324
Number of utterances	9 745
Number of speakers	178
Length of recordings (hh:mm:ss.ms)	12:40:24.771
Publication date	June 6th, 2017
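
The figures in the two version tables of this document can be compared directly. The following is a minimal sketch of that arithmetic for the dialectological tier; the numbers are copied from the tables, while the dictionary keys and function names are illustrative assumptions, not part of any CNC tool.

```python
# Figures copied from the version tables in this document
# (dialectological tier, Dialekt_dial v1 and v2).
V1 = {
    "tokens": "128 289",
    "tokens_no_punct": "99 552",
    "recordings": "324",
    "utterances": "9 745",
    "speakers": "178",
    "length": "12:40:24.771",
}
V2 = {
    "tokens": "310 200",
    "tokens_no_punct": "223 281",
    "recordings": "972",
    "utterances": "43 628",
    "speakers": "291",
    "length": "27:43:21.423",
}


def parse_count(text):
    """Turn a space-grouped count such as '310 200' into an integer."""
    return int(text.replace(" ", ""))


def parse_length(text):
    """Turn an 'hh:mm:ss.ms' duration into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def growth(key):
    """Return the v2 / v1 ratio for one row of the tables."""
    parse = parse_length if key == "length" else parse_count
    return parse(V2[key]) / parse(V1[key])


if __name__ == "__main__":
    for key in V1:
        print(f"{key}: x{growth(key):.2f}")
```

The number of recordings, for instance, tripled exactly between the two versions (324 to 972), while the other rows grew by different factors.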

Part of the older level is composed of language material acquired by the Department of Dialectology of the Institute of the Czech Language of the Academy of Sciences of the Czech Republic, v. v. i., published in the appendix to the Czech language atlas (Balhar 2011), which is also the source of the recordings made in Poland. The remainder of the older level is composed of private collections made by individuals, most of which have also been published. The newer level of the corpus is composed of the collections of institutions (mostly individual university faculties), private collections of individuals and, last but not least, the collections of dialect probes made by the Institute of the Czech National Corpus.

Regarding the method of data collection, the principles commonly used in Czech dialectology are applied. In this phase of acquiring dialect material, our primary focus is on capturing the oldest state of traditional territorial dialects. In both corpus levels, the dialect field research is therefore concerned exclusively with members of the oldest generation (at this point we have not discovered generational differences), in order to capture the original dialect features. The speakers are predominantly locals from rural areas whose ancestors had been living in the same location for generations, who only rarely relocated and who were part of the agricultural way of life or practiced a craft. The most frequently chosen dialect speakers were those over 60 years of age, born between the end of the 19th century and the first half of the 20th century.

The conversations have a rather informal character, even though the fieldworkers (interviewers) made the recordings with the informants (dialect speakers) in the form of guided interviews, a method used in dialectology. The majority of the transcribed dialect recordings contain usually unprepared, monologue-type speech taking place in a private domestic environment. The topics of the talks usually relate to traditional rural life and the world at the time and are therefore connected to agriculture, crafts, local customs and traditions, folklore, events of the period etc., e.g. Weaving, About the Cursed Snake, The Beginning of World War II. In these talks, dialectisms from all language levels are preserved (phonetic and phonological, morphological, syntactic and lexical).

The dialect corpus also contains an extensive sociolinguistic tagging system, which can be used to create subcorpora.

Map of dialect regions in CR

Processing dialect recordings

Dialect material in the DIALEKT corpus is processed using the ELAN tool (developed at the Max Planck Institute for Psycholinguistics, Nijmegen) with two transcription tiers, dialectological and orthographic; see transcription principles (Czech only). The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC). Like the ORAL and ORTOFON corpora, DIALEKT is lemmatized and morphologically tagged. Due to the extensive variability of dialect material and the lack of sufficient training data, tagging and lemmatization were extremely complicated, and this should be kept in mind when interpreting the results.
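
To illustrate how two time-aligned transcription tiers can be read side by side, here is a minimal sketch that parses an ELAN annotation file (EAF, an XML format) with the Python standard library. The tier IDs "dial" and "ort", the sample document and its placeholder segment texts are assumptions made for this example; the actual tier names and file layout used for DIALEKT are not stated here. The sketch also assumes that both tiers consist of time-aligned annotations with matching boundaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical tier IDs; the real DIALEKT tier names are not given in this text.
DIAL_TIER = "dial"
ORT_TIER = "ort"

# Invented miniature EAF document with placeholder segment texts.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts5" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts6" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="dial" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>DIAL-1</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>DIAL-2</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ort" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts5">
        <ANNOTATION_VALUE>ORT-1</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a4" TIME_SLOT_REF1="ts5" TIME_SLOT_REF2="ts6">
        <ANNOTATION_VALUE>ORT-2</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tier(root, tier_id):
    """Map (start_ms, end_ms) to the segment text for one tier."""
    # Assumes every time slot carries a TIME_VALUE (true for the sample above).
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    segments = {}
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            segments[(start, end)] = ann.findtext("ANNOTATION_VALUE", default="")
    return segments


def pair_tiers(root):
    """Pair dialectological and orthographic segments by their time span."""
    dial = read_tier(root, DIAL_TIER)
    ort = read_tier(root, ORT_TIER)
    # A segment without an orthographic counterpart is paired with None.
    return [(start, end, dial[(start, end)], ort.get((start, end)))
            for (start, end) in sorted(dial)]


if __name__ == "__main__":
    for start, end, dial_text, ort_text in pair_tiers(ET.fromstring(SAMPLE_EAF)):
        print(f"{start}-{end} ms | {dial_text} | {ort_text}")
```

Segments are matched on their millisecond boundaries rather than on time-slot IDs, because each tier in an EAF file may refer to its own set of time slots.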

After a query is entered in the KonText interface, the results show either one selected transcription tier only, or both tiers simultaneously as parallel corpora displayed side by side. The choice of the primary tier (dialectological or orthographic) is up to the user. All corpus functions are available on the primary tier: it is possible to play the recording segment by segment, change settings to display other information, positional or structural units and attributes etc.

Acknowledgements

We would like to thank all those who took part in acquiring the recordings and those who provided their dialect material for processing. We also thank the editors and reviewers. This corpus could not have been created without the invaluable assistance of dialectologists, especially Jarmila Bachmannová, or without the collaboration with cartographer Karel Kupka. Many thanks to the entire work team.

How to cite

Goláňová, H. – Waclawičová, M. – Lukeš, D.: DIALEKT: nářeční korpus, verze 2 z 23. 12. 2021. Ústav Českého národního korpusu FF UK, Praha 2021. Retrieved from: http://www.korpus.cz

Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz

Goláňová, H. – Waclawičová, M. (2019): The DIALEKT corpus and its possibilities. Jazykovedný časopis, 70(2), 336-344. ISSN 0021-5597.

Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. Jazykovedný časopis, 68(2), 219-228. ISSN 0021-5597.

Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová – Adriana Žáková (eds.): Proceedings of the Eighth International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography). Lüdenscheid: RAM-Verlag, 36-44. ISBN 978-3-942303-32-3.