en:cnk:dialekt [2017/07/18 14:58] – [Composition of DIALEKT and data collection] michalkren | en:cnk:dialekt [2022/01/05 16:01] (current) – [How to cite] martinawaclawicova |
---|
~~NOTOC~~ | ~~NOTOC~~ |
| |
| <WRAP right 35%> |
| ^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v2**</fs>| <fs medium>**Dialekt_ort•v2**</fs>| |
| ^ Number of [[en:pojmy:token|positions (tokens)]] | 310 200| 298 539| |
| ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols | 223 281| 223 327| |
| ^ Number of [[en:pojmy:word| word forms (words)]] | 33 715| 25 360| |
| ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recordings]] | 972|| |
| ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] | 43 628|| |
| ^ Number of speakers | 291|| |
| ^ Length of recordings (hh:mm:ss.ms) | 27:43:21.423|| |
| ^ Publication date | December 23rd, 2021|| |
| </WRAP> |
| |
====== DIALEKT corpus ====== | ====== DIALEKT corpus ====== |
| |
The **DIALEKT** corpus presents traditional regional dialects recorded across the entire Czech Republic. The dialect material was acquired by transcribing sound recordings from all dialectal regions of the Czech Republic. Additionally, several probes were recorded in Poland. The corpus is composed of two levels. The older dialectal level contains recordings made from the end of the 1950s until the 1980s. The newer level contains probes covering the period from the 1990s until the present. Both levels contain language data that capture archaic dialectal elements which do not generally occur in present-day usage. | The **DIALEKT** corpus presents traditional regional dialects recorded across the entire Czech Republic. The dialect material was acquired by transcribing sound recordings from all dialectal regions of the Czech Republic. Additionally, several probes were recorded in Poland. The corpus is composed of two levels. The older dialectal level contains recordings made from the end of the 1950s until the 1980s. The newer level contains probes covering the period from the 1990s until the present. Both levels contain language data that capture archaic dialectal elements which do not generally occur in present-day usage. |
| |
The first version of the dialect corpus contains approx. 100 000 words and will be gradually expanded. We expect it to serve not only specialists (dialectologists, other linguists and researchers from related fields) but also, for example, as a practical learning aid for high schools and universities. In the future, it should also be supplemented with interactive maps of dialectal features from the individual regional dialects, excerpts from transcripts and recordings from selected locations, and other useful additions. | The second version of the dialect corpus contains more than 220 000 words and will be gradually expanded. We expect it to serve not only specialists (dialectologists, other linguists and researchers from related fields) but also, for example, as a practical learning aid for high schools and universities. In the future, it should also be supplemented with interactive maps of dialectal features from the individual regional dialects, excerpts from transcripts and recordings from selected locations, and other useful additions. |
| |
| ====== Composition of DIALEKT and data collection ====== |
| |
| The **DIALEKT** corpus represents all dialect regions in the Czech Republic (see [[#Map of dialect regions in CR]]), so the language material is regionally varied. Probes from the Bohemian, Moravian and Silesian border areas have not yet been included in the data collection. Currently, our top priority is to collect sufficient language data, and we are therefore not yet taking steps to balance the corpus. |
| |
<WRAP right 35%> | <WRAP right 35%> |
^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial**</fs>| <fs medium>**Dialekt_ort**</fs>| | ^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v1**</fs>| <fs medium>**Dialekt_ort•v1**</fs>| |
^ Number of [[en:pojmy:token|positions (tokens)]] | 128 289| 126 131| | ^ Number of [[en:pojmy:token|positions (tokens)]] | 128 289| 126 131| |
^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols | 99 552| 99 581| | ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols | 99 552| 99 581| |
^ Number of speakers | 178|| | ^ Number of speakers | 178|| |
^ Length of recordings (hh:mm:ss.ms) | 12:40:24.771|| | ^ Length of recordings (hh:mm:ss.ms) | 12:40:24.771|| |
| ^ Publication date | June 6th, 2017|| |
</WRAP> | </WRAP> |
| |
====== Composition of DIALEKT and data collection ====== | |
| |
The **DIALEKT** corpus represents all dialect regions in the Czech Republic (see [[#Map of dialect regions in CR]]), so the language material is regionally varied. Probes from the Bohemian, Moravian and Silesian border areas have not yet been included in the data collection. Currently, our top priority is to collect sufficient language data, and we are therefore not yet taking steps to balance the corpus. | |
| |
A section of the older level is composed of language material acquired by the Department of Dialectology of the Institute of the Czech Language of the Academy of Sciences of the Czech Republic, v. v. i., published in the appendix to the //Czech language atlas// (Balhar 2011), which is also the source of the recordings made in Poland. The remainder of the older level is composed of private collections made by individuals, most of which have also been published. The newer level of the corpus is composed of the collections of institutions, mostly from separate university faculties, private collections of individuals and last but not least the collections of dialect probes made by the Institute of the Czech National Corpus. | A section of the older level is composed of language material acquired by the Department of Dialectology of the Institute of the Czech Language of the Academy of Sciences of the Czech Republic, v. v. i., published in the appendix to the //Czech language atlas// (Balhar 2011), which is also the source of the recordings made in Poland. The remainder of the older level is composed of private collections made by individuals, most of which have also been published. The newer level of the corpus is composed of the collections of institutions, mostly from separate university faculties, private collections of individuals and last but not least the collections of dialect probes made by the Institute of the Czech National Corpus. |
===== Map of dialect regions in CR ===== | ===== Map of dialect regions in CR ===== |
| |
{{:cnk:oblasti_ridsi_mod2.jpg?direct&500| Map of dialect regions in CR}} | {{:en:cnk:oblasti_ridsi_2021_wiki.png?direct&500| Map of dialect regions in CR}} |
====== Processing dialect recordings ====== | ====== Processing dialect recordings ====== |
| |
Dialect material in the **DIALEKT** corpus is processed with two transcription tiers – dialectological and orthographic, see [[en:cnk:dialekt:pravidla|transcription principles]]. The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC). | Dialect material in the **DIALEKT** corpus is processed with two transcription tiers – dialectological and orthographic, see [[cnk:dialekt:pravidla|transcription principles]] (Czech only). The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC). |
Like the **[[en:cnk:oral|ORAL]]** and **[[en:cnk:ortofon|ORTOFON]]** corpora, **DIALEKT** is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. Due to the extensive variability of dialect material and insufficient training data, tagging and lemmatization were extremely complicated, which must be kept in mind when interpreting the results. | Like the **[[en:cnk:oral|ORAL]]** and **[[en:cnk:ortofon|ORTOFON]]** corpora, **DIALEKT** is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. Due to the extensive variability of dialect material and insufficient training data, tagging and lemmatization were extremely complicated, which must be kept in mind when interpreting the results. |
| |
After a query is entered in the [[en:manualy:kontext:index|KonText]] interface, the results show either one selected transcription tier or both tiers side by side as parallel corpora. The user chooses which tier is primary (dialectological or orthographic). All corpus functions are available on the primary tier: it is possible to play the recording segment by segment, change the settings to display other information, [[en:pojmy:atributy_pozicni|positional]] or [[en:pojmy:atributy_strukturni#strukturni_atributy_mluvenych_korpusu|structural units and attributes]], etc.; see [[en:cnk:dialekt:prace|Working with the DIALEKT corpus]]. | After a query is entered in the [[en:manualy:kontext:index|KonText]] interface, the results show either one selected transcription tier or both tiers side by side as parallel corpora. The user chooses which tier is primary (dialectological or orthographic). All corpus functions are available on the primary tier: it is possible to play the recording segment by segment, change the settings to display other information, positional or structural units and attributes, etc. |
| |
===== Acknowledgements ===== | ===== Acknowledgements ===== |
===== How to cite ===== | ===== How to cite ===== |
<WRAP round tip 70%> | <WRAP round tip 70%> |
| Goláňová, H. – Waclawičová, M. – Lukeš, D.: //DIALEKT: nářeční korpus, verze 2 z 23. 12. 2021//. Ústav Českého národního korpusu FF UK, Praha 2021. Retrieved from: http://www.korpus.cz\\ |
| |
Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: //DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz\\ | Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: //DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz\\ |
| |
Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, pp. 36-44. ISBN 978-3-942303-32-3.\\ | Goláňová, H. – Waclawičová, M. (2019): The DIALEKT corpus and its possibilities. //Jazykovedný časopis//, 70(2), 336-344. ISSN 0021-5597. |
| |
Goláňová, H. – Kopřivová, M. – Lukeš, D. – Štěpán, M. (2015): Kartografické a geografické zpracování dat z mluvených korpusů. In //Korpus – gramatika – axiologie//, 11, pp. 42-54. ISSN: 1804-137X | Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-5597. |
</WRAP> | |
| |
Corpus compilation and project coordination were handled by //Hana Goláňová//, corpus preparation and proofreading of the transcription by //Martina Waclawičová//, the orthographic transcription tier by //Zuzana Komrsková//, and the technical creation of the corpus by //David Lukeš//; lemmatization and morphological tagging were prepared by //Zuzana Komrsková//, //Marie Kopřivová//, //David Lukeš// and //Petra Poukarová//. | Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, 36-44. ISBN 978-3-942303-32-3.\\ |
| |
| </WRAP> |
| |
===== Related links ===== | ===== Related links ===== |
| |
<WRAP round box 70%> | <WRAP round box 70%> |
[[en:cnk:dialekt:pravidla|Transcription in the DIALEKT corpus]] • [[en:cnk:dialekt:prace|Working with the DIALEKT corpus]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:diakorp|DIAKORP]] • [[en:pojmy:synchronni|Synchronic corpora]] • [[en:pojmy:reprezentativnost|Representativity]] • [[en:pojmy:diachronni|Diachrony, diachronic corpora]] • [[en:cnk:struktura#korpusy_mluvene|Spoken corpora]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]] | [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:diakorp|DIAKORP]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]] |
</WRAP> | </WRAP> |
| |