Differences

This shows you the differences between two versions of the page.

Old revision: en:cnk:dialekt [2017/07/06 16:18] – [Composition of DIALEKT and data collection] veronikapojarova
New revision: en:cnk:dialekt [2022/01/05 16:01] (current) – [How to cite] martinawaclawicova
Line 1: Line 1:
 ~~NOTOC~~
 +
 +<WRAP right 35%>
 +^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v2**</fs>| <fs medium>**Dialekt_ort•v2**</fs>|
 +^ Number of [[en:pojmy:token|positions (tokens)]] |  310 200|  298 539|
 +^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols |  223 281|  223 327|
 +^ Number of [[en:pojmy:word| word forms (words)]] |  33 715|  25 360|
 +^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recordings]] |  972||
 +^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] |  43 628||
 +^ Number of speakers |  291||
 +^ Length of recordings (hh:mm:ss.ms) |  27:43:21.423||
 +^ Publication date |  December 23rd, 2021||
 +</WRAP>
 +
 ====== DIALEKT corpus ======
  
-The **DIALEKT** corpus presents traditional regional dialects captured over the entire Czech Republic. The dialect material was acquired by transcribing sound recordings coming from all dialectal regions of the Czech Republic. Additionally, several probes were recorded in Poland. The corpus is composed of two levels. The older dialectal level contains recordings which were made in the period from the end of the 1950s until the 1980s. The newer level contains probes covering the period from the 1990s until the present. For both layers we have language data which capture archaic dialectal elements which do not generally occur in the present day usage.
+The **DIALEKT** corpus presents traditional regional dialects captured over the entire Czech Republic. The dialect material was acquired by transcribing sound recordings coming from all dialectal regions of the Czech Republic. Additionally, several probes were recorded in Poland. The corpus is composed of two levels. The older dialectal level contains recordings which were made in the period from the end of the 1950s until the 1980s. The newer level contains probes covering the period from the 1990s until the present. For both layers, we have language data which capture archaic dialectal elements which do not generally occur in the present day usage.
 + 
 +The second version of the dialect corpus contains more than 220 000 words and will gradually expand. We assume that it will serve not only for specialists (dialectologists, other linguists and researchers from related fields) but also for example as a practical learning aid for high schools and universities. In the future, it should also be supplemented with interactive maps with dialectal features from the individual regional dialects, excerpts from transcripts and recordings from selected locations, and other useful additions. 
 + 
 +====== Composition of DIALEKT and data collection ======
  
-The first version of the dialect corpus contains approx. 100 000 words and will gradually expand. We assume that it will serve not only for specialists (dialectologists, other linguists and researchers from related fields) but also for example as a practical learning aid for high schools and universities. In the future it should also be supplemented with interactive maps with dialectal features from the individual regional dialects, excerpts from transcripts and recordings from selected locations, and other useful additions.
+The **DIALEKT** corpus contains representations of all dialect regions in the Czech Republic, see [[#Map of dialect regions in CR]], which means that the language material is regionally varied. Probes from the Bohemian, Moravian and Silesian border areas have so far not been included in the data collection. Currently, our top priority is the collection of sufficient language data, and therefore we are not yet taking steps to balance the corpus.
  
 <WRAP right 35%>
-^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial**</fs>| <fs medium>**Dialekt_ort**</fs>|
+^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v1**</fs>| <fs medium>**Dialekt_ort•v1**</fs>|
 ^ Number of [[en:pojmy:token|positions (tokens)]] |  128 289|  126 131|
 ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols |  99 552|  99 581|
Line 15: Line 32:
 ^ Number of speakers |  178||
 ^ Length of recordings (hh:mm:ss.ms) |  12:40:24.771||
 +^ Publication date |  June 6th, 2017||
 </WRAP>
- 
-====== Composition of DIALEKT and data collection ====== 
- 
-The **DIALEKT** corpus contains representations of all dialect regions in the Czech Republic, see [[#Map of dialect regions in CR]], which means that the language material is regionally varied. Probes from the Bohemian, Moravian and Silesian border areas have so far not been included in the data collection. Currently our top priority is the collection of sufficient language data, and therefore we are not yet taking steps to balance the corpus. 
  
 A section of the older level is composed of language material acquired by the Department of Dialectology of the Institute of the Czech Language of the Academy of Sciences of the Czech Republic, v. v. i., published in the appendix to the //Czech language atlas// (Balhar 2011), which is also the source of the recordings made in Poland. The remainder of the older level is composed of private collections made by individuals, most of which have also been published. The newer level of the corpus is composed of the collections of institutions, mostly from separate university faculties, private collections of individuals and last but not least the collections of dialect probes made by the Institute of the Czech National Corpus.
  
-Regarding the method of data collection, the principles commonly used in Czech dialectology are applied. In this phase of acquiring dialect material our primary focus is on capturing the oldest state of traditional territorial dialects. In the case of both corpus levels the dialect field research is therefore concerned exclusivley with members of the oldest generation (at this point we have not discovered generational differences), in order to capture the original dialect features. The speakers are predominantly locals from rural areas whose ancestors had been living in the same location for generations, who only rarely relocated and were part of the agricultural way of life or practised a craft.The most frequently chosen dialect speakers were those over 60 years of age, who were born in the period between the end of the 19th Century and the 1st half of the 20th Century.
+Regarding the method of data collection, the principles commonly used in Czech dialectology are applied. In this phase of acquiring dialect material, our primary focus is on capturing the oldest state of traditional territorial dialects. In the case of both corpus levels the dialect field research is therefore concerned exclusively with members of the oldest generation (at this point we have not discovered generational differences), in order to capture the original dialect features. The speakers are predominantly locals from rural areas whose ancestors had been living in the same location for generations, who only rarely relocated and were part of the agricultural way of life or practiced a craft. The most frequently chosen dialect speakers were those over 60 years of age, who were born in the period between the end of the 19th Century and the 1st half of the 20th Century.
  
-The utterances are rather informal in character, although the explorators (those making the recordings) conducted them with the informants (dialect speakers) as guided interviews, a method used in dialectology. Among the transcribed dialect recordings, the prevailing type is unprepared monologic speech produced in a private home setting. The topics of the utterances relate to traditional rural life and the world of that time; they are thus connected with agriculture, crafts, local customs and traditions, folklore, events of the period, etc., e.g. Weaving, About the Enchanted Snake, The Beginning of World War II. These utterances preserve dialectisms from all linguistic levels (phonetic and phonological, morphological, syntactic and lexical).
+The conversations have a rather informal character, even though the explorators (interviewers) made the recordings with the informants (dialect speakers) in the form of guided interviews – a method used in dialectology. The majority of the transcribed dialect recordings contain a usually unprepared monologue-type speech taking place in a private domestic environment. The topics of the talks usually relate to the traditional rural life and the world at the time and are therefore connected to agriculture, crafts, local customs and traditions, folklore, events of the period etc., e.g. Weaving, About the Cursed Snake, The beginning of World War II. In these talks, dialectisms from all language levels are preserved (phonetic and phonological, morphological, syntactic and lexical).
- +
-The dialect corpus also has rich sociolinguistic annotation, which can also be used when creating subcorpora; see the two bottom tables in the section [[pojmy:atributy_strukturni#strukturni_atributy_mluvenych_korpusu|Structural attributes of spoken corpora]].
  
 +The dialect corpus also contains an extensive sociolinguistic tagging system, which can be used to create subcorpora.
  
 ===== Map of dialect regions in CR =====
  
-{{:cnk:oblasti_ridsi_mod2.jpg?direct&500| Map of dialect regions in CR}}
+{{:en:cnk:oblasti_ridsi_2021_wiki.png?direct&500| Map of dialect regions in CR}}
-====== Processing of dialect recordings ======
+====== Processing dialect recordings ======
  
-Dialect material in the **DIALEKT** corpus is processed so that it has two levels of transcription – dialectological and orthographic, see [[cnk:dialekt:pravidla|transcription principles]]. The basic transcription is dialectological and follows the rules for transcribing scholarly dialectological texts. The second level is the orthographic transcription, which is close to the usual form of written texts and is comparable to the general rules set for spoken corpora in the Czech National Corpus (CNC).
+Dialect material in the **DIALEKT** corpus is processed with two transcription tiers – dialectological and orthographic, see [[cnk:dialekt:pravidla|transcription principles]] (Czech only). The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC).
-Like the **[[cnk:oral|ORAL]]** and **[[cnk:ortofon|ORTOFON]]** corpora, the **DIALEKT** corpus is [[cnk:lemtag_mluv|lemmatized and morphologically tagged]]. Given the great variability of dialect material and the lack of training data, however, the tagging and lemmatization process was considerably complicated, and the result should be approached with this in mind.
+**DIALEKT** is, similarly to the corpora **[[en:cnk:oral|ORAL]]** and **[[en:cnk:ortofon|ORTOFON]]**, [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. Due to the extensive variability of dialect material and insufficient training data sets, the tagging and lemmatization process was extremely complicated, and it is necessary to keep this in mind when considering the outcome.
  
-When a query is entered in the [[manualy:kontext:index|KonText]] corpus interface, either only one selected transcription level is displayed, or both levels at once as parallel corpora side by side. It is up to us which level (dialectological or orthographic) we choose as the primary one. All corpus functions are then displayed on that level – it is possible to play part of the recording segment by segment, to set the display of further information, [[pojmy:atributy_pozicni|positional]] or [[pojmy:atributy_strukturni#strukturni_atributy_mluvenych_korpusu|structural units and attributes]], etc.; see [[cnk:dialekt:prace|Working with the Dialekt corpus]].
+After entering a query in the [[en:manualy:kontext:index|KonText]] interface, we are shown either only one selected transcription tier, or both tiers simultaneously as parallel corpora standing next to each other. It is only up to us to select the primary tier (dialectological or orthographic). This tier then displays all of the corpus functions – it is possible to play parts of the recording by the segment, change settings to display other information, positional or structural units and attributes etc.
  
-===== Acknowledgement =====
+===== Acknowledgements =====
  
-We thank everyone who took part in making the recordings and everyone who provided us with their dialect material for processing. Thanks are also due to the editors and revisers. Nor could this corpus have come into being without the valuable help of dialectologists, especially Jarmila Bachmannová, or without the cooperation of the cartographer Karel Kupka. We hereby thank the whole working team.
+We would like to thank all those who took part in acquiring the recordings and those who provided their dialect material for processing. We also thank the editors and reviewers. This corpus could not have been created without the invaluable assistance of dialectologists, especially Jarmila Bachmannová, or without the collaboration with cartographer Karel Kupka. Many thanks to the entire work team.
  
  
Line 48: Line 61:
  
  
-===== How to cite =====
+===== How to cite  =====
 <WRAP round tip 70%>
-Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: //DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Available from WWW: http://www.korpus.cz\\
+Goláňová, H. – Waclawičová, M. – Lukeš, D.: //DIALEKT: nářeční korpus, verze 2 z 23. 12. 2021//. Ústav Českého národního korpusu FF UK, Praha 2021. Retrieved from: http://www.korpus.cz\\
  
-Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, s. 36-44. ISBN 978-3-942303-32-3.\\
+Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: //DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz\\
  
-Goláňová, H. – Kopřivová, M. – Lukeš, D. – Štěpán, M. (2015): Kartografické a geografické zpracování dat z mluvených korpusů. In //Korpus – gramatika – axiologie//, 11, s. 42-54. ISSN: 1804-137X
+Goláňová, H. – Waclawičová, M. (2019): The DIALEKT corpus and its possibilities. //Jazykovedný časopis//, 70(2), 336-344. ISSN 0021-5597.
-</WRAP>+
  
-//Hana Goláňová// was responsible for building the corpus and coordinating the project, //Martina Waclawičová// for preparing the corpus and checking the transcription, //Zuzana Komrsková// for the transcription at the orthographic level, and //David Lukeš// for the technical creation of the corpus; lemmatization and morphological tagging were prepared by //Zuzana Komrsková//, //Marie Kopřivová//, //David Lukeš// and //Petra Poukarová//.
+Komrsková, Z. - Kopřivová, M. - Lukeš, D. - Poukarová, P. - Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.
 + 
 +Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, 36-44. ISBN 978-3-942303-32-3.\\
 + 
 +</WRAP>
  
-===== Related links =====
+===== Related links =====
  
 <WRAP round box 70%>
-[[cnk:dialekt:pravidla|Transcription in the DIALEKT corpus]] • [[cnk:dialekt:prace|Working with the DIALEKT corpus]] • [[cnk:ortofon|ORTOFON]] • [[cnk:diakorp|DIAKORP]] • [[pojmy:synchronni|Synchronic corpus]] • [[pojmy:reprezentativnost|Representativeness]] • [[pojmy:diachronni|Diachrony, diachronic corpus]] • [[cnk:struktura#korpusy_mluvene|Spoken corpora]] • [[cnk:lemtag_mluv|Lemmatization and tagging of spoken corpora]]
+[[en:cnk:ortofon|ORTOFON]] • [[en:cnk:diakorp|DIAKORP]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]]
 </WRAP>