Differences

This shows you the differences between two versions of the page.

--- en:cnk:dialekt [2017/07/18 14:58] – [Composition of DIALEKT and data collection] michalkren
+++ en:cnk:dialekt [2021/12/25 00:59] – lukes
@@ Line 1: / Line 1: @@
 ~~NOTOC~~
+<WRAP right 35%>
+^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v2**</fs>| <fs medium>**Dialekt_ort•v2**</fs>|
+^ Number of [[en:pojmy:token|positions (tokens)]] |  310 200|  298 539|
+^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols |  223 281|  223 327|
+^ Number of [[en:pojmy:word| word forms (words)]] |  33 715|  25 360|
+^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|recordings]] |  972||
+^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] |  43 628||
+^ Number of speakers |  291||
+^ Length of recordings (hh:mm:ss.ms) |  27:43:21.423||
+^ Publication date |  December 23rd, 2021||
+</WRAP>
 ====== DIALEKT corpus ======
@@ Line 7: / Line 20: @@
 <WRAP right 35%>
-^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial**</fs>| <fs medium>**Dialekt_ort**</fs>|
+^ <fs medium>Corpus name</fs> | <fs medium>**Dialekt_dial•v1**</fs>| <fs medium>**Dialekt_ort•v1**</fs>|
 ^ Number of [[en:pojmy:token|positions (tokens)]] |  128 289|  126 131|
 ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation and other symbols |  99 552|  99 581|
@@ Line 15: / Line 28: @@
 ^ Number of speakers |  178||
 ^ Length of recordings (hh:mm:ss.ms) |  12:40:24.771||
+^ Publication date |  June 6th, 2017||
 </WRAP>
@@ Line 34: / Line 48: @@
 ====== Processing dialect recordings ======
-Dialect material in the **DIALEKT** corpus is processed with two transcription tiers – dialectological and orthographic, see [[en:cnk:dialekt:pravidla|transcription principles]]. The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC).
+Dialect material in the **DIALEKT** corpus is processed with two transcription tiers – dialectological and orthographic, see [[cnk:dialekt:pravidla|transcription principles]] (Czech only). The basic transcript is dialectological and is based on the rules for the transcription of scientific dialectological texts. The second transcription tier contains the orthographic transcription, which approaches the usual form of written texts and is comparable to the general rules established for spoken corpora in the Czech National Corpus (CNC).
 Like the corpora **[[en:cnk:oral|ORAL]]** and **[[en:cnk:ortofon|ORTOFON]]**, **DIALEKT** is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]]. Due to the extensive variability of dialect material and insufficient training data sets, the tagging and lemmatization process was extremely complicated, and this should be kept in mind when interpreting the results.
-After a query is entered in the [[en:manualy:kontext:index|KonText]] interface, the results show either a single selected transcription tier or both tiers side by side as parallel corpora. The user chooses which tier is primary (dialectological or orthographic). All corpus functions are available on the primary tier: it is possible to play the recording segment by segment and to change settings to display other information, such as [[en:pojmy:atributy_pozicni|positional]] or [[en:pojmy:atributy_strukturni#strukturni_atributy_mluvenych_korpusu|structural units and attributes]]; see [[en:cnk:dialekt:prace|Working with the DIALEKT corpus]].
+After a query is entered in the [[en:manualy:kontext:index|KonText]] interface, the results show either a single selected transcription tier or both tiers side by side as parallel corpora. The user chooses which tier is primary (dialectological or orthographic). All corpus functions are available on the primary tier: it is possible to play the recording segment by segment and to change settings to display other information, such as positional or structural units and attributes.
 ===== Acknowledgements =====
@@ Line 49: / Line 63: @@
 ===== How to cite  =====
 <WRAP round tip 70%>
+Goláňová, H. – Waclawičová, M. – Lukeš, D.: //DIALEKT: nářeční korpus, verze 2 z 23. 12. 2021//. Ústav Českého národního korpusu FF UK, Praha 2021. Retrieved from: http://www.korpus.cz\\
 Goláňová, H. – Waclawičová, M. – Komrsková, Z. – Lukeš, D. – Kopřivová, M. – Poukarová, P.: //DIALEKT: nářeční korpus, verze 1 z 2. 6. 2017//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz\\
-Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, s. 36-44. ISBN 978-3-942303-32-3.\\
+Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.\\
+Goláňová, H. (2015): A new dialect corpus: DIALEKT. In Katarína Gajdošová - Adriana Žáková (eds.): //Proceedings of the Eight International Conference Slovko 2015 (Natural Language Processing, Corpus Linguistics, Lexicography)//. Lüdenscheid: RAM-Verlag, 36-44. ISBN 978-3-942303-32-3.\\
-Goláňová, H. – Kopřivová, M. – Lukeš, D. – Štěpán, M. (2015): Kartografické a geografické zpracování dat z mluvených korpusů. In //Korpus – gramatika – axiologie//, 11, s. 42-54. ISSN: 1804-137X
+Goláňová, H. – Kopřivová, M. – Lukeš, D. – Štěpán, M. (2015): Kartografické a geografické zpracování dat z mluvených korpusů. In //Korpus – gramatika – axiologie//, 11, 42-54. ISSN: 1804-137X
 </WRAP>
@@ Line 61: / Line 79: @@
 <WRAP round box 70%>
-[[en:cnk:dialekt:pravidla|Transcription in the DIALEKT corpus]] • [[en:cnk:dialekt:prace|Working with the DIALEKT corpus]] • [[en:cnk:ortofon|ORTOFON]] • [[en:cnk:diakorp|DIAKORP]] • [[en:pojmy:synchronni|Synchronic corpora]] • [[en:pojmy:reprezentativnost|Representativity]] • [[en:pojmy:diachronni|Diachrony, diachronic corpora]] • [[en:cnk:struktura#korpusy_mluvene|Spoken corpora]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]]
+[[en:cnk:ortofon|ORTOFON]] • [[en:cnk:diakorp|DIAKORP]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]]
 </WRAP>
