Differences

This shows you the differences between two versions of the page.

en:cnk:ortofon [2024/06/06 15:09] – [Data collection] martinawaclawicova
en:cnk:ortofon [2024/08/05 10:27] (current) – [ORTOFON v3 (2024)] v3 is not balanced vhorky
Line 1: Line 1:
 ====== Corpus of informal spoken Czech with multi-tier transcription: ORTOFON ======
  
-The ORTOFON corpus, with its method of data collection, is a continuation of the corpora of informal spoken Czech from the [[en:cnk:oral|ORAL]] series. Together with the [[en:cnk:dialekt|DIALEKT]] corpus, it is one of the first two spoken corpora of the Czech language which have multi-tier transcription. Same as with the corpora of the ORAL series, ORTOFON also collects spontaneous spoken language used in informal situations between speakers who know each other. Similarly, as in the corpus [[en:cnk:oral2013|ORAL2013]], the speakers come from all over the Czech Republic and selected sociological data are collected about them.
-
-ORTOFON is also the first corpus to be fully balanced regarding all the basic sociolinguistic speaker categories (gender, age group, level of education and region  of childhood residence). The corpus is [[en:cnk:lemtag_mluv|lemmatized morphologically tagged]] in the same manner as the ORAL corpus, the transcription is linked to the corresponding audio track.
+The ORTOFON corpus captures spontaneous spoken language used in informal situations between speakers who know each other. It follows the [[en:cnk:oral|ORAL]] series of informal spoken Czech corpora in its data collection design. The recordings are transcribed in two tiers - orthographic and phonetic. Together with the [[en:cnk:dialekt|DIALEKT]] corpus, these are the first two spoken Czech corpora to have multi-tier transcription. Similar to the [[en:cnk:oral2013|ORAL2013]] corpus, speakers come from all over the Czech Republic and selected sociological information is collected about them. The corpus is lemmatized and morphologically tagged. The transcription is linked to the audio track and the audio can be played back in the KonText corpus interface.
  
 The ORTOFON corpus allows us to explore various aspects of spoken language, i.e. lexis, morphology, syntax, pragmatics, dialogue construction. The corpus is not primarily intended for dialectological ((The [[en:cnk:dialekt|DIALEKT]] corpus is intended for this kind of research.)) or phonetic research, even though a simplified phonetic transcription allows us to verify the existence of pronunciation or regional variants, or phenomena related to pronunciation.
  
-The publication of ORTOFON in connection with the [[en:cnk:oral|ORAL]] corpus presents users the chance to explore informal spoken Czech in the most extensive data complex to date, covering a period of fifteen years (2002-2017).
+The publication of ORTOFON in connection with the [[en:cnk:oral|ORAL]] corpus presents users the chance to explore informal spoken Czech in the most extensive data complex to date, covering a period of eighteen years (2002-2020).
  
 <WRAP 45%>
 ^ <fs medium>Name</fs> | <fs medium>[[en:cnk:ortofon|ORTOFON]]•v1</fs> | <fs medium>[[cnk:ortofon|ORTOFON]]•v2</fs> | <fs medium>[[cnk:ortofon|ORTOFON]]•v3</fs> |
-^ Number of [[en:pojmy:token|positions (tokens)]] |  1 236 508 |  2 560 590 |  XXX |
-^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  1 014 786 |  2 101 214 |  XXX |
-^ Number of [[en:pojmy:word|word forms (words)]] |  65 294 |  101 502 |  XXX |
-^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|conversations recorded]] |  332 |  615 |  XXX |
-^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] |  172 736 |  360 248 |  XXX |
-^ Number of unique (different) speakers |  624 |  960 |  XXX |
-^ Length of recordings [hh:mm:ss.ms] |  102:41:14.247 |  210:09:35.155 |  XXX:XX:XX.XXX |
+^ Number of [[en:pojmy:token|positions (tokens)]] |  1 236 508 |  2 560 590 |  2 976 742 |
+^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  1 014 786 |  2 101 214 |  2 445 793 |
+^ Number of [[en:pojmy:word|word forms (words)]] |  65 294 |  101 500 |  110 127 |
+^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|conversations recorded]] |  332 |  615 |  697 |
+^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] |  172 736 |  360 248 |  419 533 |
+^ Number of unique (different) speakers |  625 |  1 020 |  1 121 |
+^ Length of recordings [hh:mm:ss.ms] |  102:41:14.247 |  210:09:35.155 |  243:00:07.232 |
 </WRAP>
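The recording lengths in the table are given as ''hh:mm:ss.ms'', while the running text quotes them in hours ("almost 103 hours", "243 hours"). A minimal sketch of that conversion follows; the function name ''length_to_hours'' is a hypothetical helper introduced here for illustration and is not part of any CNK tool.

```python
def length_to_hours(length: str) -> float:
    """Convert an 'hh:mm:ss.ms' string to decimal hours.

    The hour field may exceed 99, as it does in the table above.
    """
    hours, minutes, seconds = length.split(":")
    return int(hours) + int(minutes) / 60 + float(seconds) / 3600


# Values copied from the "Length of recordings" row of the table.
lengths = {
    "v1": "102:41:14.247",
    "v2": "210:09:35.155",
    "v3": "243:00:07.232",
}

for version, length in lengths.items():
    print(f"ORTOFON {version}: {length_to_hours(length):.2f} h")
```

Rounded to two decimals this gives roughly 102.69 h for v1 and 243.00 h for v3, which agrees with the figures quoted in the version descriptions.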
  
Line 24: Line 22:
 The corpus captures only informal, spontaneous and natural situations. The material was collected in accordance with the [[en:cnk:oral2013#slozeni_korpusu_a_sber_dat|criteria]] concerning the corpora of the ORAL series:
  
- * physical presence of all speakers in one place (exceptions are telephone conversations on speakerphone, or Skype or Zoom communications, where all participating speakers are recorded throughout the conversation);
- * dialogicality of speeches (two or more speakers talking);
- * close relationship between the speakers;
- * unpreparedness, spontaneity of speech;
- * non-public and informal communication situations.
+  * physical presence of all speakers in one place (exceptions are telephone conversations on speakerphone, or Skype or Zoom communications, where all participating speakers are recorded throughout the conversation);
+  * dialogicality of speeches (two or more speakers talking);
+  * close relationship between the speakers;
+  * unpreparedness, spontaneity of speech;
+  * non-public and informal communication situations.
  
 Due to the presence of the phonetic transcription tier, a greater emphasis was placed on the sound quality of recordings. Selected sociological data about the situation and the speakers were recorded. The recordings capture adult native speakers of the Czech language from all parts of the Czech Republic. The maximum possible degree of authenticity of the individual recordings was achieved by the fact that the speakers were mostly not informed about the recording in advance, but only after it had been completed. All recorded speakers agreed to the use of the recordings for the purposes of the CNK.
Line 65: Line 63:
  
 In its first version, published in 2017, the ORTOFON corpus was the first corpus that was fully balanced across all basic sociolinguistic categories of speakers (gender, age group, level of education, and the dialectal region of childhood residence).\\
-The ORTOFON v1 corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the probes. The recordings were acquired in Bohemia, Moravia, and Silesia, and their total length measures almost 103 hours. More quantitative data can be found on the page dedicated to the [[cnk:struktura_ortofon|composition of the corpus]] (Czech only).
+The ORTOFON v1 corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the probes. The recordings were acquired in Bohemia, Moravia, and Silesia, and their total length measures almost 103 hours. More quantitative data can be found on the page dedicated to the [[cnk:struktura_ortofon|composition of the corpus]] (in Czech only).
 ===== ORTOFON v1 - corpus balance =====
  
Line 81: Line 79:
 ===== ORTOFON v3 (2024) =====
  
-The 3rd version of the ORTOFON corpus was published in 2024. It includes data from both previous versions. It contains xxx words and captures xxx speakers from all over the Czech Republic in xxx recordings, made between 2012 and 2020 and lasting xxx minutes.
+The 3rd version of the ORTOFON corpus was published in 2024. It contains 110 127 word forms and captures 1 121 speakers from all over the Czech Republic in 697 recordings, made between 2012 and 2020, totaling 243 hours. It also includes data from both previous versions of the corpus. Like the second version, this version is not balanced either. The transcription at the orthographic and phonetic level as well as the corresponding audio recording are available in the KonText corpus interface. For this version, a number of inconsistencies in the transcription have been removed and a number of corrections have been made.
+
+The ORTOFON v3 corpus is automatically **annotated according to the SYN2020 standard**, see [[en:cnk:ortofon#morphological_tagging_of_the_ortofon_corpus|above]] for more details.
 ===== Acknowledgments =====
  
Line 91: Line 91:
  
 <WRAP round tip 70%>
+Lukeš, D. – Kopřivová, M. – Laubeová, Z. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J. – Waclawičová, M. – Benešová, L. – Škarpová, M.: //ORTOFON v3: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2024. Retrieved from: http://www.korpus.cz
+
 Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v2: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from: http://www.korpus.cz
 
 Kopřivová, M. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v1: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz
 
-Komrsková, Z. Kopřivová, M. Lukeš, D. Poukarová, P. Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.
+Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.
 
 Kopřivová M. – Goláňová H. – Klimešová P. – Komrsková Z. – Lukeš D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In //Complex Visibles Out There//. Olomouc: Univerzita Palackého v Olomouci, 529-544.
 
-Kopřivová M. – Goláňová H. – Klimešová P. – Lukeš D.(2014): Mapping Diatopic and Diachronic Variation in Spoken Czech: the ORTOFON and DIALEKT Corpora. In //Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)//. Reykjavík, Iceland, European Language Resources Association, 376-382.
+Kopřivová M. – Goláňová H. – Klimešová P. – Lukeš D. (2014): Mapping Diatopic and Diachronic Variation in Spoken Czech: the ORTOFON and DIALEKT Corpora. In //Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)//. Reykjavík, Iceland, European Language Resources Association, 376-382.
 </WRAP>