
Differences

This shows you the differences between two versions of the page.

en:cnk:ortofon [2017/07/06 10:39] – [Poděkování] Veronika Pojarová
en:cnk:ortofon [2020/12/24 00:40] (current) David Lukeš
Line 1: Line 1:
-====== Corpus of informal spoken Czech with multilevel transcription: ORTOFON ====== +====== Corpus of informal spoken Czech with multi-tier transcription: ORTOFON ====== 
-The ORTOFON corpus, with its method of data collection, is a continuation of the corpora of informal spoken Czech from the [[en:cnk:oral|ORAL]] series. Together with the [[en:cnk:dialekt|DIALEKT]] corpus it is one of the first two spoken corpora of the Czech language which have a multilevel transcription. Same as with the corpora of the ORAL series, ORTOFON also collects spontaneous spoken language used in informal situations between speakers who know each other. Similarly as in the corpus [[en:cnk:oral2013|ORAL2013]], the speakers come from all over the Czech Republic and selected sociological data are collected about them.
 +
 +The ORTOFON corpus, with its method of data collection, is a continuation of the corpora of informal spoken Czech from the [[en:cnk:oral|ORAL]] series. Together with the [[en:cnk:dialekt|DIALEKT]] corpus, it is one of the first two spoken corpora of the Czech language which have a multi-tier transcription. Like the corpora of the ORAL series, ORTOFON collects spontaneous spoken language used in informal situations between speakers who know each other. As in the [[en:cnk:oral2013|ORAL2013]] corpus, the speakers come from all over the Czech Republic, and selected sociological data are collected about them.
  
 ORTOFON is also the first corpus to be fully balanced with regard to all the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The corpus is [[en:cnk:lemtag_mluv|lemmatized and morphologically tagged]] in the same manner as the ORAL corpus, and the transcription is linked to the corresponding audio track.
Line 9: Line 10:
  
 <WRAP right 35%> <WRAP right 35%>
-^ <fs medium>Name</fs> | <fs medium>[[en:cnk:ortofon|ORTOFON]]</fs> |
 +^ <fs medium>Name</fs> | <fs medium>[[en:cnk:ortofon|ORTOFON]]•v1</fs> |
 ^ Number of [[en:pojmy:token|positions (tokens)]] |  1 236 508 |
 ^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  1 014 786 |
Line 20: Line 21:
  
 ===== Corpus composition and data collection =====
-The ORTOFON corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the probes. The recordings were acquired in Bohemia, Moravia and Silesia, and their total length measures almost 103 hours. More quantitative data can be found on the page dedicated to the [[en:cnk:struktura_ortofon|composition of the corpus]]. 
  
-The material was collected in accordance with the [[en:cnk:oral2013#slozeni_korpusu_a_sber_dat|criteria]] concerning the corpora of the ORAL series. Due to the presence of the phonetic level of transcription, a greater emphasis was placed on the sound quality of recordings. The regional origin of the speakers who were included in the corpus is shown in the following map. The borders of the individual dialectal regions have been refined for the ORTOFON and DIALEKT corpora.
 +The ORTOFON corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words (1 236 508 positions in total); 624 different speakers appear in the recordings. The recordings were made in Bohemia, Moravia, and Silesia, and their total length is almost 103 hours. More quantitative data can be found on the page dedicated to the [[cnk:struktura_ortofon|composition of the corpus]] (Czech only).
 + 
 +The material was collected in accordance with the [[en:cnk:oral2013#slozeni_korpusu_a_sber_dat|criteria]] concerning the corpora of the ORAL series. Due to the presence of the phonetic transcription tier, a greater emphasis was placed on the sound quality of recordings. The regional origin of the speakers who were included in the corpus is shown in the following map. The borders of the individual dialectal regions have been refined for the ORTOFON and DIALEKT corpora.
  
 [{{:cnk:ortofon:map.png?600 | Relative representations of speakers from various parts of the Czech Republic (number of speakers according to place of birth).}}]
Line 29: Line 31:
  
 ===== Corpus balance =====
-From the very beginning of data collection, special care was taken to achieve the maximum possible speaker variability with regard to dialectal regions. Over the course of the collection process, the material was adjusted in order to achieve a balanced corpus within the four basic sociolinguistic categories: gender, age, level of education and the dialectal region in which the speaker spent the majority of the first 15 years of his life. The first three categories, i.e. gender, age, education, were assigned binary values (see picture), while the fourth category was divided into ten groups i.e. ten dialectal regions. The following picture displays the distribution of the binary categories within one dialectal region. Each region should therefore contain the same number of words from men and women, from speakers of ages 18-34 years and those over 35 years, and from speakers with a high school education and those with a university education.+ 
 +From the very beginning of data collection, special care was taken to achieve the maximum possible speaker variability with regard to dialectal regions. Over the course of the collection process, the material was adjusted in order to achieve a balanced corpus within the four basic sociolinguistic categories: gender, age, level of education and the dialectal region in which the speaker spent the majority of the first 15 years of their life. The first three categories, i.e. gender, age and education, were assigned binary values (see picture), while the fourth category was divided into ten groups, i.e. ten dialectal regions. The following picture displays the distribution of the binary categories within one dialectal region. Each region should therefore contain the same number of words from men and women, from speakers of ages 18-34 years and those over 35 years, and from speakers with a high school education and those with a university education.
  
 [{{:cnk:ortofon-vysece.png?400 | The distribution of binary sociolinguistic categories for one dialectal region. }}]
  
 The basic concept was the same proportional representation of the sociolinguistic categories listed above, applied to the collection of material for all of the ČNK spoken corpora. Taking into account the target corpus size (1 000 000 words), the target for each of the 80 cells given by the combination of four variables, gender (2) × age (2) × education (2) × dialectal region of residence up to the age of 15 years (10), was set at 12 500 words.
 In an effort to achieve the highest possible speaker variability within each category, a minimum of five different speakers was set ((Feagin, C. (2002). Entering the community: Fieldwork. In Chambers, J. K., Trudgill, P. and Schilling-Estes, N., editors, //The Handbook of Language Variation and Change//, 20–39. Blackwell Publishing, Malden, MA.)). The aim of this provision is to limit the influence of idiolect.
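
The per-category target quoted above follows from simple arithmetic. As a minimal sketch (the variable names are illustrative and not part of any ČNK tooling), the four variables define 2 × 2 × 2 × 10 = 80 cells, and dividing the target corpus size among them gives the quota per cell. Assuming each speaker belongs to exactly one cell, the five-speaker minimum also implies a lower bound on the total number of speakers:

```python
from math import prod

# Figures taken from the text above; the names are illustrative assumptions.
TARGET_CORPUS_SIZE = 1_000_000  # target corpus size in words
LEVELS = {"gender": 2, "age": 2, "education": 2, "dialectal_region": 10}
MIN_SPEAKERS_PER_CELL = 5  # minimum number of different speakers per cell

# Number of cells = product of the number of values of each variable.
n_cells = prod(LEVELS.values())

# Equal share of the target size for every cell.
words_per_cell = TARGET_CORPUS_SIZE // n_cells

# Lower bound on distinct speakers, assuming one cell per speaker.
min_speakers = n_cells * MIN_SPEAKERS_PER_CELL

print(n_cells, words_per_cell, min_speakers)
```

The same computation makes it easy to see how the quota would change if a variable were given more values, for example a third age group.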
  
 ===== Differences between the ORAL and ORTOFON corpora =====
-  * **Multilevel transcription**: The transcription of spoken language in the ORTOFON corpus was carried out on two levels: **orthographic** and **phonetic**. The orthographic level serves primarily to ease the understanding of and orientation in the recorded conversation, whereas the phonetic level captures the actual realization of the utterance with the aid of a phonetic transcription. These two levels are supplemented by an additional **metalanguage** level, which captures the accompanying sounds produced by the speakers (e.g. laughter, coughing) or the present surroundings with a possible influence on the conversation (e.g. the sound of a telephone ringtone can lead to an interruption of the conversation). For more visit [[en:cnk:ortofon:pravidla|transcription principles]].+ 
 +  * **Multi-tier transcription**: The transcription of spoken language in the ORTOFON corpus was carried out on two tiers: **orthographic** and **phonetic**. The orthographic tier serves primarily to ease the understanding of and orientation in the recorded conversation, whereas the phonetic tier captures the actual realization of the utterance with the aid of a phonetic transcription. These two tiers are supplemented by an additional **metalanguage** tier, which captures the accompanying sounds produced by the speakers (e.g. laughter, coughing) or the present surroundings with a possible influence on the conversation (e.g. the sound of a telephone ringtone can lead to an interruption of the conversation).
   * **Pause punctuation based on pause length**: Part of the [[en:cnk:oral|ORAL]] series, specifically ORAL2013 and ORAL-Z, uses pause punctuation based on an intuitive distinction between shorter and longer pauses, judged relative to the speech rate of the specific speaker. In the ORTOFON corpus, three types of pauses are distinguished based on temporal criteria: divides (less than 120 ms), pauses (120 ms – 2 s), long pauses (longer than 2 s).
-  * **Fully balanced corpus**:  In the ORTOFON corpus, each combination of the four sociolinguistic variables is represented by a group of the same size; compare this to [[en:cnk:oral2013#co_ma_oral2013_s_korpusy_oral2006_a_oral2008_spolecneho|ORAL2013]].
 +  * **Full balance**: In the ORTOFON corpus, each combination of the four sociolinguistic variables is represented by a group of the same size (cf. [[en:cnk:oral2013#co_ma_oral2013_s_korpusy_oral2006_a_oral2008_spolecneho|ORAL2013]]).
-  * **Varied representation of speakers from all over the Czech Republic**: The demarcation of the individual dialectal regions is based on the dialect divisions used in [[http://cja.ujc.cas.cz/cja.html|Czech language atlas]], however the borders have been further refined (see [[en:cnk:dialekt#mapa_narecnich_oblasti_cr| the map of dialectal regions]]). During the process of data collection, care was taken to achieve the variability of both the speakers and the municipalities from which they come.
 +  * **Varied representation of speakers from all over the Czech Republic**: The demarcation of the individual dialectal regions is based on the dialect divisions used in the [[http://cja.ujc.cas.cz/cja.html|Czech language atlas]]; however, the borders have been further refined (see [[en:cnk:dialekt#mapa_narecnich_oblasti_cr| the map of dialectal regions]]). During the process of data collection, care was taken to achieve the variability of both the speakers and the municipalities from which they come.
   * **Extended segment for listening**: The segment of each separate transcript can be as long as 25 words, which improves the experience of listening to the audio segment.
-  * **Alternative way of marking overlaps**: Overlaps in the transcript are marked with square brackets and are not divided in the audio so that they can be heared better, compared to [[en:cnk:oral2013|ORAL2013]]. In the KonTextu corpus manager they are displayed as [[en:pojmy:atributy_strukturni#strukturni_atributy_mluvenych_korpusu|structural attributes]] (for more see [[en:kurz:hledani_ortofon| searching in ORTOFON]]).
 +  * **Alternative way of marking overlaps**: Overlaps in the transcript are marked with square brackets and are not divided in the audio so that they can be heard better (cf. [[en:cnk:oral2013|ORAL2013]]).
-  * **Audio availability**: The entire ORTOFON corpus is linked with audio tracks, so it is possible to listen to the given concordance (for the corpus [[en:cnk:oral|ORAL]] this only applies to the ORAL-Z and ORAL2013 sections).
 +  * **Availability of audio**: The entire ORTOFON corpus is linked with audio tracks, so it is possible to listen to the given concordance (for the corpus [[en:cnk:oral|ORAL]] this only applies to the ORAL-Z and ORAL2013 sections).
-  * **New metainformation**: The scope of metainformation collected regarding the recording and the individual speakers has been extended. For more information see the [[en:pojmy:atributy_strukturni#strukturni_atributy_korpusu_ortofon|overview of structural attributes]].
 +  * **New metainformation**: The scope of meta information collected regarding the recording and the individual speakers has been extended.
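
The temporal criteria for pauses listed above can be expressed as a small classifier. This is only a sketch: the function name, the labels, and the treatment of the exact boundary values (120 ms and 2 s are both counted as plain pauses) are assumptions read off the wording of the list, not the corpus's actual tooling.

```python
def classify_pause(duration_ms: float) -> str:
    """Classify a silent interval by its length in milliseconds.

    Thresholds follow the description above: divides are shorter than
    120 ms, pauses last from 120 ms to 2 s, long pauses exceed 2 s.
    The function name and return labels are hypothetical.
    """
    if duration_ms < 0:
        raise ValueError("duration must be non-negative")
    if duration_ms < 120:
        return "divide"
    if duration_ms <= 2000:
        return "pause"
    return "long pause"


for duration in (50, 120, 800, 2000, 2500):
    print(duration, classify_pause(duration))
```

Because the categories are defined purely by duration, the same rule applies to every speaker, which is the contrast with the speech-rate-based intuitive marking described for ORAL2013 and ORAL-Z.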
  
 +<WRAP right 35%>
 +^ <fs medium>Name</fs> | <fs medium>[[en:cnk:ortofon|ORTOFON]]•v2</fs> |
 +^ Number of [[en:pojmy:token|positions (tokens)]] |  2 560 590 |  
 +^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  2 101 214 |
 +^ Number of [[en:pojmy:word|word forms (words)]] |  101 502 |  
 +^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|conversations recorded]] |  615 |
 +^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|utterances]] |  360 248 |
 +^ Number of unique (different) speakers |  960 |  
 +^ Length of recordings [hh:mm:ss.ms] |  210:09:35.155 |  
 +</WRAP>
 +
 +===== Version 2 (2020) =====
 +
 +In 2020, a new version of the corpus was published, featuring recordings from 2012 to 2019. Unlike the original version, this new one is **not balanced** in any way; its purpose is to provide access to as much of the collected material as possible. Although the collection of informal dialogues is ongoing and some of the older material is still being processed for publication, the new version already contains about twice as much data as the previous one (2 560 590 vs. 1 236 508 positions).
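
As a rough check of the size comparison, the figures reported on this page for the two versions can be compared directly. The snippet is an illustrative sketch: the dictionary layout and key names are assumptions, while the numbers are those given on this page ("words" stands for positions without punctuation, hesitations and interjections).

```python
# Figures as reported on this page; key names are illustrative assumptions.
v1 = {"positions": 1_236_508, "words": 1_014_786, "speakers": 624}
v2 = {"positions": 2_560_590, "words": 2_101_214, "speakers": 960}

# Growth factor of version 2 relative to version 1 for each measure.
ratios = {key: v2[key] / v1[key] for key in v1}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

The number of tokens roughly doubles, while the number of unique speakers grows more slowly.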
 +
 +Apart from this, version 2 features many small improvements in the consistency of the transcription and in the annotation of the corpus.
 +
 +===== Acknowledgments =====
 +
 +We thank all our collaborators who took part in the collection, transcription, and proofreading of the recordings. 
 +
 +In particular, we would like to thank the transcription coordinators: PhDr. Ilona Adámková, Mgr. Vendula Hálková, Dr. Dana Hlaváčková, Mgr. Lenka Klatovská, Mgr. Anna Marklová, PhDr. Eva Pasáčková, Mgr. Pavla Smolová, Marika Svojanovská, Mgr. Pavel Šturm, Dr. Miloslav Vondráček and Mgr. Lenka Zábojová.
  
-===== Acknowledgments===== +===== How to cite =====
-We thank all our collaborators who took part in the collection, transcription and proofreading of the recordings. +
  
-Namely, we would like to especially thank the transcription coordinators: PhDr. Ilona Adámková, Mgr. Vendula Hálková, PhDr. Dana Hlaváčková, Mgr. Lenka Klatovská, Mgr. Anna Marklová, PhDr. Eva Pasáčková, Mgr. Pavla Smolová, Marika Svojanovská, Mgr. Pavel Šturm, doc. Miloslav Vondráček and Mgr. Lenka Zábojová. 
-===== Jak citovat ===== 
 <WRAP round tip 70%> <WRAP round tip 70%>
-Kopřivová, M. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2017. Dostupný z WWW: http://www.korpus.cz
 +Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v2: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from: http://www.korpus.cz
 
-Kopřivová M. – Goláňová H. – Klimešová P. – Komrsková Z. – Lukeš D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In //Complex Visibles Out There//. Olomouc: Univerzita Palackého v Olomouci, 529-544.
 +Kopřivová, M. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v1: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz
 
-Kopřivová M. – Goláňová H. – Klimešová P. – Lukeš D. (2014): Mapping Diatopic and Diachronic Variation in Spoken Czech: the ORTOFON and DIALEKT Corpora. In //Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)//. Reykjavík, Iceland, European Language Resources Association, 376-382.
 +Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.
 
 +Kopřivová M. – Goláňová H. – Klimešová P. – Komrsková Z. – Lukeš D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In //Complex Visibles Out There//. Olomouc: Univerzita Palackého v Olomouci, 529-544.
 
 +Kopřivová M. – Goláňová H. – Klimešová P. – Lukeš D. (2014): Mapping Diatopic and Diachronic Variation in Spoken Czech: the ORTOFON and DIALEKT Corpora. In //Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)//. Reykjavík, Iceland, European Language Resources Association, 376-382.
 </WRAP> </WRAP>
  
-===== Související odkazy =====+===== Related links =====
  
 <WRAP round box 72%> <WRAP round box 72%>
-[[cnk:ortofon:pravidla|Pravidla pro přepis nahrávek v korpusu ORTOFON]] • [[kurz:hledani_ORTOFON|Hledání v korpusu ORTOFON]] • [[ORAL]] • [[ORAL2006]] • [[ORAL2008]] • [[ORAL2013]] • [[PMK]] • [[BMK]] • [[SCHOLA2010]] • [[cnk:dialekt|Dialekt]] • [[pojmy:mluveny|Korpus mluveného jazyka]] • [[pojmy:atributy_strukturni#strukturni_atributy_korpusu_rady_oral|Struktura korpusů ORAL]] • [[kurz:hledani_v_mluvenych_korpusech|Hledání v mluvených korpusech]] • [[cnk:lemtag_mluv|Lemmatizace a tagování mluvených korpusů]]+[[ORAL]] • [[ORAL2006]] • [[ORAL2008]] • [[ORAL2013]] • [[PMK]] • [[BMK]] • [[SCHOLA2010]] • [[en:cnk:dialekt|DIALEKT]] • [[en:cnk:lemtag_mluv|Lemmatization and tagging in spoken corpora]]
  </WRAP>  </WRAP>