Differences
This shows you the differences between two versions of the page.
en:cnk:ortofon [2017/07/06 10:34] – [Differences between the ORAL and ORTOFON corpora] veronikapojarova | en:cnk:ortofon [2017/07/18 14:48] – [Differences between the ORAL and ORTOFON corpora] michalkren | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Corpus of informal spoken Czech with multilevel | + | ====== Corpus of informal spoken Czech with multi-tier |
- | The ORTOFON corpus, with its method of data collection, is a continuation of the corpora of informal spoken Czech from the [[en: | + | The ORTOFON corpus, with its method of data collection, is a continuation of the corpora of informal spoken Czech from the [[en: |
ORTOFON is also the first corpus to be fully balanced regarding all the basic sociolinguistic speaker categories (gender, age group, level of education and region | ORTOFON is also the first corpus to be fully balanced regarding all the basic sociolinguistic speaker categories (gender, age group, level of education and region | ||
Line 20: | Line 20: | ||
===== Corpus composition and data collection | ===== Corpus composition and data collection | ||
- | The ORTOFON corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the probes. The recordings were acquired in Bohemia, Moravia and Silesia, and their total length measures almost 103 hours. More quantitative data can be found on the page dedicated to the [[en:cnk: | + | The ORTOFON corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the probes. The recordings were acquired in Bohemia, Moravia, and Silesia, and their total length measures almost 103 hours. More quantitative data can be found on the page dedicated to the [[cnk: |
- | The material was collected in accordance with the [[en: | + | The material was collected in accordance with the [[en: |
[{{: | [{{: | ||
Line 29: | Line 29: | ||
===== Corpus balance ===== | ===== Corpus balance ===== | ||
- | From the very beginning of data collection, special care was taken to achieve the maximum possible speaker variability with regard to dialectal regions. Over the course of the collection process, the material was adjusted in order to achieve a balanced corpus within the four basic sociolinguistic categories: gender, age, level of education and the dialectal region in which the speaker spent the majority of the first 15 years of his life. The first three categories, i.e. gender, age, education, were assigned binary values (see picture), while the fourth category was divided into ten groups i.e. ten dialectal regions. The following picture displays the distribution of the binary categories within one dialectal region. Each region should therefore contain the same number of words from men and women, from speakers of ages 18-34 years and those over 35 years, and from speakers with a high school education and those with a university education. | + | From the very beginning of data collection, special care was taken to achieve the maximum possible speaker variability with regard to dialectal regions. Over the course of the collection process, the material was adjusted in order to achieve a balanced corpus within the four basic sociolinguistic categories: gender, age, level of education and the dialectal region in which the speaker spent the majority of the first 15 years of his life. The first three categories, i.e. gender, age, education, were assigned binary values (see picture), while the fourth category was divided into ten groups i.e. ten dialectal regions. The following picture displays the distribution of the binary categories within one dialectal region. Each region should, therefore, contain the same number of words from men and women, from speakers of ages 18-34 years and those over 35 years, and from speakers with a high school education and those with a university education. |
[{{: | [{{: | ||
- | The basic concept was the idea of the same proportional representation of the sociolinguistic categories listed above, applied to the collection of material for all of the ČNK spoken corpora. Taking into account the target corpus size (1 000 000 words), the target for every category presented by the combination of four variables - gender(2) × age(2) × education (2) × dialectal region of residence up to the age of 15 years (10) - was set at 12 500 words. | + | The basic concept was the idea of the same proportional representation of the sociolinguistic categories listed above, applied to the collection of material for all of the ČNK spoken corpora. Taking into account the target corpus size (1 000 000 words), the target for every category presented by the combination of four variables - gender(2) × age(2) × education (2) × dialectal region of residence up to the age of 15 years (10) - was set at 12 500 words. |
In the effort to achieve the highest possible speaker variability within the scope of each category, a minimum of five different speakers was set ((Feagin, C. (2002). Entering the community: Fieldwork. Chambers, J. K., Trudgill, P. and Schilling-Estes, | In the effort to achieve the highest possible speaker variability within the scope of each category, a minimum of five different speakers was set ((Feagin, C. (2002). Entering the community: Fieldwork. Chambers, J. K., Trudgill, P. and Schilling-Estes, | ||
===== Differences between the ORAL and ORTOFON corpora ===== | ===== Differences between the ORAL and ORTOFON corpora ===== | ||
- | * **Multilevel | + | * **Multi-tier |
* **Pause punctuation based on pause length**: A section of the [[en: | * **Pause punctuation based on pause length**: A section of the [[en: | ||
- | * **Fully balanced corpus**: In the ORTOFON corpus, each combination of the four sociolinguistic variables is represented by a group of the same size; compare this to [[en: | + | * **Full balance**: In the ORTOFON corpus, each combination of the four sociolinguistic variables is represented by a group of the same size (cf. [[en: |
- | * **Varied representation of speakers from all over the Czech Republic**: The demarcation of the individual dialectal regions is based on the dialect divisions used in [[http:// | + | * **Varied representation of speakers from all over the Czech Republic**: The demarcation of the individual dialectal regions is based on the dialect divisions used in [[http:// |
* **Extended segment for listening**: | * **Extended segment for listening**: | ||
- | * **Alternative way of marking overlaps**: Overlaps in the transcript are marked with square brackets and are not divided in the audio so that they can be heared | + | * **Alternative way of marking overlaps**: Overlaps in the transcript are marked with square brackets and are not divided in the audio so that they can be heard better |
- | * **Audio availability**: The entire ORTOFON corpus is linked with audio tracks, so it is possible to listen to the given concordance (for the corpus [[en: | + | * **Availability of audio**: The entire ORTOFON corpus is linked with audio tracks, so it is possible to listen to the given concordance (for the corpus [[en: |
- | * **New metainformation**: | + | * **New metainformation**: |
- | ===== Poděkování | + | ===== Acknowledgments===== |
- | Děkujeme všem spolupracovníkům, kteří se podíleli na pořízení nahrávek, jejich přepisu a kontrole. | + | We thank all our collaborators who took part in the collection, transcription, and proofreading of the recordings. |
- | Jmenovitě chceme poděkovat především koordinátorům přepisu: PhDr. Iloně Adámkové, Mgr. Vendule Hálkové, PhDr. Daně Hlaváčkové, Mgr. Lence Klatovské, Mgr. Anně Marklové, PhDr. Evě Pasáčkové, Mgr. Pavle Smolové, Marice Svojanovské, Mgr. Pavlu Šturmovi, doc. Miloslavu | + | Namely, we would like to especially thank the transcription coordinators: PhDr. Ilona Adámková, Mgr. Vendula Hálková, Dr. Dana Hlaváčková, Mgr. Lenka Klatovská, Mgr. Anna Marklová, PhDr. Eva Pasáčková, Mgr. Pavla Smolová, Marika Svojanovská, Mgr. Pavel Šturm, Dr. Miloslav |
- | ===== Jak citovat | + | ===== How to cite ===== |
<WRAP round tip 70%> | <WRAP round tip 70%> | ||
- | Kopřivová, | + | Kopřivová, |
Kopřivová M. – Goláňová H. – Klimešová P. – Komrsková Z. – Lukeš D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In //Complex Visibles Out There//. Olomouc: Univerzita Palackého v Olomouci, 529-544. | Kopřivová M. – Goláňová H. – Klimešová P. – Komrsková Z. – Lukeš D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In //Complex Visibles Out There//. Olomouc: Univerzita Palackého v Olomouci, 529-544. | ||
Line 62: | Line 62: | ||
</ | </ | ||
- | ===== Související odkazy | + | ===== Related links ===== |
<WRAP round box 72%> | <WRAP round box 72%> | ||
- | [[cnk: | + | [[ORAL]] • [[ORAL2006]] • [[ORAL2008]] • [[ORAL2013]] • [[PMK]] • [[BMK]] • [[SCHOLA2010]] • [[en:cnk: |
</ | </ |