The ORTOFON corpus captures spontaneous spoken language used in informal situations between speakers who know each other. It follows the ORAL series of informal spoken Czech corpora in its data collection design. The recordings are transcribed in two tiers - orthographic and phonetic. Together with the DIALEKT corpus, these are the first two spoken Czech corpora to have multi-tier transcription. Similar to the ORAL2013 corpus, speakers come from all over the Czech Republic and selected sociological information is collected about them. The corpus is lemmatized and morphologically tagged. The transcription is linked to the audio track and the audio can be played back in the KonText corpus interface.
The ORTOFON corpus allows us to explore various aspects of spoken language, such as lexis, morphology, syntax, pragmatics, and dialogue construction. The corpus is not primarily intended for dialectological or phonetic research, even though a simplified phonetic transcription allows us to verify the existence of pronunciation or regional variants, or phenomena related to pronunciation.
The publication of ORTOFON alongside the ORAL corpus gives users the chance to explore informal spoken Czech in the most extensive data collection to date, covering a period of eighteen years (2002-2020).
Name | ORTOFON v1 | ORTOFON v2 | ORTOFON v3 |
---|---|---|---|
Number of positions (tokens) | 1 236 508 | 2 560 590 | 2 976 742 |
Number of positions (tokens) without punctuation, hesitations and interjections | 1 014 786 | 2 101 214 | 2 445 793 |
Number of word forms (words) | 65 294 | 101 500 | 110 127 |
Number of conversations recorded | 332 | 615 | 697 |
Number of utterances | 172 736 | 360 248 | 419 533 |
Number of unique (different) speakers | 625 | 1 020 | 1 121 |
Length of recordings [hh:mm:ss.ms] | 102:41:14.247 | 210:09:35.155 | 243:00:07.232 |
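The recording lengths use an hours:minutes:seconds.milliseconds format in which the hours field exceeds 24, so standard clock-time parsers may reject them. A minimal Python sketch, assuming the values are copied from the table above as strings (the helper names are hypothetical and not part of any ČNK tool), converts them for simple rate calculations:

```python
def parse_count(text: str) -> int:
    """Convert a space-grouped count such as '2 976 742' to an int."""
    return int("".join(text.split()))


def parse_duration_hours(text: str) -> float:
    """Convert 'hh:mm:ss.ms' (hours may exceed 24) to fractional hours."""
    hours, minutes, seconds = text.split(":")
    return int(hours) + int(minutes) / 60 + float(seconds) / 3600


# Values for ORTOFON v3, copied from the table above.
tokens_v3 = parse_count("2 976 742")
hours_v3 = parse_duration_hours("243:00:07.232")

# Average number of positions (tokens) per hour of recording.
tokens_per_hour = tokens_v3 / hours_v3
print(f"{tokens_per_hour:.0f} positions per recorded hour")
```

The same two helpers can be applied to the v1 and v2 columns to compare the versions.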
The corpus captures only informal, spontaneous and natural situations. The material was collected in accordance with the criteria concerning the corpora of the ORAL series:
Due to the presence of the phonetic transcription tier, greater emphasis was placed on the sound quality of the recordings. Selected sociological data about the situation and the speakers were recorded. The recordings capture adult native speakers of Czech from all parts of the Czech Republic. To keep the recordings as authentic as possible, the speakers were mostly not informed about the recording in advance, but only after it had been completed. All recorded speakers agreed to the use of the recordings for the purposes of the ČNK.
The structures and structural attributes of the ORTOFON corpus are described on a separate page (in Czech only).
The ORTOFON v3 corpus is automatically annotated with a new morphological tag according to the SYN2020 standard. It recognizes aggregates (e.g., vidělas, zač), uses double-level lemmatization, and has a verb tag (verbtag).
Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus. Special variants of words are distinguished by their own sublemma (e.g. poslúchat under the lemma poslouchat); special forms tagged only in the spoken corpus have the number 9 in the last tag position (e.g. the form jezdijó has the tag VB-P---3P-AAI-9).
The following specific tags are used in the first tag position (word type):
Tag | Meaning |
---|---|
E | fragments (incomplete words) |
H | nonverbal sounds (e.g. hesitation) |
M | comments by transcribers (in round brackets) |
W | anonymised sections (mainly names) |
Note: The anonymised sections are specified at the basic level (word): NP – surname, NJ – first name, NN – nickname, NM – place name, NO – other proper names, NT – last two digits of the telephone number.
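When working with exported concordance data, the spoken-specific parts of a tag can be read directly from its first and last characters. The following is a minimal sketch, assuming tags are available as plain strings; the function name and the returned keys are hypothetical and do not correspond to any ČNK tool or API.

```python
# First-position codes specific to the spoken corpus (from the table above).
SPOKEN_WORD_TYPES = {
    "E": "fragment (incomplete word)",
    "H": "nonverbal sound",
    "M": "transcriber comment",
    "W": "anonymised section",
}


def describe_tag(tag: str) -> dict:
    """Report the spoken-specific features of a positional tag string."""
    return {
        # None when the first position holds an ordinary word-type code.
        "spoken_word_type": SPOKEN_WORD_TYPES.get(tag[:1]),
        # True when the last position is 9 (form tagged only in the spoken corpus).
        "spoken_only_form": tag.endswith("9"),
    }


# Tag of the form "jezdijó" quoted in the text above.
info = describe_tag("VB-P---3P-AAI-9")
print(info)
```

The sketch deliberately inspects only the first and last positions, since those are the only ones described here.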
The ORAL v1, ORTOFON v1 and ORTOFON v2 corpora are tagged with the previous morphological tagset, which was in use until 2020. Detailed information on the annotation of these previously published corpora can be found on a separate page.
In its first version, published in 2017, the ORTOFON corpus was the first corpus that was fully balanced across all basic sociolinguistic categories of speakers (gender, age group, level of education, and the dialectal region of childhood residence).
The ORTOFON v1 corpus is composed of 332 recordings from the years 2012–2017 and contains 1 014 786 orthographic words, i.e. a total of 1 236 508 positions; a total of 624 different speakers appear in the recordings. The recordings were acquired in Bohemia, Moravia, and Silesia, and their total length is almost 103 hours. More quantitative data can be found on the page dedicated to the composition of the corpus (in Czech only).
From the very beginning of data collection, special care was taken to achieve the maximum possible speaker variability with regard to dialectal regions. Over the course of the collection process, the material was adjusted in order to achieve a balanced corpus within the four basic sociolinguistic categories: gender, age, level of education and the dialectal region in which the speaker spent the majority of the first 15 years of their life. The first three categories, i.e. gender, age and education, were assigned binary values (see picture), while the fourth category was divided into ten groups, i.e. ten dialectal regions. The following picture displays the distribution of the binary categories within one dialectal region. Each region should, therefore, contain the same number of words from men and women, from speakers of ages 18-34 years and those over 35 years, and from speakers with a high school education and those with a university education.
The basic concept was the idea of the same proportional representation of the sociolinguistic categories listed above, applied to the collection of material for all of the ČNK spoken corpora. Taking into account the target corpus size (1 000 000 words), the target for every category represented by the combination of four variables - gender (2) × age (2) × education (2) × dialectal region of residence up to the age of 15 years (10) - was set at 12 500 words. In the effort to achieve the highest possible speaker variability within the scope of each category, a minimum of five different speakers was set. The aim of this provision is to limit the influence of idiolect.
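The per-category target follows directly from the design: 2 × 2 × 2 × 10 gives 80 combinations, and 1 000 000 words divided by 80 gives 12 500 words each. The arithmetic can be checked with a short Python sketch; the category labels below are illustrative placeholders, not the names used in the corpus metadata.

```python
from itertools import product

TARGET_CORPUS_SIZE = 1_000_000  # target size in words, as stated above

# Category values; all labels are illustrative placeholders.
genders = ["female", "male"]
age_groups = ["18-34", "35+"]
education_levels = ["secondary", "university"]
regions = [f"region_{i}" for i in range(1, 11)]  # ten dialectal regions

# One cell per combination of the four sociolinguistic variables.
cells = list(product(genders, age_groups, education_levels, regions))
words_per_cell = TARGET_CORPUS_SIZE // len(cells)

print(len(cells))       # 80
print(words_per_cell)   # 12500
```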
In 2020, a new version of the corpus was published, featuring recordings from 2012 to 2019. Unlike the original version, this new one is not balanced in any way. Its purpose is to provide access to as much of the collected material as possible. While collection of informal dialogues is ongoing, and some of the older material is still being processed for publication, this new version still contains twice as much data as the previous one. Apart from this, version 2 features many small improvements in the consistency of the transcription and in the annotation of the corpus.
The third version of the ORTOFON corpus was published in 2024. It contains 2 445 793 orthographic words (2 976 742 positions, 110 127 distinct word forms) and captures 1 121 speakers from all over the Czech Republic in 697 recordings, made between 2012 and 2020, totaling 243 hours. It also includes data from both previous versions of the corpus. Like the second version, it is not balanced. The orthographic and phonetic transcriptions, together with the corresponding audio recordings, are available in the KonText corpus interface. For this version, a number of inconsistencies in the transcription have been removed and a number of corrections have been made.
The ORTOFON v3 corpus is automatically annotated according to the SYN2020 standard, see above for more details.
We thank all our collaborators who took part in the collection, transcription, and proofreading of the recordings.
In particular, we would like to thank the transcription coordinators: PhDr. Ilona Adámková, Mgr. Vendula Hálková, Dr. Dana Hlaváčková, Mgr. Lenka Klatovská, Mgr. Anna Marklová, PhDr. Eva Pasáčková, Mgr. Pavla Smolová, Marika Svojanovská, Mgr. Pavel Šturm, Dr. Miloslav Vondráček and Mgr. Lenka Zábojová.
Lukeš, D. – Kopřivová, M. – Laubeová, Z. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J. – Waclawičová, M. – Benešová, L. – Škarpová, M.: ORTOFON v3: Korpus neformální mluvené češtiny s víceúrovňovým přepisem. Ústav Českého národního korpusu FF UK, Praha 2024. Retrieved from: http://www.korpus.cz
Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: ORTOFON v2: Korpus neformální mluvené češtiny s víceúrovňovým přepisem. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from: http://www.korpus.cz
Kopřivová, M. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: ORTOFON v1: Korpus neformální mluvené češtiny s víceúrovňovým přepisem. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz
Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. Jazykovedný časopis, 68(2), 219-228. ISSN 0021-8897.
Kopřivová, M. – Goláňová, H. – Klimešová, P. – Komrsková, Z. – Lukeš, D. (2014): Multi-tier Transcription of Informal Spoken Czech: The ORTOFON Corpus Approach. In Complex Visibles Out There. Olomouc: Univerzita Palackého v Olomouci, 529-544.
Kopřivová, M. – Goláňová, H. – Klimešová, P. – Lukeš, D. (2014): Mapping Diatopic and Diachronic Variation in Spoken Czech: the ORTOFON and DIALEKT Corpora. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, European Language Resources Association, 376-382.
ORAL • ORAL2006 • ORAL2008 • ORAL2013 • PMK • BMK • SCHOLA2010 • DIALEKT • Lemmatization and tagging in spoken corpora