Differences

This shows you the differences between two versions of the page.

--- en:cnk:ortofon [2024/06/18 19:03] – Correct stats (main change: number of speakers in v2 by speaker_id, not nickname) vhorky
+++ en:cnk:ortofon [2026/06/30 12:38] (current) – michalkren
@@ Line 1: / Line 1: @@
 ====== Corpus of informal spoken Czech with multi-tier transcription: ORTOFON ======
-The ORTOFON corpus captures spontaneous spoken language used in informal situations between speakers who know each other. It follows the [[en:cnk:oral|ORAL]] series of informal spoken Czech corpora in its data collection design. The recordings are transcribed in two tiers - orthographic and phonetic. Together with the [[en:cnk:dialekt|DIALEKT]] corpus, these are the first two spoken Czech corpora to have multi-tier transcription. Similar to the [[en:cnk:oral2013|ORAL2013]] corpus, speakers come from all over the Czech Republic and selected sociological information is collected about them. The corpus is lemmatized and morphologically tagged. The transcription is linked to the audio track and the audio can be played back in the KonText corpus interface.
+The ORTOFON corpus captures spontaneous spoken language used in informal situations between speakers who know each other. It follows the [[en:cnk:oral|ORAL]] series of informal spoken Czech corpora in its data collection design. The recordings are transcribed in two tiers - orthographic and phonetic, using the [[https://archive.mpi.nl/tla/elan|ELAN]] tool, developed at the Max Planck Institute for Psycholinguistics, Nijmegen((ELAN (Version 7.1) [Computer software]. (2026). Nijmegen: Max Planck Institute for Psycholinguistics. Retrieved from https://archive.mpi.nl/tla/elan
+)). Together with the [[en:cnk:dialekt|DIALEKT]] corpus, these are the first two spoken Czech corpora to have multi-tier transcription. Similar to the [[en:cnk:oral2013|ORAL2013]] corpus, speakers come from all over the Czech Republic and selected sociological information is collected about them. The corpus is lemmatized and morphologically tagged. The transcription is linked to the audio track and the audio can be played back in the KonText corpus interface.
 The ORTOFON corpus allows us to explore various aspects of spoken language, i.e. lexis, morphology, syntax, pragmatics, dialogue construction. The corpus is not primarily intended for dialectological ((The [[en:cnk:dialekt|DIALEKT]] corpus is intended for this kind of research.)) or phonetic research, even though a simplified phonetic transcription allows us to verify the existence of pronunciation or regional variants, or phenomena related to pronunciation.
@@ Line 9: / Line 10: @@
 <WRAP 45%>
 ^ <fs medium>Name</fs> | <fs medium>[[en:cnk:ortofon|ORTOFON]]•v1</fs> | <fs medium>[[cnk:ortofon|ORTOFON]]•v2</fs> | <fs medium>[[cnk:ortofon|ORTOFON]]•v3</fs> |
-^ Number of [[en:pojmy:token|positions (tokens)]] |  1 236 508 |  2 560 590 |  2 976 740 |
+^ Number of [[en:pojmy:token|positions (tokens)]] |  1 236 508 |  2 560 590 |  2 976 742 |
-^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  1 014 786 |  2 101 214 |  2 445 792 |
+^ Number of [[en:pojmy:token|positions (tokens)]] without punctuation, hesitations and interjections |  1 014 786 |  2 101 214 |  2 445 793 |
 ^ Number of [[en:pojmy:word|word forms (words)]] |  65 294 |  101 500 |  110 127 |
 ^ Number of [[en:pojmy:atributy_strukturni#struktura_korpusu_mluvene_cestiny|conversations recorded]] |  332 |  615 |  697 |
@@ Line 34: / Line 35: @@
 ===== Morphological tagging of the ORTOFON corpus =====
-The ORTOFON v3 corpus is automatically [[en:pojmy:tag|annotated]] with [[en:cnk:syn2020#morphological_tagging|a new morphological tag]] according to the SYN2020 standard. It recognizes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|aggregates]] (e.g., //vidělas//, //zač//), uses [[en:cnk:syn2020|double-level lemmatization]], and has a verb tag ([[en:cnk:syn2020#verb_tagging_verbtag|verbtag]]).
+The ORTOFON v3 corpus is automatically [[en:pojmy:tag|annotated]] with [[en:cnk:syn2020#morphological_tagging|a new morphological tag]] according to the [[en:cnk:anotacni_standard_cnk|unified CNC annotation scheme]]. It recognizes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|aggregates]] (e.g., //vidělas//, //zač//), uses [[en:cnk:syn2020|double-level lemmatization]], and has a verb tag ([[en:cnk:syn2020#verb_tagging_verbtag|verbtag]]).
 Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus. Special variants of words are distinguished by their own sublemma (e.g. //poslúchat// under the lemma //poslouchat//), special forms tagged only in the spoken corpus have the number 9 in the last tag position (e.g. the form //jezdijó// has the tag  ''%%VB-P---3P-AAI-9%%'').
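The rule for the last tag position can be illustrated with a minimal Python sketch. It assumes only what the paragraph above states: a form tagged only in the spoken corpus carries the digit 9 in the final position of its tag. The function name ''is_spoken_only_form'' is hypothetical and is not part of any CNC tool.

```python
def is_spoken_only_form(tag: str) -> bool:
    """Return True if the final position of a positional tag holds '9'.

    Assumption (from the text above): '9' in the last tag position marks
    a special form that is tagged only in the spoken corpus.
    """
    # An empty string has no last position, so it cannot qualify.
    return bool(tag) and tag[-1] == "9"


# Tag of the form "jezdijó", copied from the paragraph above.
example_tag = "VB-P---3P-AAI-9"
print(is_spoken_only_form(example_tag))
```

Such a check could be applied to tags exported from a concordance to separate these special forms from the rest; the other tag positions are not interpreted here.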
@@ Line 79: / Line 80: @@
 ===== ORTOFON v3 (2024) =====
-The 3rd version of the ORTOFON corpus was published in 2024. It contains 110 127 words and captures 1 121 speakers from all over the Czech Republic in 697 recordings, made between 2012 and 2020, totaling 243 hours. It also includes data from both previous versions of the corpus. The transcription at the orthographic and phonetic level as well as the corresponding audio recording are available in the KonText corpus interface. For this version, a number of inconsistencies in the transcription have been removed and a number of corrections have been made.
+The 3rd version of the ORTOFON corpus was published in 2024. It contains 2,445,793 words and captures 1,121 speakers from all over the Czech Republic in 697 recordings, made between 2012 and 2020, totalling 243 hours. It also includes data from both previous versions of the corpus. Like the second version, it is not balanced. The transcription at the orthographic and phonetic level as well as the corresponding audio recording are available in the KonText corpus interface. For this version, a number of inconsistencies in the transcription have been removed and a number of corrections have been made.
 The ORTOFON v3 corpus is automatically **annotated according to the SYN2020 standard**, see [[en:cnk:ortofon#morphological_tagging_of_the_ortofon_corpus|above]] for more details.
@@ Line 91: / Line 92: @@
 <WRAP round tip 70%>
+**Corpus as a language resource**
+Lukeš, D. – Kopřivová, M. – Laubeová, Z. – Poukarová, P. – Horký, V. – Jelínek, T. – Křivan, J. – Waclawičová, M. – Benešová, L. – Škarpová, M.:  //ORTOFON v3: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2024. Retrieved from: http://www.korpus.cz
 Kopřivová, M. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v2: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2020. Retrieved from: http://www.korpus.cz
 Kopřivová, M. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Škarpová, M.: //ORTOFON v1: Korpus neformální mluvené češtiny s víceúrovňovým přepisem//. Ústav Českého národního korpusu FF UK, Praha 2017. Retrieved from: http://www.korpus.cz
+**References**
 Komrsková, Z. – Kopřivová, M. – Lukeš, D. – Poukarová, P. – Goláňová, H. (2017): New Spoken Corpora of Czech: ORTOFON and DIALEKT. //Jazykovedný časopis//, 68(2), 219-228. ISSN 0021-8897.
