Prague Spoken Corpus

Name: PMK
Number of positions (tokens): 819 267
Number of positions (tokens) without punctuation and other marks: 674 992
Number of word forms (words): 49 089
Number of recordings of dialogues: 304
Number of utterances: 15 710
Number of speakers: N/A
Length of recordings (minutes): N/A

The Prague Spoken Corpus (PMK) is the first corpus of spoken Czech. It captures authentic spoken Czech, mainly colloquial and thematically unrestricted, from Prague and its surroundings. Because of Prague's central and unique status, people from all regions of the Czech Republic mix here, so the language picture has, to a large extent, a countrywide character. Prague also has the strongest media influence over the entire country. The recordings (304 in total), which are fully anonymous and were gradually transcribed into electronic form, date from 1988–1996 and thus reflect the language of the end of the previous social era and the beginning of a new one.

The PMK was compiled to cover four sociolinguistic variables in balanced proportions: the speaker's gender, age, education and type of speech, each of which has been reduced to two values for simplicity. Gender is marked MZ (male and female). Age is marked IV (Iunior, Vetus), that is, younger and older, with a lower limit of c. 20 years (the language of adolescents is not yet fully stabilised) and a dividing line at c. 35 years. Education is marked BA (Basis, Altus): the lower level, comprising both elementary and secondary education, and the higher level, that is, university education. The last variable, marked FN, distinguishes formal and informal speech. Formal speech is a monologue made up of a succession of replies to questions asked by the interviewer. To avoid pushing the replies towards either standard or non-standard Czech, the questions were phrased in a mixed standard and non-standard register. The questions concerned broad topics such as school, youth and work, and were neither recorded nor transcribed, since they were identical for all recordings. Informal speech is in fact a dialogue between two speakers who know each other; the topics of their conversations were not set beforehand, and the speakers chose them themselves. The recordings aimed for a proportional balance of the tens of sociolinguistic combinations created this way (of the MIBF, MIAF, MIAN, etc. types), which makes them, in this sense, representative for all the variables. The fact that the replies were not prepared ensures the maximum possible spontaneity of the language used.
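The labelling scheme described above can be illustrated with a short Python sketch. It is only an illustration, not part of the corpus tooling: the function names `decode` and `all_labels` are hypothetical, the English glosses are paraphrases of the text, and the letter order (gender, age, education, type of speech) is an assumption inferred from the example labels MIBF, MIAF and MIAN.

```python
from itertools import product

# Letter codes taken from the description above. The order of the four
# positions is an assumption inferred from the examples MIBF, MIAF, MIAN.
VARIABLES = [
    ("gender", {"M": "male", "Z": "female"}),
    ("age", {"I": "younger (c. 20 to c. 35)", "V": "older (over c. 35)"}),
    ("education", {"B": "lower (elementary or secondary)",
                   "A": "higher (university)"}),
    ("speech", {"F": "formal", "N": "informal"}),
]


def decode(label):
    """Split a four-letter label such as 'MIBF' into its variable values."""
    if len(label) != len(VARIABLES):
        raise ValueError(f"expected {len(VARIABLES)} letters, got {label!r}")
    decoded = {}
    for letter, (name, values) in zip(label.upper(), VARIABLES):
        if letter not in values:
            raise ValueError(f"unknown {name} code {letter!r} in {label!r}")
        decoded[name] = values[letter]
    return decoded


def all_labels():
    """Enumerate every four-letter label the two-valued variables allow."""
    # Iterating over each dict yields its letter codes in insertion order.
    return ["".join(combo)
            for combo in product(*(values for _, values in VARIABLES))]


if __name__ == "__main__":
    print(decode("MIAN"))
    print(len(all_labels()), "single-speaker labels:", " ".join(all_labels()))
```

Four two-valued variables give 2 × 2 × 2 × 2 = 16 labels for a single speaker. How labels are assigned to a dialogue recording with two speakers is not specified in the text, so the sketch does not attempt to model it.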

The transcription method, which is apparent in the text, aimed to capture the spoken language as faithfully and comprehensibly as possible, though not in the manner of a dialectological transcription. As a result, the transcripts naturally show more variation and sometimes even the individual approaches of the transcribers.

The authors of the PMK are, to varying degrees, mainly Anna Adamovičová, František Čermák, Jiří Pešička, Josef Šimandl, Jitka Šonková, Petr Savický and Zdena Smetanová from the Faculty of Arts, Charles University; many students also helped with the recordings.

František Čermák (the project head, Prague 2001)

Citing PMK

Čermák, F. – Adamovičová, A. – Pešička, J.: PMK (Pražský mluvený korpus): přepisy nahrávek pražské mluvy z 90. let 20. století. Ústav Českého národního korpusu FF UK, Praha 2001. Available on-line:

See also