The Totalita corpus is a diachronic corpus of written Czech covering the period of the communist regime (1948–1989), which served as the material base for the Slovník komunistické totality (Dictionary of communist totalitarianism).
The corpus was taken over from the CD accompanying the dictionary; neither the metadata nor the lemmatization and morphological mark-up have been changed. As a result, the annotation does not conform to the current annotation standards of the CNC corpora; on the other hand, this made it possible to preserve the results of the demanding manual lemmatization carried out for the dictionary.
| Name | Totalita | |
|---|---|---|
| Positions | Number of positions (tokens) | 15 350 741 |
| | Number of positions without punctuation | 12 909 992 |
| Structural attributes | Number of documents <doc> | 490 |
| | Number of sentences <s> | 813 311 |
| Composition | Rudé právo 1952 (tokens) | 4 410 585 |
| | Rudé právo 1969 (tokens) | 3 603 645 |
| | Rudé právo 1977 (tokens) | 2 576 895 |
| | Other publications (tokens) | 4 759 616 |
| Publication date | 2010 | |
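
The composition figures in the table can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of any CNC tooling: the numbers are copied from the table, while the variable names and the `share` helper are illustrative assumptions.

```python
# Token counts copied from the table above.
composition = {
    "Rudé právo 1952": 4_410_585,
    "Rudé právo 1969": 3_603_645,
    "Rudé právo 1977": 2_576_895,
    "Other publications": 4_759_616,
}
total_tokens = 15_350_741
sentence_count = 813_311


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole` (hypothetical helper)."""
    return 100 * part / whole


# The four components should add up exactly to the reported corpus size.
assert sum(composition.values()) == total_tokens

# Tokens contributed by the three Rudé právo periods together.
rude_pravo_tokens = sum(
    count for name, count in composition.items() if name.startswith("Rudé právo")
)

# Average sentence length in tokens (punctuation included).
mean_sentence_length = total_tokens / sentence_count

for name, count in composition.items():
    print(f"{name}: {share(count, total_tokens):.1f} %")
print(f"Rudé právo in total: {share(rude_pravo_tokens, total_tokens):.1f} %")
print(f"Mean sentence length: {mean_sentence_length:.1f} tokens")
```

The check confirms that the four components sum to the reported total and that the three Rudé právo periods together account for about 69 % of all positions.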
The data sources consist of two types of texts:
1. the Rudé právo daily: 400 files with about 10 million positions in total (roughly 2/3 of the corpus), drawn from three periods (1952, 1969, and 1977);
2. books and other printed matter.
The Totalita corpus was not intended to cover all 41 years of the communist regime; the texts were selected according to specific criteria. The selection is based on three historically significant periods, represented by quarters of the Rudé právo daily from 1952, 1969, and 1977, and largely also on a temporally corresponding selection of books and other printed matter. Moreover, the corpus does not represent the entire discourse of the time, as it contains only its public, official, propaganda-driven part. It thus captures the typical and dominant lexicon of totalitarianism, which F. Čermák called the V-language, i.e., "the language of communist rulers": expressions specific in terms of ideology and politics, and especially propaganda (milice, kádrovat, uliční výbor), which made up the content of all kinds of printed materials published at the time. However, the so-called O-language, i.e., the "language of the ruled" (esenbák, mukl, vekslovat), is completely absent.
References
Čermák, F.: Slovník komunistické totality: lexémy, nominace a jejich užití. In: Čermák, F. – Cvrček, V. – Schmiedtová, V. (eds) (2010): Slovník komunistické totality. Praha: NLN, pp. 16–39.
Čermák, F.: Jazyk totality a dneška: jak odráží realitu a ovlivňuje lidské vědomí. Language of Totalitarianism and of Today: How it Reflects Reality and Influences Human Consciousness. In: Jazyk v politických, ideologických a interkultúrnych vzťahoch. Sociolinguistica Slovaca 8. Veda, Bratislava 2015, pp. 50–60.
Skoumalová, H. – Bartoň, T. – Cvrček, V. – Hnátková, M. – Kocek, J.: Totalita: korpus jazyka komunistické totality. Ústav Českého národního korpusu FF UK, Praha 2010. Available at: http://www.korpus.cz