Totalita: Corpus of totalitarian language

The Totalita corpus is a diachronic corpus of written Czech covering the period of the communist regime (1948–1989), which served as the material base for the Slovník komunistické totality (Dictionary of communist totalitarianism).

The corpus was taken over from the CD accompanying the dictionary, and neither the metadata nor the lemmatization and morphological mark-up have been changed. As a result, the annotation does not conform to the current annotation standards of the CNC corpora; on the other hand, this approach preserved the results of the demanding manual lemmatization that had been carried out for the dictionary.

Name		Totalita
Positions	Number of positions (tokens)	15 350 741
Positions	Number of positions without punctuation	12 909 992
Structural attributes	Number of documents <doc>	490
Structural attributes	Number of sentences <s>	813 311
Composition	Rudé právo 1952 (tokens)	4 410 585
	Rudé právo 1969 (tokens)	3 603 645
	Rudé právo 1977 (tokens)	2 576 895
	Other publications (tokens)	4 759 616
Publication date		2010

Composition of the Totalita corpus

The data sources consist of two types of texts:

1. the Rudé právo daily: a total of 400 files, about 10 million positions (2/3 of the total volume), from three periods:

year 1952 (last two quarters): 6 June to 31 December 1952
year 1969 (second quarter): 1 April to 30 June 1969
year 1977 (first quarter): 3 January to 31 March 1977

2. books and printed matter:

91 books, totaling approximately 5 million positions (1/3 of the total volume), from 1952 (23 books), 1969 (10 books), and 1977 (58 books)
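
The rounded proportions above (about 2/3 and 1/3) can be checked against the exact token counts in the summary table. The following is a minimal sketch, not part of the corpus tooling: the variable and function names are illustrative assumptions, and only the numbers are taken from the table.

```python
# Token counts copied from the corpus summary table.
TOTAL_TOKENS = 15_350_741

composition = {
    "Rudé právo 1952": 4_410_585,
    "Rudé právo 1969": 3_603_645,
    "Rudé právo 1977": 2_576_895,
    "Other publications": 4_759_616,
}


def shares(parts, total):
    """Return each component's fraction of the total (hypothetical helper)."""
    return {name: count / total for name, count in parts.items()}


# The four components should add up to the reported corpus size.
component_sum = sum(composition.values())

# Combine the three newspaper samples to compare with the "2/3" figure.
rude_pravo_tokens = sum(
    count for name, count in composition.items() if name.startswith("Rudé právo")
)
other_tokens = composition["Other publications"]

print("Components sum to total:", component_sum == TOTAL_TOKENS)
for name, fraction in shares(composition, TOTAL_TOKENS).items():
    print(f"{name}: {fraction:.1%}")
print(f"Rudé právo combined: {rude_pravo_tokens / TOTAL_TOKENS:.1%}")
```

The components add up exactly to the reported total, and the three Rudé právo samples together make up roughly 69% of the corpus, which is consistent with the rounded "2/3" and "1/3" figures.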

The Totalita corpus was not intended to cover all 41 years of the communist regime; it is a selection based on specific criteria. It draws on three historically significant periods, represented by the quarters of the Rudé právo daily listed above and, to a large extent, by a selection of books and printed matter from the same periods. Moreover, the corpus does not represent the entire discourse of the time, as it contains only its public, official, propaganda-driven part. It thus contains the typical and dominant lexicon of totalitarianism, i.e., expressions specific in terms of ideology and politics, and especially of propaganda (milice, kádrovat, uliční výbor). F. Čermák called this the V-language, i.e., “the language of communist rulers”, and it constituted the content of all kinds of printed materials published at the time. However, the so-called O-language, i.e., the “language of the ruled” (esenbák, mukl, vekslovat), is completely absent.

References

Čermák, F.: Slovník komunistické totality: lexémy, nominace a jejich užití. In: Čermák, F. – Cvrček, V. – Schmiedtová, V. (eds) (2010): Slovník komunistické totality. Praha: NLN, pp. 16–39.

Čermák, F.: Jazyk totality a dneška: jak odráží realitu a ovlivňuje lidské vědomí. Language of Totalitarianism and of Today: How it Reflects Reality and Influences Human Consciousness. In: Jazyk v politických, ideologických a interkultúrnych vzťahoch. Sociolinguistica Slovaca 8. Veda, Bratislava 2015, pp. 50–60.

How to cite the Totalita corpus

Skoumalová, H. – Bartoň, T. – Cvrček, V. – Hnátková, M. – Kocek, J.: Totalita: korpus jazyka komunistické totality. Ústav Českého národního korpusu FF UK, Praha 2010. Available at: http://www.korpus.cz