The Totalita corpus is a diachronic corpus of written Czech covering the period of the communist regime (1948–1989), which served as the material base for the Slovník komunistické totality (Dictionary of communist totalitarianism).

The corpus was taken over from the CD accompanying the dictionary, and neither the metadata nor the lemmatization and morphological mark-up have been changed. As a result, the annotation does not conform to the current annotation standards of the CNC corpora; on the other hand, this approach preserved the results of the demanding manual lemmatization that had been carried out for the dictionary.

Positions
  Number of positions (tokens): 15 350 741
  Number of positions without punctuation: 12 909 992

Structural attributes
  Number of documents <doc>: 490
  Number of sentences <s>: 813 311

Composition
  Rudé právo 1952 (tokens): 4 410 585
  Rudé právo 1969 (tokens): 3 603 645
  Rudé právo 1977 (tokens): 2 576 895
  Other publications (tokens): 4 759 616

Publication date: 2010

Composition of the Totalita corpus

The data sources consist of two types of texts:

1. the Rudé právo daily: 400 files containing about 10 million positions (2/3 of the total volume), drawn from three periods:

  • year 1952 (last two quarters): 6 June to 31 December 1952
  • year 1969 (second quarter): 1 April to 30 June 1969
  • year 1977 (first quarter): 3 January to 31 March 1977

2. books and printed matter:

  • 91 books, totaling approximately 5 million positions (1/3 of the total volume), from 1952 (23 books), 1969 (10 books), and 1977 (58 books)
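The rounded shares quoted above (about 2/3 and 1/3) can be checked against the exact token counts from the statistics table. The following is only a minimal sketch in Python; the dictionary keys and the helper name `shares` are illustrative and are not part of any corpus tooling.

```python
# Token counts copied from the statistics table of the Totalita corpus.
TOKENS = {
    "Rudé právo 1952": 4_410_585,
    "Rudé právo 1969": 3_603_645,
    "Rudé právo 1977": 2_576_895,
    "Other publications": 4_759_616,
}


def shares(counts):
    """Return the total and each source's fraction of that total."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


total, by_source = shares(TOKENS)

# Combined share of the three Rudé právo periods.
rude_pravo = sum(
    share for name, share in by_source.items() if name.startswith("Rudé právo")
)

print(f"Total positions: {total}")
for name, share in by_source.items():
    print(f"{name}: {share:.1%}")
print(f"Rudé právo combined: {rude_pravo:.1%}")
```

The four composition figures add up exactly to the stated 15 350 741 positions, and the newspaper portion comes out slightly above two thirds of the corpus.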

The Totalita corpus was not intended to cover all 41 years of the communist regime; it is a selection made according to specific criteria. It is based on three historically significant periods, represented by the quarters of the Rudé právo daily listed above, and largely also on a selection of books and printed matter from the same periods. Moreover, the corpus does not represent the entire discourse of that time, as it contains only its public, official, propaganda-driven part. It thus contains the typical and dominant lexicon of totalitarianism, i.e., expressions specific in terms of ideology and politics, and especially propaganda (milice, kádrovat, uliční výbor). This is what F. Čermák called the V-language, i.e., the “language of communist rulers”, which made up the content of all kinds of printed materials published at the time. However, the so-called O-language, i.e., the “language of the ruled” (esenbák, mukl, vekslovat), is completely absent.


How to cite the Totalita corpus

