Václav Klaus Corpus
The Václav Klaus Corpus ('VK') is an author corpus of texts by Václav Klaus, created as the data basis for the thesis Václav Klaus’ Idiolect: A Corpus-based Analysis. The data were sourced from his official website, which contains texts written primarily for the website as well as texts originally published elsewhere (e.g. newspaper articles or magazine interviews) or created for specific events (e.g. presidential speeches or conference lectures).
In addition to Klaus’ own texts, the website also contains texts of which Václav Klaus is only a co-author (e.g. joint statements) or not an author at all (e.g. communications from the press department of the presidential office). However, the 'VK' corpus is an author corpus in the narrower sense and therefore does not include these texts. For many texts, especially a considerable portion of the interviews, the mode (written or spoken) cannot be reliably determined. In the case of the spoken texts (the debates and some interviews), the situation is complicated by apparent editorial modifications of Klaus’ speech, the extent and nature of which vary considerably from text to text. To preserve the authenticity of the linguistic material, the corpus does not contain texts whose mode could not be clearly identified, nor does it include ‘purely’ spoken texts. The texts selected for the corpus meet the following four conditions:
- only texts published on the website www.klaus.cz;
- only texts whose sole (listed) author is Václav Klaus;
- only texts in the written and written-to-be-spoken (i.e. texts originally written but intended to be spoken) mode;
- only texts published up to and including 31st October 2023.
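The four conditions can be read as a simple filter over candidate texts. The sketch below is illustrative only: the `Candidate` record, its field names and the mode labels "spoken" and "unknown" are hypothetical, and nothing here is meant to describe how the corpus was actually compiled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

# Hypothetical record for a candidate text; field names are illustrative only.
@dataclass
class Candidate:
    source_site: str          # website the text was published on
    authors: Tuple[str, ...]  # listed authors
    mode: str                 # e.g. "written", "written-to-be-spoken", "spoken", "unknown"
    published: date           # date of publication

CUTOFF = date(2023, 10, 31)   # "up to and including 31st October 2023"
ALLOWED_MODES = {"written", "written-to-be-spoken"}

def is_selected(text: Candidate) -> bool:
    """Return True if the candidate satisfies all four selection conditions."""
    return (
        text.source_site == "www.klaus.cz"
        and text.authors == ("Václav Klaus",)   # sole listed author
        and text.mode in ALLOWED_MODES
        and text.published <= CUTOFF
    )
```

A joint statement, a spoken debate or a text published in November 2023 would each fail exactly one of the four checks.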
The entire corpus consists of 2,313 documents, which have been assigned the following 14 structural attributes:
structural attribute | description (values) |
---|---|
doc.id | document identification name (various values) |
doc.title | original title (various values) |
doc.lang | language (Czech) |
doc.src_lang | source language (Czech) |
doc.author | author (Václav Klaus) |
doc.pubDateYear | year of publication (values in the range 1995–2023) |
doc.date | date of publication (various values) |
doc.period | office held by Klaus at the time of publishing (Prime Minister, MP, Speaker of the Chamber of Deputies, President, ex-president) |
doc.modus | mode (written, written-to-be-spoken) |
doc.registr | register (documents, professional literature, journalism, public speeches) |
doc.txtype | text type or genre (various values) |
doc.medium | type of the text source (various values) |
doc.source | text source (various values) |
doc.comment | commentary or additional information (various values) |
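Several of the structural attributes have closed value sets, so a document's metadata can be checked against the table. The following is a minimal sketch, assuming the metadata are available as a plain dictionary keyed by attribute name and that the English labels listed in the table are the stored values; the actual corpus may encode them differently.

```python
# Closed value sets copied from the table; the English labels are assumed
# to stand in for whatever values the corpus actually stores.
ALLOWED = {
    "doc.lang": {"Czech"},
    "doc.src_lang": {"Czech"},
    "doc.author": {"Václav Klaus"},
    "doc.period": {"Prime Minister", "MP", "Speaker of the Chamber of Deputies",
                   "President", "ex-president"},
    "doc.modus": {"written", "written-to-be-spoken"},
    "doc.registr": {"documents", "professional literature", "journalism",
                    "public speeches"},
}
YEAR_RANGE = range(1995, 2024)  # 1995–2023 inclusive

def validate(meta: dict) -> list:
    """Return the names of attributes whose values fall outside the table."""
    problems = [name for name, allowed in ALLOWED.items()
                if meta.get(name) not in allowed]
    if meta.get("doc.pubDateYear") not in YEAR_RANGE:
        problems.append("doc.pubDateYear")
    return problems
```

Attributes described as having "various values" (such as doc.id or doc.title) are deliberately left unchecked here.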
The corpus size is 1,750,891 tokens in total and 1,475,640 tokens excluding punctuation. All tokens were assigned the following 11 positional attributes during automatic annotation:
positional attribute | type count incl. punctuation | type count excl. punctuation |
---|---|---|
word | 98 892 | 98 830 |
lc | 88 814 | 88 752 |
sforma | 98 896 | 98 831 |
lemma | 36 604 | 36 555 |
lemma_lc | 35 893 | 35 844 |
sublemma | 37 942 | 37 892 |
sublemma_lc | 37 225 | 37 175 |
tag | 2 013 | 2 007 |
pos | 15 | 14 |
case | 8 | 8 |
verbtag | 77 | 77 |
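The difference between the counts with and without punctuation gives the amount of punctuation in the corpus. The short calculation below uses only figures stated on this page; the variable names are illustrative.

```python
# Token counts as stated in the text.
tokens_total = 1_750_891
tokens_no_punct = 1_475_640

punct_tokens = tokens_total - tokens_no_punct   # 275251 punctuation tokens
punct_share = punct_tokens / tokens_total       # roughly 15.7 % of all tokens

# Type counts (incl., excl. punctuation) copied from the table for a few attributes.
types = {
    "word": (98_892, 98_830),
    "lemma": (36_604, 36_555),
    "tag": (2_013, 2_007),
    "pos": (15, 14),
}
# Number of types that are accounted for by punctuation alone.
punct_types = {attr: incl - excl for attr, (incl, excl) in types.items()}

print(punct_tokens)
print(f"{punct_share:.1%}")
print(punct_types)
```

For example, the `pos` attribute differs by exactly one type, which is consistent with punctuation having a single part-of-speech value of its own.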
How to cite
Schmid, O.: Korpus textů Václava Klause. Ústav Českého národního korpusu FF UK, Praha 2024. Available from WWW: <http://www.korpus.cz>
Schmid, O.: Idiolekt Václava Klause: korpusová analýza. Master's thesis. Ústav českého jazyka a teorie komunikace FF UK, Praha 2024. Available from WWW: <http://hdl.handle.net/20.500.11956/191695>