Corpus of Academic Czech

The Corpus of Academic Czech complements the Phrase Bank of Academic Czech. It includes only untranslated Czech-language texts published after 2010 in scientific journals indexed in Web of Science or Scopus or, in some cases, in EBSCO databases. Genre is a further criterion: only studies and review articles are included, not, for example, reviews or conference reports. In most cases, the texts are in a pre-final state, i.e. they have not undergone final editing or proofreading. The corpus contains articles from 21 Czech-language scientific journals, and all six broad fields of the Frascati Manual classification are represented. The table below gives the exact composition of the corpus. Social sciences and humanities predominate because relatively few Czech-language scientific articles are published in other disciplines.

Field	Title	Word count
1. Natural sciences		1 951 029
	Geografie	733 885
	Chemické listy	1 217 144
2. Engineering and technology		534 739
	Paliva	534 739
3. Medical and health sciences		1 811 902
	Cor et Vasa	643 254
	Česká a slovenská neurologie a neurochirurgie	1 168 648
4. Agricultural and veterinary sciences		406 257
	Zprávy lesnického výzkumu	406 257
5. Social sciences		5 120 839
	Československá psychologie	856 683
	Český lid	778 212
	Obrana a strategie	309 725
	Orbis scholae	578 303
	Revue církevního práva	665 229
	Sociologický časopis	1 053 680
	Studia paedagogica	673 108
	Vojenské rozhledy	205 899
6. Humanities and the arts		5 434 650
	Archeologické rozhledy	1 289 072
	Cornova	304 773
	Česká literatura	1 446 707
	Musicologica Brunensia	455 712
	Památky archeologické	409 157
	Slovo a slovesnost	760 468
	Studia theologica	768 761
Total		15 259 416
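
The predominance of social sciences and humanities can be checked directly from the field subtotals in the table. The following is a minimal sketch; the names FIELD_WORDS and field_shares are illustrative and are not part of any corpus tooling.

```python
# Field subtotals (word counts) copied from the table above.
FIELD_WORDS = {
    "Natural sciences": 1_951_029,
    "Engineering and technology": 534_739,
    "Medical and health sciences": 1_811_902,
    "Agricultural and veterinary sciences": 406_257,
    "Social sciences": 5_120_839,
    "Humanities and the arts": 5_434_650,
}


def field_shares(words_by_field):
    """Return each field's share of the total word count, in percent."""
    total = sum(words_by_field.values())
    return {field: 100 * n / total for field, n in words_by_field.items()}


if __name__ == "__main__":
    shares = field_shares(FIELD_WORDS)
    for field, share in shares.items():
        print(f"{field}: {share:.1f} %")
    combined = shares["Social sciences"] + shares["Humanities and the arts"]
    print(f"Social sciences + humanities: {combined:.1f} %")
```

With these figures, social sciences and humanities together account for roughly 69 % of the words in the corpus.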

The corpus comprises more than 15 million words (almost 20 million tokens) in 3,394 scientific articles. Its technical processing follows that of the SYN series corpora. The main difference from the SYN series is that documents here correspond to individual articles, not to journal issues. In addition, documents (articles) are further divided into sections (<div>) corresponding to sections of the text, each with an explicit class attribute that takes one of the values introduction, discussion, conclusion or unknown. This segmentation was obtained by heuristic procedures and is therefore not always reliable. Metadata (authors, article title, issue, year of publication, etc.) are available for all documents and have undergone extensive manual revision. The lemmatization and morphological tagging of the corpus correspond to SYN2020.
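
To illustrate the document and section structure described above, the sketch below counts tokens per section class in a small, made-up fragment. The one-token-per-line layout, the <doc> element, its id attribute and the sample text are assumptions for illustration only; the actual corpus markup may differ, and the function name count_tokens_by_section is hypothetical.

```python
import re
from collections import Counter

# Hypothetical fragment: one token per line, sections wrapped in <div class="...">.
# The real corpus format and attribute names may differ.
SAMPLE = """<doc id="example-1">
<div class="introduction">
Úvod
textu
</div>
<div class="unknown">
Metody
</div>
<div class="conclusion">
Závěr
</div>
</doc>"""

DIV_OPEN = re.compile(r'<div\s+class="([^"]+)">')


def count_tokens_by_section(vertical):
    """Count token lines inside each <div>, keyed by the div's class value."""
    counts = Counter()
    current = None  # class of the section we are currently inside, if any
    for raw in vertical.splitlines():
        line = raw.strip()
        match = DIV_OPEN.fullmatch(line)
        if match:
            current = match.group(1)
        elif line == "</div>":
            current = None
        elif current is not None and line and not line.startswith("<"):
            counts[current] += 1
    return counts


if __name__ == "__main__":
    for section, n in count_tokens_by_section(SAMPLE).items():
        print(section, n)
```

Because the section classes were assigned heuristically, counts obtained this way would be approximate, and a large share of text may fall under unknown.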

The team of authors would like to thank the editors of the journals included in the corpus, without whose support the Corpus of Academic Czech could not have been created.

How to cite the Corpus of Academic Czech

Vondřička, P. – Kaderka, P. – Hoffmannová, J. – Homoláč, J. – Kocek, J. – Kopecký, J. – Křen, M. – Sherman, T.: Korpus akademické češtiny, verze 1 z 20. 11. 2023. Praha: Ústav Českého národního korpusu FF UK – Ústav pro jazyk český AV ČR, Praha 2023. Dostupný z WWW: http://www.korpus.cz

Homoláč, J. – Křen, M. – Kašpárková, A. – Etchegoyen Rosolová, K. – Hoffmannová, J. – Kaderka, P. – Kopecký, J. – Sherman, T. – Vondřička, P.: Akademické psaní a frázové banky. Slovo a slovesnost 84(4), 2023, 303-321. https://doi.org/10.58756/s4348418.