Overview of text classification in SYN2015

Texts in the SYN2015 corpus are divided into three main groups (txtype_group):

  1. FIC: fiction
  2. NFC: non-fiction
  3. NMG: newspapers and magazines

Each of these groups makes up one third of all texts in the corpus.

1. Fiction

Changes with regard to the previous SYN series classification:

The new division of fiction (FIC) on the txtype level is:

  1. NOV: prose – novels and novellas
  2. COL: shorter prose – collections of short stories and other shorter prose texts (e.g. essays, blog entries)
  3. VER: poetry – collections of poetry, marginally song lyrics
  4. SCR: drama – theatre plays, marginally also screenplays for film
  5. X: unclassified – works which cannot be clearly assigned to one of the above-mentioned categories (e.g. mixed-genre texts, collections of aphorisms, anecdotes)

2. Non-fiction

The most significant changes compared to the previous SYN series classification:

Non-fiction literature (NFC) on the txtype level is newly divided into:

  1. SCI: scientific literature – scientific texts, including academic publications and university textbooks
  2. PRO: professional literature – texts intended for professionals in a given field, including specialized periodicals (e.g. Logistika, Lékařské listy, Sestra, Zeměměřič, Stavitel, Konstrukce)
  3. POP: popular literature – texts intended for a lay audience with an interest in the field (e.g. Bydlí s námi sladkovodní želva, Botanické zahrady a arboreta České republiky, Praktický houbař)
  4. ADM: administrative texts – rules and regulations, meeting minutes, instructions and guidelines, annual reports, etc.
  5. MEM: memoirs, (auto)biographies – memoirs, (auto)biographies (with the exception of fictionalized autobiographies, which are included in the fiction category), written correspondence (e.g. Bojoval jsem u Berlína, Chirurgovy poznámky, Meda Mládková - Můj úžasný život)

Genre_group

The NFC category contains a new layer of classification, genre_group, which is relevant for texts in the SCI, PRO and POP categories. It was created by grouping the individual fields (labelled genre in the CNC) into larger groups: humanities (HUM), social sciences (SSC), natural sciences (NAT), formal and technical sciences (FTS) and interdisciplinary texts (ITD); see the table below.

On the genre level, i.e. the most detailed level of text classification, the individual texts were classified (with very few exceptions) according to the subject-based categorization used by the National Library of the Czech Republic. Ambiguous cases were resolved by consensus among several classifiers. The fields are shown in detail in the table below.

HUM: humanities
  ANT: anthropology, ethnography
  THE: theatre, film, dance
  PHI: philosophy, religion
  HIS: history
  LAN: philology
  INF: library and information science
  ART: art, architecture

SSC: social sciences
  ECO: economy, business, logistics
  POL: politics, military
  LAW: law
  PSY: psychology
  SOC: sociology
  REC: sports, recreation, hobbies
  EDU: education

NAT: natural sciences
  BIO: biology
  PHY: physics
  GEO: geography, geology
  CHE: chemistry
  MED: medicine
  AGR: agriculture

FTS: formal and technical sciences
  MAT: mathematics
  TEC: technology
  ICT: information technology

ITD: interdisciplinary
  ITD: interdisciplinary
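
The grouping of fields into genre_group values can be written down as a simple lookup. The following is a minimal sketch that uses only the codes from the table above; the names GENRE_GROUPS, GENRE_TO_GROUP and genre_group_of are illustrative assumptions and are not part of any CNC tool.

```python
# Genre codes grouped by genre_group, copied from the table above.
# The variable and function names are hypothetical, chosen for this sketch.
GENRE_GROUPS = {
    "HUM": ["ANT", "THE", "PHI", "HIS", "LAN", "INF", "ART"],
    "SSC": ["ECO", "POL", "LAW", "PSY", "SOC", "REC", "EDU"],
    "NAT": ["BIO", "PHY", "GEO", "CHE", "MED", "AGR"],
    "FTS": ["MAT", "TEC", "ICT"],
    "ITD": ["ITD"],
}

# Inverted index: genre code -> genre_group code.
GENRE_TO_GROUP = {
    genre: group
    for group, genres in GENRE_GROUPS.items()
    for genre in genres
}


def genre_group_of(genre: str) -> str:
    """Return the genre_group code for a genre code (case-insensitive).

    Raises KeyError for codes that are not in the table.
    """
    return GENRE_TO_GROUP[genre.upper()]
```

Keeping the grouping in one dictionary and deriving the inverse from it means that a field only has to be listed once.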

3. Newspapers and magazines

The most significant changes compared to the previous SYN series classification:

[Figure: Share of texts in the LEI and NEW categories per year.]

The category of newspapers and magazines (NMG) on the txtype level is newly divided into:

  1. NEW: traditional newspapers (emphasis on current events, political news, news from home and abroad)
  2. LEI: leisure magazines (predominantly special interest magazines)

Traditional newspapers (NEW)

The category of traditional newspapers (NEW) is divided on the genre level into:

  1. NTW: nationwide newspapers
  2. REG: regional newspapers

Leisure magazines (LEI)

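The category of leisure magazines (LEI) is further divided on the genre level based on the topic:

  1. HOU: home, garden, hobbies
  2. LIF: lifestyle
  3. SCT: social life
  4. SPO: sports
  5. INT: curiosities
  6. MIX: society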

[Figure: The representation of major titles in the newspapers and magazines category.]

Sections

Selected periodicals (Mladá fronta Dnes, Právo, Hospodářské noviny, Lidové noviny, Deníky Bohemia, Týden, Deníky Moravia, Respekt, Regionální týdeník, Blesk, Dobrý den s kurýrem, Metro, E15, Jihlavské listy, Sedmička, Aha! neděle, Nedělní Blesk) furthermore offer information about the section in which the article was originally published. This information is contained in the section attribute, which characterizes the structure of the <text> and has one of the following values:

Overall classification

The following table offers a comprehensive summary of how texts are divided into categories based on the txtype_group, txtype, genre_group and genre attributes.

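txtype_group | txtype | genre_group | genre
FIC: fiction | NOV: novels | X: other | X: other
FIC: fiction | COL: short stories | X: other | X: other
FIC: fiction | VER: poetry | X: other | X: other
FIC: fiction | SCR: drama, screenplays | X: other | X: other
FIC: fiction | X: other | X: other | X: other
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | ANT: anthropology, ethnography
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | THE: theatre, film, dance
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | PHI: philosophy, religion
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | HIS: history, biography
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | MUS: music
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | LAN: philology
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | INF: library and information science
NFC: non-fiction literature | SCI / PRO / POP | HUM: humanities | ART: art, architecture
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | ECO: economy, business, logistics
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | POL: politics, military
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | LAW: law
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | PSY: psychology
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | SOC: sociology
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | REC: sports, recreation, hobbies
NFC: non-fiction literature | SCI / PRO / POP | SSC: social sciences | EDU: education
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | BIO: biology
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | PHY: physics
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | GEO: geography, geology
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | CHE: chemistry
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | MED: medicine
NFC: non-fiction literature | SCI / PRO / POP | NAT: natural sciences | AGR: agriculture
NFC: non-fiction literature | SCI / PRO / POP | FTS: formal and technical sciences | MAT: mathematics
NFC: non-fiction literature | SCI / PRO / POP | FTS: formal and technical sciences | TEC: technology
NFC: non-fiction literature | SCI / PRO / POP | FTS: formal and technical sciences | ICT: information technology
NFC: non-fiction literature | SCI / PRO / POP | ITD: interdisciplinary | ITD: interdisciplinary
NFC: non-fiction literature | MEM: memoirs, autobiographies | MEM: memoirs, autobiographies | MEM: memoirs, autobiographies
NFC: non-fiction literature | ADM: administrative | ADM: administrative | ADM: administrative
NMG: newspapers and magazines | NEW: traditional journalistic texts | X: other | NTW: nationwide newspapers
NMG: newspapers and magazines | NEW: traditional journalistic texts | X: other | REG: regional newspapers
NMG: newspapers and magazines | LEI: leisure magazines | X: other | HOU: home, garden, hobbies
NMG: newspapers and magazines | LEI: leisure magazines | X: other | LIF: lifestyle
NMG: newspapers and magazines | LEI: leisure magazines | X: other | SCT: social life
NMG: newspapers and magazines | LEI: leisure magazines | X: other | SPO: sports
NMG: newspapers and magazines | LEI: leisure magazines | X: other | INT: curiosities
NMG: newspapers and magazines | LEI: leisure magazines | X: other | MIX: society

SCI / PRO / POP stands for the three txtype values SCI: scientific literature, PRO: professional literature and POP: popular literature, which share the same genre_group and genre values.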
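
The hierarchy in the table above can also be used to check whether a combination of attribute values is consistent. The sketch below is one possible encoding, not an official CNC resource: the names HIERARCHY and is_valid_combination are hypothetical, and it assumes that the genre value X: other applies to every fiction txtype, which is how the merged table cell is read here.

```python
# Genre codes shared by the SCI, PRO and POP txtypes (from the table above).
SUBJECT_GENRES = {
    "ANT", "THE", "PHI", "HIS", "MUS", "LAN", "INF", "ART",  # HUM
    "ECO", "POL", "LAW", "PSY", "SOC", "REC", "EDU",         # SSC
    "BIO", "PHY", "GEO", "CHE", "MED", "AGR",                # NAT
    "MAT", "TEC", "ICT",                                     # FTS
    "ITD",                                                   # ITD
}

# txtype_group -> txtype -> set of genre codes allowed under that txtype.
# Assumption: all fiction txtypes carry the genre "X" (other).
HIERARCHY = {
    "FIC": {t: {"X"} for t in ("NOV", "COL", "VER", "SCR", "X")},
    "NFC": {
        "SCI": SUBJECT_GENRES,
        "PRO": SUBJECT_GENRES,
        "POP": SUBJECT_GENRES,
        "MEM": {"MEM"},
        "ADM": {"ADM"},
    },
    "NMG": {
        "NEW": {"NTW", "REG"},
        "LEI": {"HOU", "LIF", "SCT", "SPO", "INT", "MIX"},
    },
}


def is_valid_combination(txtype_group: str, txtype: str, genre: str) -> bool:
    """Return True if the three codes form a path that exists in the table."""
    return genre in HIERARCHY.get(txtype_group, {}).get(txtype, set())
```

For example, is_valid_combination("NMG", "LEI", "SPO") holds, whereas pairing the SPO genre with the NEW txtype does not.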

The classification of texts in SYN2015 is supplemented by several other characteristics. Each text now has a medium attribute, which takes one of the following values:

[Figure: The share of journals vs. non-journals in the SYN2015 corpus.]

In addition, we have created a new attribute which identifies the periodicity of the given publication and can have one of the following values:

The audience attribute records the age of the text's intended reader: we differentiate between texts written for the general public (GEN) and texts written for children and adolescents (JUN).

Each text now also contains information about the author's sex (authsex) or the translator's sex (transsex): female (F), male (M), not specified (X).

The metadata available in previous corpora are available here as well, namely title, author, translator, year of publication (pubyear), year of first publication (first_published), source language (srclang) and other characteristics.
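
To illustrate how these attributes fit together, here is a minimal sketch of a metadata record with a check of the closed value sets named above (GEN/JUN for audience; F/M/X for authsex and transsex). The class TextMeta, its defaults and the method problems are assumptions made for this example; they do not describe the corpus's actual storage format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Closed value sets taken from the description above.
AUDIENCE_VALUES = {"GEN", "JUN"}
SEX_VALUES = {"F", "M", "X"}


@dataclass
class TextMeta:
    """Hypothetical container for a few SYN2015 text attributes."""
    title: str
    author: str
    txtype_group: str
    txtype: str
    audience: str = "GEN"        # default chosen for the sketch only
    authsex: str = "X"           # X = not specified
    transsex: str = "X"
    pubyear: Optional[int] = None
    first_published: Optional[int] = None
    srclang: Optional[str] = None

    def problems(self) -> List[str]:
        """List attribute values that fall outside the documented sets."""
        issues = []
        if self.audience not in AUDIENCE_VALUES:
            issues.append(f"audience: unexpected value {self.audience!r}")
        if self.authsex not in SEX_VALUES:
            issues.append(f"authsex: unexpected value {self.authsex!r}")
        if self.transsex not in SEX_VALUES:
            issues.append(f"transsex: unexpected value {self.transsex!r}")
        return issues
```

A record whose values all come from the documented sets yields an empty list from problems().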

The share of text types in the corpus

All categories are taken into account when balancing the corpus, so that the result is as varied as possible, but the basic framework for determining the share of text types consists only of the txtype_group, txtype and genre_group categories. The proportions of the individual categories were chosen pragmatically, based on the texts the CNC had at its disposal from publishers and other sources.

txtype | genre / genre_group | category | percentage
Fiction (FIC) | | | 33.33 %
NOV | | novels | 26 %
COL | | short stories | 5 %
VER | | poetry | 1 %
SCR | | drama | 1 %
X | | other fiction | 0.33 %
Non-fiction (NFC) | | | 33.33 %
SCI/PRO/POP | HUM | humanities | 7 %
SCI/PRO/POP | SSC | social sciences | 7 %
SCI/PRO/POP | NAT | natural sciences | 7 %
SCI/PRO/POP | FTS | formal and technical sciences | 7 %
SCI/PRO/POP | ITD | interdisciplinary | 1 %
MEM | | memoirs, autobiographies | 4 %
ADM | | administrative texts | 0.33 %
Newspapers and magazines (NMG) | | | 33.33 %
NEW | NTW | national newspapers – specific (MF, LN, HN, Právo) | 10 %
NEW | NTW | national newspapers – other | 5 %
NEW | REG | regional newspapers | 5 %
LEI | | leisure magazines | 13.33 %
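
The percentages above can be checked arithmetically and turned into target sizes for a corpus of a given size. The sketch below is an illustration only: the keys NTW_specific and NTW_other are labels invented here for the two NTW rows, and the function names and the corpus size passed in are hypothetical.

```python
# Shares in percent, copied from the table above.
SHARES = {
    "FIC": {"NOV": 26.0, "COL": 5.0, "VER": 1.0, "SCR": 1.0, "X": 0.33},
    "NFC": {"HUM": 7.0, "SSC": 7.0, "NAT": 7.0, "FTS": 7.0,
            "ITD": 1.0, "MEM": 4.0, "ADM": 0.33},
    "NMG": {"NTW_specific": 10.0, "NTW_other": 5.0,
            "REG": 5.0, "LEI": 13.33},
}


def group_totals(shares):
    """Sum the category shares inside each txtype_group (two decimals)."""
    return {group: round(sum(cats.values()), 2) for group, cats in shares.items()}


def target_sizes(shares, corpus_size):
    """Split corpus_size across categories in proportion to their shares.

    The published shares are rounded and add up to 99.99, so they are
    normalised by their sum to make the targets add up to corpus_size.
    """
    flat = {
        f"{group}/{cat}": pct
        for group, cats in shares.items()
        for cat, pct in cats.items()
    }
    total = sum(flat.values())
    return {key: corpus_size * pct / total for key, pct in flat.items()}
```

Calling group_totals(SHARES) confirms that each of the three groups adds up to 33.33 %, in line with the statement that each group makes up one third of the corpus.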

Václav Cvrček, Michal Křen, Anna Čermáková, Lucie Chlumská, Michal Škrabal, Dominika Kováříková