====== Overview of text classification in SYN2015 ======

Texts in the [[en:cnk:syn2015|SYN2015]] corpus are divided into three main groups ([[en:pojmy:txtype_group|txtype_group]]):

  - **FIC: fiction**
  - **NFC: non-fiction**
  - **NMG: newspapers and magazines**

Each of these groups makes up one third of all texts in the corpus.

===== 1. Fiction =====

Changes with regard to the previous SYN series classification:

  * The **fiction** (FIC) category is further divided on the ''txtype'' level into novels and novellas (NOV), short story collections (COL), poetry (VER), drama and screenplays (SCR), and finally the category other (X) for texts which cannot be placed in any of the above-mentioned groups. For fiction, we have removed the explicit classification on the ''genre'' level, because fiction texts often mix genres or have no single defined genre. However, when selecting texts for the corpus, we did take into account an operative genre classification (e.g. detective novel, thriller/horror, sci-fi, fantasy, humour/satire etc.) in order to ensure that the selection of texts would be as varied as possible.

The new division of fiction (FIC) on the [[en:pojmy:txtype|txtype]] level is:

  - **NOV: prose** – novels and novellas
  - **COL: shorter prose** – collections of short stories and other shorter prose texts (e.g. essays, blog entries etc.)
  - **VER: poetry** – collections of poetry, marginally song lyrics
  - **SCR: drama** – theatre plays, marginally also screenplays for film
  - **X: unclassified** – works which cannot be clearly assigned to one of the above-mentioned categories (e.g. mixed genre texts, collections of aphorisms, anecdotes, etc.)

===== 2. Non-fiction =====

The most significant changes compared to the previous SYN series classification:

  * **Non-fiction** (previously scientific) **literature** (NFC) reflects a certain level of "proficiency" and specialization of the target audience, and consists of three main types (''txtype''): scientific (SCI), professional (PRO) and popular (POP) literature. This macro-group should be understood as the opposite of fiction and journalistic texts: for this reason, it also contains administrative texts (ADM) in the broadest sense as well as a group of texts that are on the borderline between fiction and non-fiction, most typically memoirs and autobiographies (MEM). By changing the name of this group from //scientific// to the more general //non-fiction// we hope to represent its heterogeneous contents more accurately, while the term //scientific// is now reserved for academic texts (SCI). The newly defined category of professional literature (PRO) includes texts which are characterized by large quantities of practical information primarily intended for professionals in a given field.
  * Non-fiction literature now has an additional level for the SCI, PRO and POP text types, ''genre_group'', which was created by grouping individual disciplines or fields into larger categories and makes it possible to analyze texts from similar or related fields together: humanities (HUM), social sciences (SSC), natural sciences (NAT) and formal and technical sciences (FTS).
  * On the ''genre'' level, which contains the most detailed classification and reflects each specific field or discipline, the individual texts were classified so as to correspond as closely as possible to the subject categorization used by the [[http://text.nkp.cz/o-knihovne/odborne-cinnosti/zpracovani-fondu/vecne-zpracovani-vecne-autority/material-kon2|National Library of the Czech Republic]]. The fields are featured in detail in the table below.

Non-fiction literature (NFC) on the [[en:pojmy:txtype|txtype]] level is newly divided into:

  - **SCI: scientific literature** – scientific texts, including academic publications and university textbooks
  - **PRO: professional literature** – texts intended for professionals in a given field, including specialized periodicals (e.g. Logistika, Lékařské listy, Sestra, Zeměměřič, Stavitel, Konstrukce)
  - **POP: popular literature** – texts intended for a lay audience with an interest in the field (e.g. Bydlí s námi sladkovodní želva, Botanické zahrady a arboreta České republiky, Praktický houbař)
  - **ADM: administrative texts** – rules and regulations, meeting minutes, instructions and guidelines, annual reports, etc.
  - **MEM: memoirs, (auto)biographies** – memoirs, (auto)biographies (with the exception of fictionalized autobiographies, which are included in the fiction category), written correspondence (e.g. Bojoval jsem u Berlína, Chirurgovy poznámky, Meda Mládková - Můj úžasný život)

=== Genre_group ===

The NFC category contains a new layer of classification, **[[en:pojmy:genre_group|genre_group]]**, which is relevant for texts in the SCI, PRO and POP categories. It was created by grouping together the individual fields (labelled [[en:pojmy:genre|genre]] in the CNC) into larger groups: humanities (HUM), social sciences (SSC), natural sciences (NAT) and formal and technical sciences (FTS); please refer to the table below.

On the ''genre'' level, in other words the most detailed level of text classification, the individual texts (with very few exceptions) were classified in compliance with the subject-based categorization used by the National Library of the Czech Republic. Ambiguous cases were resolved by consensus among several classifiers. The fields are shown in detail in the table below.

^ HUM: humanities ^ SSC: social sciences ^ NAT: natural sciences ^ FTS: formal and technical sciences ^ ITD: interdisciplinary ^
| ANT: anthropology, ethnography\\ THE: theatre, film, dance\\ PHI: philosophy, religion\\ HIS: history, biography\\ MUS: music\\ LAN: philology\\ INF: library and information science\\ ART: art, architecture | ECO: economy, business, logistics\\ POL: politics, military\\ LAW: law\\ PSY: psychology\\ SOC: sociology\\ REC: sports, recreation, hobbies\\ EDU: education | BIO: biology\\ PHY: physics\\ GEO: geography, geology\\ CHE: chemistry\\ MED: medicine\\ AGR: agriculture | MAT: mathematics\\ TEC: technology\\ ICT: information technology | ITD: interdisciplinary |

===== 3. Newspapers and magazines =====

The most significant changes compared to the previous SYN series classification:

  * The once monolithic category of **newspapers and magazines** (NMG) is now divided on the ''txtype'' level into the groups **traditional** (NEW) and **leisure** (LEI). Traditional newspapers (typically daily newspapers) are further divided on the ''genre'' level into the groups **national** (NTW) and **regional** (REG). Leisure magazines (mostly various types of special interest magazines) are also divided on the ''genre'' level, into the following thematic groups: home, garden, hobbies (HOU), lifestyle (LIF), social life (SCT), sports (SPO), international curiosities (INT) and society (MIX).
  * Wherever possible, important journalistic titles published after 2010 are given a more detailed classification (on the level of individual articles) into thematic **sections** (the ''[[en:seznamy:section|text.section]]'' attribute): news (foreign, domestic, regional), politics, economy, sports, culture, leisure, commentaries, crime, social life and front page.

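The two-level division of NMG described above (''txtype'', then ''genre'') can be sketched as a nested lookup table. This is only an illustration: the names ''NMG_GENRES'' and ''txtype_of'' are hypothetical and are not part of any CNC tool; only the category codes come from this page.

```python
# Hypothetical lookup table; only the codes and labels are taken from the text.
NMG_GENRES = {
    "NEW": {"NTW": "national", "REG": "regional"},
    "LEI": {
        "HOU": "home, garden, hobbies",
        "LIF": "lifestyle",
        "SCT": "social life",
        "SPO": "sports",
        "INT": "international curiosities",
        "MIX": "society",
    },
}


def txtype_of(genre: str) -> str:
    """Return the NMG txtype (NEW or LEI) that a genre code belongs to."""
    for txtype, genres in NMG_GENRES.items():
        if genre in genres:
            return txtype
    raise KeyError(f"unknown NMG genre code: {genre}")


if __name__ == "__main__":
    print(txtype_of("REG"))  # NEW
    print(txtype_of("SPO"))  # LEI
```

Because every genre code occurs under exactly one ''txtype'', the reverse lookup is unambiguous.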
[{{ :cnk:syn2015-lei-new.png?direct&450|Share of texts in the LEI and NEW categories per year.}}]

The category of newspapers and magazines (NMG) on the [[en:pojmy:txtype|txtype]] level is newly divided into:

  - **NEW: traditional newspapers** (emphasis on current events, political news, news from home and abroad)
  - **LEI: leisure magazines** (predominantly special interest magazines)

==== Traditional newspapers (NEW) ====

The category of traditional newspapers (NEW) is divided on the [[en:pojmy:genre|genre]] level into:

  * **NTW: national** (e.g. Lidové noviny, Hospodářské noviny, Mladá fronta DNES, Právo, Respekt, Reflex)
  * **REG: regional** (e.g. Chrudimský zpravodaj, Kopřivnické noviny, Týnecké listy)

==== Leisure magazines (LEI) ====

The category of leisure magazines (LEI) is further divided on the [[en:pojmy:genre|genre]] level based on the topic:

  * **HOU: home, garden, hobbies** (e.g. Bydlení, Chatař & chalupář, Blesk Hobby, Dům a zahrada)
  * **LIF: lifestyle** (e.g. Marianne, Elle, JOY, Esprit, Žena a život, Kondice, Maxim, Vlasta)
  * **SCT: social life** (e.g. Blesk, Aha!, Story, Rytmus života)
  * **SPO: sports** (e.g. Sport, Nedělní sport, Sport magazín, Sport GÓÓÓL!)
  * **INT: international curiosities** (e.g. 100+1 zahraniční zajímavost, ABC, Lidé a země, Geo, National Geographic Česko)
  * **MIX: society** (e.g. Instinkt, Kraus, Květy, IN Magazín, Magazín Práva, Pátek Lidových novin)

[{{ :cnk:syn2015-nmg-tituly.png?direct&500|The representation of major titles in the newspapers and magazines category.}}]

=== Sections ===

Selected periodicals (Mladá fronta DNES, Právo, Hospodářské noviny, Lidové noviny, Deníky Bohemia, Týden, Deníky Moravia, Respekt, Regionální týdeník, Blesk, Dobrý den s kurýrem, Metro, E15, Jihlavské listy, Sedmička, Aha! neděle, Nedělní Blesk) furthermore offer information about the section in which the article was originally published.
This information is contained in the [[en:seznamy:section|section]] attribute, which characterizes the ''text'' structure and has one of the following values:

  * current events
  * foreign news
  * domestic news
  * regional news
  * politics
  * economy
  * sports
  * culture
  * leisure
  * commentaries
  * crime
  * social life
  * front page

===== Overall classification =====

The following table offers a comprehensive summary of how texts are divided into categories based on the ''txtype_group'', ''txtype'', ''genre_group'' and ''genre'' attributes.

^ txtype_group ^ txtype ^ genre_group ^ genre ^
| FIC: fiction | NOV: novels | X: other | X: other |
| ::: | COL: short stories | ::: | ::: |
| ::: | VER: poetry | ::: | ::: |
| ::: | SCR: drama, screenplays | ::: | ::: |
| ::: | X: other | ::: | ::: |
| NFC: non-fiction literature | SCI: scientific literature\\ PRO: professional literature\\ POP: popular literature | HUM: humanities | ANT: anthropology, ethnography |
| ::: | ::: | ::: | THE: theatre, film, dance |
| ::: | ::: | ::: | PHI: philosophy, religion |
| ::: | ::: | ::: | HIS: history, biography |
| ::: | ::: | ::: | MUS: music |
| ::: | ::: | ::: | LAN: philology |
| ::: | ::: | ::: | INF: library and information science |
| ::: | ::: | ::: | ART: art, architecture |
| ::: | ::: | SSC: social sciences | ECO: economy, business, logistics |
| ::: | ::: | ::: | POL: politics, military |
| ::: | ::: | ::: | LAW: law |
| ::: | ::: | ::: | PSY: psychology |
| ::: | ::: | ::: | SOC: sociology |
| ::: | ::: | ::: | REC: sports, recreation, hobbies |
| ::: | ::: | ::: | EDU: education |
| ::: | ::: | NAT: natural sciences | BIO: biology |
| ::: | ::: | ::: | PHY: physics |
| ::: | ::: | ::: | GEO: geography, geology |
| ::: | ::: | ::: | CHE: chemistry |
| ::: | ::: | ::: | MED: medicine |
| ::: | ::: | ::: | AGR: agriculture |
| ::: | ::: | FTS: formal and technical sciences | MAT: mathematics |
| ::: | ::: | ::: | TEC: technology |
| ::: | ::: | ::: | ICT: information technology |
| ::: | ::: | ITD: interdisciplinary | ITD: interdisciplinary |
| ::: | MEM: memoirs, autobiographies | MEM: memoirs, autobiographies | MEM: memoirs, autobiographies |
| ::: | ADM: administrative | ADM: administrative | ADM: administrative |
| NMG: newspapers and magazines | NEW: traditional newspapers | X: other | NTW: national newspapers |
| ::: | ::: | ::: | REG: regional newspapers |
| ::: | LEI: leisure magazines | X: other | HOU: home, garden, hobbies |
| ::: | ::: | ::: | LIF: lifestyle |
| ::: | ::: | ::: | SCT: social life |
| ::: | ::: | ::: | SPO: sports |
| ::: | ::: | ::: | INT: international curiosities |
| ::: | ::: | ::: | MIX: society |

The classification of texts in SYN2015 is supplemented by several other characteristics. Each text now has the [[en:seznamy:med|medium]] attribute, which takes one of the following values:

  * B: book
  * J: journal
  * NWS: newspaper
  * OTH: other printed medium
  * REF: reference handbook
  * TXB: textbook

[{{ :cnk:syn2015-periodicita.png?direct&250|The share of journals vs. non-journals in the SYN2015 corpus.}}]

In addition, we have created a new attribute which identifies the [[en:seznamy:periodicity|periodicity]] of the given publication and can have one of the following values:

  * BI: less often than monthly
  * DA: daily
  * MO: monthly
  * NP: non-periodical publication
  * WE: weekly, fortnightly

The [[en:seznamy:audience|audience]] attribute gives information about the **age of the text's intended reader**: we differentiate between texts written for the general public (GEN) and texts for children and adolescents (JUN).

Each text also contains information about the **author's sex** ([[en:seznamy:authsex-transsex|authsex]]) and, where applicable, the **translator's sex** ([[en:seznamy:authsex-transsex|transsex]]): female (F), male (M), not specified (X).

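The value sets listed above lend themselves to a simple consistency check on metadata records. The following is a minimal sketch under stated assumptions: the dictionary keys follow the attribute names as they appear on this page (the actual attribute names in the corpus may differ), and ''ALLOWED'' and ''invalid_fields'' are hypothetical names introduced only for illustration.

```python
# Allowed values as listed in the text; the keys are assumed attribute names.
ALLOWED = {
    "medium": {"B", "J", "NWS", "OTH", "REF", "TXB"},
    "periodicity": {"BI", "DA", "MO", "NP", "WE"},
    "audience": {"GEN", "JUN"},
    "authsex": {"F", "M", "X"},
    "transsex": {"F", "M", "X"},
}


def invalid_fields(record: dict) -> list:
    """Return the sorted names of known attributes whose value is not allowed."""
    return sorted(
        name
        for name, value in record.items()
        if name in ALLOWED and value not in ALLOWED[name]
    )


if __name__ == "__main__":
    # A made-up record with one deliberately wrong value.
    sample = {"medium": "NWS", "periodicity": "DA", "audience": "GEN", "authsex": "Q"}
    print(invalid_fields(sample))  # ['authsex']
```

Attributes that are not in ''ALLOWED'' (e.g. free-text fields such as a title) are ignored by the check rather than reported.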
Of course, the metadata available in previous corpora are also available here, namely ''title'', ''author'', ''translator'', year of publication (''pubyear''), year of first publication (''first_published''), source language (''[[en:seznamy:srclang|srclang]]'') and other characteristics.

===== The share of text types in the corpus =====

All categories are taken into account when compiling a balanced corpus, so that the result is as varied as possible. However, the basic framework for determining the share of text types consists only of the ''txtype_group'', ''txtype'' and ''genre_group'' categories. The proportions of the individual categories were chosen pragmatically, based on the texts which the CNC had at its disposal from publishers and other sources.

^ txtype ^ genre / genre_group ^ category ^ percentage ^
| **Fiction** (FIC) ||| 33.33 % |
| NOV | | novels | 26 % |
| COL | | short stories | 5 % |
| VER | | poetry | 1 % |
| SCR | | drama | 1 % |
| X | | other fiction | 0.33 % |
| **Non-fiction** (NFC) ||| 33.33 % |
| SCI/PRO/POP | HUM | humanities | 7 % |
| ::: | SSC | social sciences | 7 % |
| ::: | NAT | natural sciences | 7 % |
| ::: | FTS | formal and technical sciences | 7 % |
| ::: | ITD | interdisciplinary | 1 % |
| MEM | | memoirs, autobiographies | 4 % |
| ADM | | administrative texts | 0.33 % |
| **Newspapers and magazines** (NMG) ||| 33.33 % |
| NEW | NTW | national newspapers – specific (MF, LN, HN, Právo) | 10 % |
| ::: | NTW | national newspapers – other | 5 % |
| ::: | REG | regional newspapers | 5 % |
| LEI | | leisure magazines | 13.33 % |

 --- //Václav Cvrček, Michal Křen, Anna Čermáková, Lucie Chlumská, Michal Škrabal, Dominika Kováříková//