~~NOTOC~~

====== ONLINE1 (1st generation) ======

The monitor corpus ONLINE1 (1st generation) aims to map the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus covers the period from January 2017 to March 2021. It was created at the CNC with the help of data kindly provided by the [[https://www.dataweps.com|Dataweps]] company. ONLINE1 is no longer being updated; the subsequent period is covered by the [[en:cnk:online:gen2|2nd generation: ONLINE2]].

^ Name ^^ ONLINE1 ^
^ Size ^ Number of tokens | 7.053 billion |
^ ::: ^ Number of sentences | 563 million |
^ Additional information ^ Reference | NO |
^ ::: ^ Representative | NO |
^ ::: ^ Period covered | 1/2017 – 3/2021 |
^ ::: ^ Year of publication | 2020 |

===== Corpus structure =====

Compared to the [[en:cnk:syn|SYN-series]] corpora of written Czech, the ONLINE1 corpus has several specific features. The data come from several sources (cf. the ''source'' attribute):

  * **news** -- internet news
  * **facebook** -- posts, including comments (Facebook data have not been collected since December 2020)
  * **twitter** -- posts, including comments
  * **instagram** -- available only in certain periods
  * **discussions** -- web discussions (under the individual articles on news servers)
  * **forums** -- standalone web forums (independent of news servers)

These sources also differ in how they are processed. The **internet news** from one day are joined together into a single document structure based on their original web portal (''resource'' attribute). Within this structure, the individual articles are kept as separate ''text'' structures. For instance, all the articles published in one day on the [[https://zatecky.denik.cz/|zatecky.denik.cz]] portal are joined together into a single document structure, while each article remains a separate ''text'' structure within it. All other sources are structured differently.
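
The grouping of internet news described above can be illustrated with a minimal Python sketch. It is not the CNC's actual processing pipeline: the record fields (''date'', ''resource'', ''title''), the portal names and the function name ''group_news'' are all hypothetical and serve only to show the "one document per portal per day" rule.

```python
from collections import defaultdict

# Hypothetical input records; the field names and portal names are
# assumptions made for this sketch, not the corpus's real data format.
articles = [
    {"date": "2020-05-01", "resource": "portal-a", "title": "Article A"},
    {"date": "2020-05-01", "resource": "portal-a", "title": "Article B"},
    {"date": "2020-05-01", "resource": "portal-b", "title": "Article C"},
    {"date": "2020-05-02", "resource": "portal-a", "title": "Article D"},
]


def group_news(records):
    """Group articles into one document per (date, resource) pair.

    Each article stays a separate item inside its document, mirroring
    the separate structure kept for every article.
    """
    documents = defaultdict(list)
    for record in records:
        key = (record["date"], record["resource"])
        documents[key].append(record["title"])
    return dict(documents)


news_documents = group_news(articles)
for (date, resource), titles in sorted(news_documents.items()):
    print(date, resource, titles)
```

In this sketch, the four sample articles end up in three documents, because two of them share both the day and the portal.
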
For all sources other than news, the whole day's content of the entire source constitutes a single document, i.e. one document for discussions, one for forums and one for each social network. The individual contributions within these documents each have a separate ''text'' structure.

===== Text classification =====

The text classification of the ONLINE corpora is based on [[en:cnk:klasifikace_textu_syn2015|the classification designed for SYN2015]], enriched with some additional attributes.

Common attributes: [[en:seznamy:txtype_group|txtype_group]], [[en:seznamy:txtype|txtype]], [[en:seznamy:genre_group|genre_group]], [[en:seznamy:genre|genre]], [[en:seznamy:med|medium]], pubyear (publication year).

Additional attributes: date (date of publication), source, resource, resource_url, media_type and subject (text title).

==== source ====

Source of the data -- a general classification that distinguishes news from discussions and social networks (see above).

==== resource ====

A more detailed specification of the source, typically a web portal (the specific URL is given in ''resource_url'', which is available as an attribute of the ''text'' structures). The value of ''resource'' depends on the ''source'':

  * in the case of //news// (within the document structure): the original web portal or a part of it, e.g. //blesk-cz//, //seznamzpravy// etc.
  * in the case of //social networks// (within the ''text'' structure): the author (possibly a username) of the individual post or comment
  * in the case of //discussions// (within the ''text'' structure): the original news portal of the discussion, e.g. //novinky//, //zpravy.aktualne-cz//
  * in the case of //web forums// (within the ''text'' structure): the original web forum, e.g. //diskuze.modnipeklo-cz//, //emimino//

==== media_type ====

The ''media_type'' attribute is relevant only for web news (source: ''news''); it gives their classification based on the typology developed by the team of J. Šlerka within the [[http://www.mapamedii.cz|Media map]] project.
The classification is based on readers' preferences: news portals with a similar audience are grouped together (see [[http://www.mapamedii.cz/mapa/typologie/index.php|detailed description of the method]]). For the ONLINE corpus, the original classification has been extended with some rather marginal categories, and it distinguishes the following types:

  * Analyticko-investigativní (analytical-investigative)
  * Antisystémové weby (anti-system media)
  * Bulvární media (tabloid media)
  * Hlavní proud (mainstream)
  * Market-driven media (market-driven media)
  * Názorové deníky (opinion-based media)
  * Ostatní (other)
  * Politický bulvár (political tabloids)
  * Stranické weby (party sites)
  * Web instituce (institution sites)

===== Annotation =====

The corpus is annotated using the standard tools for the [[en:pojmy:morfologicka_analyza|morphological analysis]] and [[en:pojmy:lemma|lemmatization]] of the SYN-series corpora. The annotation is thus comparable with, e.g., the [[en:cnk:syn2015|SYN2015]] corpus.

====== How to cite ONLINE ======

Cvrček, V. – Procházka, P.: //ONLINE1: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2020. Available from: http://www.korpus.cz