
ONLINE corpora

ONLINE_NOW and ONLINE_ARCHIVE are two corpora which together form a monitor corpus of the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus covers the period from 2017 to the present. It has been created at the CNC with the help of data kindly provided by the Dataweps company.

The two corpora differ in the period they cover and in how often they are updated:

ONLINE_NOW – contains data from the current month plus 6 preceding months; updated daily
ONLINE_ARCHIVE – contains data from Feb 2017 up to the date when ONLINE_NOW begins; updated every month

Name		ONLINE
Size (as of Nov 2020)	Number of tokens	6.274 billion
Size (as of Nov 2020)	Number of sentences <s>	506.6 million
Additional information	Reference	NO
	Representative	NO
	Year of publication	2020

The ONLINE_NOW and ONLINE_ARCHIVE corpora are disjoint, i.e. they do not overlap. Therefore, to search the whole period since 2017, the results of queries on both corpora can simply be joined together without any manual corrections. As both corpora are identical in their structure and annotation, the following description does not distinguish between them.
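Because the two corpora do not overlap, joining results amounts to simple addition. The following is a minimal sketch of that idea, assuming the results of the same query have already been exported from each corpus as per-year hit counts; the function name and all the numbers are made up for illustration and do not come from the corpora.

```python
from collections import Counter


def merge_frequencies(archive_hits, now_hits):
    """Sum hit counts from ONLINE_ARCHIVE and ONLINE_NOW.

    Plain addition is enough because the corpora are disjoint,
    so no hit can be counted twice.
    """
    total = Counter(archive_hits)
    total.update(now_hits)  # adds counts key by key
    return total


# Hypothetical per-year hit counts for one query (illustrative values only).
archive_hits = {"2019": 120, "2020": 45}
now_hits = {"2020": 30}

merged = merge_frequencies(archive_hits, now_hits)
print(sorted(merged.items()))
```

A year that straddles the boundary between the two corpora (here the hypothetical 2020) simply receives the sum of both partial counts.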

Updates

The key feature of the ONLINE corpora is that they are updated regularly. This means that their contents change continually, and it is thus not possible to go back to previous versions of the corpora. Given that the input data (sources) can change, there is no guarantee that the structure and annotation of the ONLINE corpus will remain the same. If you need a stable reference corpus for research into the specifics of internet communication, you can use the NET corpus.

Updates of the ONLINE_NOW corpus take place daily around 9:00 (CET), when the data from the previous day are added and published. The size of each update varies from 4 to 8 million tokens, depending on the amount of downloaded material. On the first day of every month, the oldest month of the ONLINE_NOW corpus is moved to ONLINE_ARCHIVE.

Updates of the ONLINE_ARCHIVE corpus thus take place every month, when a whole month is removed from ONLINE_NOW and added to ONLINE_ARCHIVE (it is always the month that is half a year old).

For instance, on Aug 25, ONLINE_NOW contains data from Feb 1 until Aug 24 (inclusive), i.e. all the days of the current month except for the current day, plus the 6 whole preceding months. ONLINE_ARCHIVE contains all the older data up until Jan 31, i.e. until the moment when ONLINE_NOW begins. A change will come on Sep 2, when the data from the whole of February will be moved from ONLINE_NOW to ONLINE_ARCHIVE; the updated ONLINE_NOW corpus will then contain data from Mar 1 until Sep 1 (inclusive).
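The date arithmetic in this example can be sketched in a few lines. This is only an illustration of the rule as described here, not the procedure actually used at the CNC: the function names are hypothetical, the year 2020 is chosen arbitrarily, and Feb 1, 2017 is assumed as the first day of ONLINE_ARCHIVE (the text gives only the month). The sketch follows the worked example, in which the window moves once the first day of a new month has been added; the exact day of the monthly transfer is therefore an assumption.

```python
from datetime import date, timedelta


def shift_months(first_of_month: date, months: int) -> date:
    """Return the first day of the month `months` months away (negative = earlier)."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def corpus_windows(today: date, archive_start: date = date(2017, 2, 1)):
    """Return ((archive_first, archive_last), (now_first, now_last)) for `today`.

    Assumes the daily update has already run, so the newest data are
    from the previous day.
    """
    now_last = today - timedelta(days=1)
    # Month of the newest data plus the six whole months before it.
    now_first = shift_months(now_last.replace(day=1), -6)
    # ONLINE_ARCHIVE ends the day before ONLINE_NOW begins.
    archive_last = now_first - timedelta(days=1)
    return (archive_start, archive_last), (now_first, now_last)


archive, now = corpus_windows(date(2020, 8, 25))
print("ONLINE_ARCHIVE:", archive[0], "to", archive[1])
print("ONLINE_NOW:    ", now[0], "to", now[1])
```

Calling `corpus_windows(date(2020, 9, 2))` yields an ONLINE_NOW window starting on Mar 1, which corresponds to the state after February has been moved to the archive.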

Corpus structure

Compared to the SYN-series corpora of written Czech, the ONLINE corpus has several specific features. The data come from several sources (source attribute):

news – internet news
facebook – posts, including comments
twitter – posts, including comments
instagram – available only in certain periods
discussions – web discussions (under the individual articles on news servers)
forums – standalone web forums (independent of news servers)

These sources also differ in how they are processed. The internet news (news) from one day is joined together into a single document (<doc> structure) based on its original source (resource attribute). Within this structure, the individual articles are divided into separate structures (<text>). For instance, all the articles published on one day on the zatecky.denik.cz portal are joined together into a single <doc> structure, while each article is kept in a separate <text> structure.

All other sources are structured differently. Every day constitutes a single document (<doc>) for the whole source, i.e. one <doc> for discussions, one for forums and one for every social network. The individual contributions within these documents each have a separate <text> structure.
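The grouping rule described above can be illustrated with a small sketch. It only models the rule (news is grouped per day and resource, everything else per day and source); it is not the actual processing pipeline, and the record layout, function names and sample items are assumptions made for the example.

```python
from collections import defaultdict


def doc_key(item):
    """Return the key of the <doc> structure an item would belong to."""
    if item["source"] == "news":
        # News: one <doc> per day and per original portal (resource).
        return (item["date"], item["source"], item["resource"])
    # Other sources: one <doc> per day for the whole source.
    return (item["date"], item["source"])


def group_into_docs(items):
    """Map each <doc> key to the list of its <text> units (here: titles)."""
    docs = defaultdict(list)
    for item in items:
        docs[doc_key(item)].append(item["subject"])
    return dict(docs)


# Hypothetical records; only the attribute names come from the description.
items = [
    {"date": "2020-11-02", "source": "news", "resource": "zatecky.denik.cz", "subject": "Article A"},
    {"date": "2020-11-02", "source": "news", "resource": "zatecky.denik.cz", "subject": "Article B"},
    {"date": "2020-11-02", "source": "news", "resource": "blesk-cz", "subject": "Article C"},
    {"date": "2020-11-02", "source": "twitter", "resource": "user_one", "subject": "Post 1"},
    {"date": "2020-11-02", "source": "twitter", "resource": "user_two", "subject": "Post 2"},
]

for key, texts in group_into_docs(items).items():
    print(key, texts)
```

In this sketch the two articles from the same portal end up in one document, the third article gets its own, and both Twitter posts share a single document even though their authors differ.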

Text classification

The text classification of the ONLINE corpora is based on the classification designed for SYN2015, while enriching it with some additional attributes. Common attributes: txtype_group, txtype, genre_group, genre, medium, pubyear (publication year). Additional attributes: date (when published), source, resource, resource_url, media_type and subject (text title).

source

Source of the data – general classification that distinguishes news from discussions and social networks (see above).

resource

This attribute gives a more precise specification of the source of the text (typically a portal). The specific URL leading directly to the source text is given in the resource_url attribute, which is available on the individual structures at the text level. The value of the resource attribute differs across data sources:

for journalism (within the <doc> structure): the source portal or a section of it, e.g. blesk-cz, seznamzpravy, etc.
for social networks (within the <text> structure): the author of the post, i.e. their username
for discussions (within the <text> structure): the news portal on which the discussion takes place, e.g. novinky, zpravy.aktualne-cz
for forums (within the <text> structure): the portal, e.g. diskuze.modnipeklo-cz, emimino

media_type

The media_type attribute is relevant only for web journalism (source: news), where it classifies web portals according to a typology developed by the team of J. Šlerka within the Mapa medií project. The classification is based on reader preferences: portals with a similar audience are grouped together (see the detailed description of the method). For the purposes of annotating the ONLINE corpus, the original classification has been extended with several marginal types and comprises the following items (Czech labels, with English glosses in parentheses):

Analyticko-investigativní (analytical-investigative)
Antisystémové weby (anti-system websites)
Bulvární media (tabloid media)
Hlavní proud (mainstream)
Market-driven media
Názorové deníky (opinion dailies)
Ostatní (other)
Politický bulvár (political tabloid)
Stranické weby (party websites)
Web instituce (institutional website)

Annotation

The corpus is annotated with the standard tools for morphological analysis and lemmatization used for the SYN-series corpora. The results of the analysis should be comparable with the SYN2015 corpus (see the description of the morphological tags).

How to cite the ONLINE corpora

Cvrček, V. – Procházka, P.: ONLINE_NOW: monitoring corpus of internet Czech. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2020 [cit. YYYY-MM-DD¹⁾]. Available from WWW: http://www.korpus.cz

Cvrček, V. – Procházka, P.: ONLINE_ARCHIVE: monitoring corpus of internet Czech. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2020 [cit. YYYY-MM-DD]. Available from WWW: http://www.korpus.cz

¹⁾ The specific date in year-month-day order, e.g. 2020-10-02
