en:cnk:online:gen2 [2022/12/22 12:36] – vaclavcvrcek | en:cnk:online:gen2 [2023/01/12 09:46] (current) – vaclavcvrcek |
---|
====== ONLINE2 (2nd generation) ====== | ====== ONLINE2 (2nd generation) ====== |
| |
ONLINE_NOW and ONLINE_ARCHIVE are two corpora that together form a monitor corpus of the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus covers the period from 2017 to the present. It was created at the CNC with the help of data kindly provided by the [[https://www.dataweps.com|Dataweps]] company. | ONLINE2_NOW and ONLINE2_ARCHIVE are two corpora that together form a monitor corpus ([[en:cnk:online|ONLINE]]) of the dynamic content of the Czech web, i.e. internet journalism. The corpus covers the period from April 2021 to the present. It was created at the CNC with the help of data kindly provided by the [[https://monitora.cz/|Monitora]] company. |
| |
The two corpora differ in their extent and update frequency: | The two corpora differ in their extent and update frequency: |
* **ONLINE_NOW** -- contains daily updates from the current month plus 6 preceding months; updated daily | * **ONLINE2_NOW** -- contains daily updates from the current month plus 6 preceding months; updated daily |
  * **ONLINE_ARCHIVE** -- contains data from Feb 2017 up to the date when ONLINE_NOW begins; updated every month | * **ONLINE2_ARCHIVE** -- contains data from April 2021 up to the date when ONLINE2_NOW begins; updated every month |
| |
<WRAP right 35%> | <WRAP right 35%> |
^ <fs medium>Name</fs> ^^ <fs medium>ONLINE</fs> ^ | ^ <fs medium>Name</fs> ^^ <fs medium>ONLINE2</fs> ^ |
^ Size (as of Nov 2020) ^ Number of [[en:pojmy:token|tokens]] | 6.274 billion | | ^ Size (as of Dec 2022) ^ Number of tokens | 866 million | |
^ ::: ^ Number of sentences <s> | 506.6 million | | ^ ::: ^ Number of sentences <s> | 52.2 million | |
^ Additional information ^ [[en:pojmy:referencni|Reference]] | NO | | ^ Additional information ^ Reference | NO | |
^ ::: ^ [[en:pojmy:reprezentativnost|Representative]] | NO | | ^ ::: ^ Representative | NO | |
| ^ ::: ^ Period covered | since April 2021 | |
^ ::: ^ Year of publication | 2020 | | ^ ::: ^ Year of publication | 2020 | |
</WRAP> | </WRAP> |
| |
The ONLINE_NOW and ONLINE_ARCHIVE corpora are disjoint, i.e. they do not overlap. Therefore, to search the whole period since 2017, the results of queries on both corpora can simply be combined; no manual corrections are needed. As both corpora are identical in their structure and annotation, the following description does not distinguish between them. | The ONLINE_NOW and ONLINE_ARCHIVE corpora (as well as [[en:cnk:online:gen1|ONLINE1]]) are disjoint, i.e. they do not overlap. Therefore, to search the whole period since 2017, the results of queries on the individual corpora can simply be combined; no manual corrections are needed. As both corpora are identical in their structure and annotation, the following description does not distinguish between them. |
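
Because the corpora do not overlap, combining query results amounts to adding up the hit counts. A minimal Python sketch of this, assuming the per-corpus frequency lists have already been exported as word-to-count mappings (the function names, words and numbers below are hypothetical and serve only as illustration):

```python
from collections import Counter

def merge_counts(*per_corpus_counts):
    """Sum hit counts from disjoint corpora (no overlap, so no correction is needed)."""
    total = Counter()
    for counts in per_corpus_counts:
        total.update(counts)
    return total

def ipm(hits, corpus_size):
    """Relative frequency in instances per million tokens."""
    return hits / corpus_size * 1_000_000

# Hypothetical hit counts and corpus sizes (in tokens), for illustration only.
now_counts = {"volby": 120, "vláda": 300}
archive_counts = {"volby": 480, "vláda": 900}
now_size, archive_size = 200_000_000, 600_000_000

merged = merge_counts(now_counts, archive_counts)
combined_ipm = ipm(merged["volby"], now_size + archive_size)
```

Note that a relative frequency for the whole period should be recomputed from the summed hits and the summed corpus sizes, as above, rather than by averaging the per-corpus values.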
| |
==== Updates ==== | ==== Updates ==== |
A key feature of the ONLINE corpora is that they are updated regularly. This means that their contents **change continually**, and it is thus not possible to return to previous versions of the corpora. Given that the input data (sources) can change, there is no guarantee that the structure or the annotation of the ONLINE corpus will remain the same. If you need a stable reference corpus for research into the specifics of internet communication, you can use the [[en:cnk:net|NET]] corpus. | A key feature of the ONLINE corpora is that they are updated regularly. This means that their contents **change continually**, and it is thus not possible to return to previous versions of the corpora. Given that the input data (sources) can change, there is no guarantee that the structure or the annotation of the ONLINE corpus will remain the same. If you need a stable reference corpus for research into the specifics of internet communication, you can use the [[en:cnk:net|NET]] corpus. |
| |
Updates of the ONLINE_NOW corpus take place **daily around 9:00 (CET)**, when the data from the previous day is added and published. The size of each update varies (depending on the amount of downloaded material) from 4 to 8 million tokens. On the first day of every month, the oldest month of the ONLINE_NOW corpus is moved to ONLINE_ARCHIVE. | Updates of the ONLINE2_NOW corpus take place **daily in the morning**, when the data from the previous day is added and published. The size of each update varies (depending on the amount of downloaded material) from 0.8 to 1.5 million tokens. On the first day of every month, the oldest month of the ONLINE2_NOW corpus is moved to ONLINE2_ARCHIVE. |
| |
Updates of the ONLINE_ARCHIVE corpus thus take place **every month**, when a whole month is removed from ONLINE_NOW and added to ONLINE_ARCHIVE (it is always the month that is half a year old). | Updates of the ONLINE2_ARCHIVE corpus thus take place **every month**, when a whole month is removed from ONLINE2_NOW and added to ONLINE2_ARCHIVE (it is always the month that is half a year old). |
| |
<fs smaller> | |
For instance, on Aug 25, ONLINE_NOW contains data from Feb 1 until Aug 24 (inclusive), i.e. all the days of the current month except for the current day, plus the 6 whole preceding months. ONLINE_ARCHIVE contains all the older data up until Jan 31, i.e. up to the date when ONLINE_NOW begins. A change will come on Sep 2, when the data for the whole of February will be moved from ONLINE_NOW to ONLINE_ARCHIVE; the updated ONLINE_NOW corpus will then contain data from Mar 1 until Sep 1 (inclusive). | For instance, on Aug 25, ONLINE_NOW contains data from Feb 1 until Aug 24 (inclusive), i.e. all the days of the current month except for the current day, plus the 6 whole preceding months. ONLINE_ARCHIVE contains all the older data up until Jan 31, i.e. up to the date when ONLINE_NOW begins. A change will come on Sep 2, when the data for the whole of February will be moved from ONLINE_NOW to ONLINE_ARCHIVE; the updated ONLINE_NOW corpus will then contain data from Mar 1 until Sep 1 (inclusive). |
</fs> | |
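
The date arithmetic in the example above can be sketched in Python. This is only an illustration of the rule as described, not the actual update procedure: the function name ''now_window'' is hypothetical, and the sketch assumes that the window is anchored on the most recent day already included (the previous day), which matches the Aug 25 and Sep 2 cases in the example.

```python
from datetime import date, timedelta

def now_window(today: date) -> tuple[date, date]:
    """Return (first_day, last_day) covered by the NOW corpus on a given day.

    Assumption: the last included day is yesterday, and the window starts on
    the first day of the month lying 6 months before the month of that day.
    """
    last_day = today - timedelta(days=1)
    # Count months linearly so that going back 6 months can cross a year boundary.
    month_index = last_day.year * 12 + (last_day.month - 1) - 6
    first_day = date(month_index // 12, month_index % 12 + 1, 1)
    return first_day, last_day

# The two situations from the example (the year is chosen arbitrarily).
aug_25 = now_window(date(2022, 8, 25))
sep_2 = now_window(date(2022, 9, 2))
```

Everything older than the returned first day would then belong to the ARCHIVE corpus.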
| |
| |
===== Corpus structure ===== | ===== Corpus structure ===== |
| |
Compared to the [[en:cnk:syn|SYN-series]] corpora of written Czech, the ONLINE corpus has several specific features. The data come from several sources (cf. the ''source'' attribute): | Compared to the [[en:cnk:syn|SYN-series]] corpora of written Czech, the ONLINE2 corpus has several specific features. For backward compatibility with the previous generation, we keep the ''source'' attribute in the data, which indicates the type of internet data. The second generation of the corpus consists solely of online journalism; therefore, the value of this attribute is always **news**. |
* **news** -- internet news | |
  * **facebook** -- posts, including comments (the collection of facebook data has been discontinued since December 2020) | |
* **twitter** -- posts, including comments | |
* **instagram** -- available only in certain periods | |
* **discussions** -- web discussions (under the individual articles on news servers) | |
  * **forums** -- standalone web forums (independent of news servers) | |
| |
These sources also differ in their processing. The **internet news** from one day are joined together into a single document (''<doc>'' structure) based on their original web portal (''resource'' attribute). Within this structure, the individual articles are divided into separate structures (''<text>''). For instance, all the articles published on one day on the [[https://zatecky.denik.cz/|zatecky.denik.cz]] portal are joined together into a single ''<doc>'' structure while being kept in separate ''<text>'' structures. | Texts from one day are joined together into a single document (''<doc>'' structure) based on their original web portal (''resource'' attribute). Within this structure, the individual articles are divided into separate structures (''<text>''). For instance, all the articles published on one day on the [[https://burzovnisvet.cz/|burzovnisvet.cz]] portal are joined together into a single ''<doc>'' structure while being kept in separate ''<text>'' structures. |
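
The grouping of articles into documents can be pictured with a short Python sketch. It only illustrates the ''<doc>''/''<text>'' nesting described above; the function name, the record layout and the sample values (including the form of the ''resource'' value) are assumptions made for this example, not the actual processing pipeline.

```python
from collections import defaultdict

def group_into_docs(articles):
    """Group articles into <doc>-like units keyed by (date, resource).

    Each article stays a separate item inside its group, mirroring the
    separate <text> structures within one <doc>.
    """
    docs = defaultdict(list)
    for article in articles:
        docs[(article["date"], article["resource"])].append(article)
    return dict(docs)

# Hypothetical sample records, for illustration only.
articles = [
    {"date": "2022-12-01", "resource": "burzovnisvet.cz", "subject": "Article A"},
    {"date": "2022-12-01", "resource": "burzovnisvet.cz", "subject": "Article B"},
    {"date": "2022-12-02", "resource": "burzovnisvet.cz", "subject": "Article C"},
]

docs = group_into_docs(articles)
```

Here the two articles from the same portal and day end up in one group, while the article from the following day forms a group of its own.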
| |
All other sources are structured differently. The whole day's content of a source constitutes a single document (''<doc>''), i.e. one ''<doc>'' for discussions, one for forums and one for every social network. The individual contributions within these documents each have a separate ''<text>'' structure. | |
| |
| |
The text classification of the ONLINE corpora is based on [[en:cnk:klasifikace_textu_syn2015|the classification designed for SYN2015]], while enriching it with some additional attributes. Common attributes: [[en:seznamy:txtype_group|txtype_group]], [[en:seznamy:txtype|txtype]], [[en:seznamy:genre_group|genre_group]], [[en:seznamy:genre|genre]], [[en:seznamy:med|medium]], pubyear (publication year). Additional attributes are: date (when published), source, resource, resource_url, media_type and subject (text title). | The text classification of the ONLINE corpora is based on [[en:cnk:klasifikace_textu_syn2015|the classification designed for SYN2015]], while enriching it with some additional attributes. Common attributes: [[en:seznamy:txtype_group|txtype_group]], [[en:seznamy:txtype|txtype]], [[en:seznamy:genre_group|genre_group]], [[en:seznamy:genre|genre]], [[en:seznamy:med|medium]], pubyear (publication year). Additional attributes are: date (when published), source, resource, resource_url, media_type and subject (text title). |
| |
==== source ==== | |
| |
Source of the data -- general classification that distinguishes news from discussions and social networks (see above). | |
| |
==== resource ==== | ==== resource ==== |
| |
A more detailed specification of the source, typically a web portal (the concrete URL is given in ''resource_url'', which is available as an attribute of the ''text'' structures). The value of ''resource'' depends on the ''source'': | A more detailed specification of the source, typically a web portal; the concrete URL is given in ''text_url'', which is available as an attribute of the ''text'' structures. |
| |
* in the case of //news// (within ''<doc>''): original web portal or its part, e.g. //blesk-cz//, //seznamzpravy// etc. | |
  * in the case of //social networks// (within ''<text>''): author (possibly a username) of the individual post or comment | |
* in the case of //discussions// (within ''<text>''): original news portal of the discussion, e.g. //novinky//, //zpravy.aktualne-cz// | |
* in the case of //web forums// (within ''<text>''): original web forum, e.g. //diskuze.modnipeklo-cz//, //emimino// | |
| |
| |
==== media_type ==== | ==== media_type ==== |
| |
The ''media_type'' attribute is relevant only for web news (source: ''news'') and gives their classification based on the typology elaborated by the team of J. Šlerka within the [[http://www.mapamedii.cz|Media map]] project. The classification is based on readers' preferences and groups together news portals with a similar audience (see [[http://www.mapamedii.cz/mapa/typologie/index.php|detailed description of the method]]). For the ONLINE corpus, the original classification has been extended with some rather marginal categories, and it distinguishes the following types: | The ''media_type'' attribute is relevant only for web news (source: ''news'') and gives their classification based on the typology elaborated by the team of J. Šlerka within the [[http://www.mapamedii.cz|Media map]] project. The classification is based on readers' preferences and groups together news portals with a similar audience. For the ONLINE corpus, the original classification has been extended with some rather marginal categories, and it distinguishes the following types: |
| |
* Analyticko-investigativní (analytical-investigative) | * Analyticko-investigativní (analytical-investigative) |
* Stranické weby (party sites) | * Stranické weby (party sites) |
* Web instituce (institution sites) | * Web instituce (institution sites) |
| |
| ==== duplicate ==== |
| |
| The ''text.duplicate'' attribute (available only in Generation 2) indicates whether a text is a duplicate of another text in the corpus. This happens quite often in online media, because news items are taken over between news agencies and individual portals. To avoid the bias introduced by such duplicates, we can use a ''within'' condition (e.g., ''%%[word="round"] within <text duplicate!="no" />%%''), which ensures that duplicate texts appear in the result only once. |
| |
| |
| |
| |
<WRAP round tip 70%> | <WRAP round tip 70%> |
Cvrček, V. – Procházka, P.: //ONLINE_NOW: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [cit. YYYY-MM-DD((Concrete day in the year-month-day format, e.g. 2020-10-02.))]. Available from: http://www.korpus.cz | Cvrček, V. – Jeziorský, T. – Henyš, J.: //ONLINE2_NOW: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2022 [cit. YYYY-MM-DD((Concrete day in the year-month-day format, e.g. 2020-10-02.))]. Available from: http://www.korpus.cz |
| |
Cvrček, V. – Procházka, P.: //ONLINE_ARCHIVE: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [accessed YYYY-MM-DD]. Available from: http://www.korpus.cz | Cvrček, V. – Jeziorský, T. – Henyš, J.: //ONLINE2_ARCHIVE: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2022 [accessed YYYY-MM-DD]. Available from: http://www.korpus.cz |
</WRAP> | </WRAP> |
| |