en:cnk:online:gen2 [2022/12/22 14:23] – [Corpus structure] vaclavcvrcek | en:cnk:online:gen2 [2023/01/12 09:46] (current) – vaclavcvrcek |
---|
====== ONLINE2 (2nd generation) ====== | ====== ONLINE2 (2nd generation) ====== |
| |
ONLINE2_NOW and ONLINE2_ARCHIVE are two corpora which together form a monitor corpus ([[en:cnk:online|ONLINE]]) of the dynamic content of the Czech web, i.e. internet journalism. The corpus covers the period from April 2021 to the present. It was created at the CNC using data kindly provided by the [[https://monitora.cz/|Mopnitora]] company. | ONLINE2_NOW and ONLINE2_ARCHIVE are two corpora which together form a monitor corpus ([[en:cnk:online|ONLINE]]) of the dynamic content of the Czech web, i.e. internet journalism. The corpus covers the period from April 2021 to the present. It was created at the CNC using data kindly provided by the [[https://monitora.cz/|Monitora]] company. |
| |
The two corpora differ in their extent and update frequency: | The two corpora differ in their extent and update frequency: |
The text classification of the ONLINE corpora is based on [[en:cnk:klasifikace_textu_syn2015|the classification designed for SYN2015]], enriched with some additional attributes. Common attributes: [[en:seznamy:txtype_group|txtype_group]], [[en:seznamy:txtype|txtype]], [[en:seznamy:genre_group|genre_group]], [[en:seznamy:genre|genre]], [[en:seznamy:med|medium]], pubyear (publication year). Additional attributes: date (publication date), source, resource, resource_url, media_type and subject (text title). | The text classification of the ONLINE corpora is based on [[en:cnk:klasifikace_textu_syn2015|the classification designed for SYN2015]], enriched with some additional attributes. Common attributes: [[en:seznamy:txtype_group|txtype_group]], [[en:seznamy:txtype|txtype]], [[en:seznamy:genre_group|genre_group]], [[en:seznamy:genre|genre]], [[en:seznamy:med|medium]], pubyear (publication year). Additional attributes: date (publication date), source, resource, resource_url, media_type and subject (text title). |
| |
==== source ==== | |
| |
Source of the data -- general classification that distinguishes news from discussions and social networks (see above). | |
| |
==== resource ==== | ==== resource ==== |
| |
A more detailed specification of the source, typically a web portal (the concrete URL is given in ''resource_url'', which is available as an attribute of the ''text'' structures). The value of ''resource'' depends on the ''source'': | A more detailed specification of the source, typically a web portal; the concrete URL is given in ''text_url'', which is available as an attribute of the ''text'' structures. |
| |
* in the case of //news// (within ''<doc>''): original web portal or its part, e.g. //blesk-cz//, //seznamzpravy// etc. | |
* in the case of //social networks// (within ''<text>''): author (possibly a username) of the individual post or comment | |
* in the case of //discussions// (within ''<text>''): original news portal of the discussion, e.g. //novinky//, //zpravy.aktualne-cz// | |
* in the case of //web forums// (within ''<text>''): original web forum, e.g. //diskuze.modnipeklo-cz//, //emimino// | |
| |
| |
==== media_type ==== | ==== media_type ==== |
| |
The ''media_type'' attribute is relevant only for web news (source: ''news''); it classifies them according to the typology developed by the team of J. Šlerka within the [[http://www.mapamedii.cz|Media map]] project. The classification is based on readers' preferences and groups together news portals with similar audiences (see [[http://www.mapamedii.cz/mapa/typologie/index.php|detailed description of the method]]). For the ONLINE corpus, the original classification has been extended with some rather marginal categories, and it distinguishes the following types: | The ''media_type'' attribute is relevant only for web news (source: ''news''); it classifies them according to the typology developed by the team of J. Šlerka within the [[http://www.mapamedii.cz|Media map]] project. The classification is based on readers' preferences and groups together news portals with similar audiences. For the ONLINE corpus, the original classification has been extended with some rather marginal categories, and it distinguishes the following types: |
| |
* Analyticko-investigativní (analytical-investigative) | * Analyticko-investigativní (analytical-investigative) |
* Stranické weby (party sites) | * Stranické weby (party sites) |
* Web instituce (institution sites) | * Web instituce (institution sites) |
| |
| ==== duplicate ==== |
| |
| The ''text.duplicate'' attribute (available only in Generation 2) indicates whether a text is a duplicate of another text in the corpus. This happens quite often in online media, because news items are taken over between news agencies and individual portals. To avoid the bias introduced by such duplicates, we can use a ''within'' condition (e.g., ''%%[word="round"] within <text duplicate!="no" />%%''), so that duplicate texts appear in the result only once. |
| |
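The ''within'' condition described for the ''duplicate'' attribute can be composed programmatically when many queries need the same restriction. The following is only a minimal sketch: the helper name ''within_text'' and its parameters are hypothetical and not part of any CNC tool, and the example simply reproduces the query string quoted in the text without validating it against the corpus.

```python
def within_text(query: str, attribute: str, operator: str, value: str) -> str:
    """Append a `within <text ... />` restriction to a CQL query string.

    Hypothetical helper for illustration only; it performs plain string
    composition and does not check the query against any corpus.
    """
    if operator not in ("=", "!="):
        raise ValueError("operator must be '=' or '!='")
    return f'{query} within <text {attribute}{operator}"{value}" />'


# Reproduces the example query quoted in the documentation.
example = within_text('[word="round"]', "duplicate", "!=", "no")
print(example)
```

The resulting string can be pasted into the query field of the corpus interface; the same helper could restrict a query by any other attribute of the ''text'' structure.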
| |
| |
| |
<WRAP round tip 70%> | <WRAP round tip 70%> |
Cvrček, V. – Procházka, P.: //ONLINE_NOW: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [cit. YYYY-MM-DD((Concrete day in the year-month-day format, e.g. 2020-10-02.))]. Available from: http://www.korpus.cz | Cvrček, V. – Jeziorský, T. – Henyš, J.: //ONLINE2_NOW: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2022 [cit. YYYY-MM-DD((Concrete day in the year-month-day format, e.g. 2020-10-02.))]. Available from: http://www.korpus.cz |
| |
Cvrček, V. – Procházka, P.: //ONLINE_ARCHIVE: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [accessed YYYY-MM-DD]. Available from: http://www.korpus.cz | Cvrček, V. – Jeziorský, T. – Henyš, J.: //ONLINE2_ARCHIVE: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2022 [accessed YYYY-MM-DD]. Available from: http://www.korpus.cz |
</WRAP> | </WRAP> |
| |