Differences

This shows you the differences between two versions of the page.

en:cnk:online:gen1 [2022/12/22 12:35] vaclavcvrcek
en:cnk:online:gen1 [2022/12/22 14:13] (current) – [ONLINE1 (1st generation)] vaclavcvrcek
Line 2: Line 2:
 ====== ONLINE1 (1st generation) ======
  
-ONLINE_NOW and ONLINE_ARCHIVE are two corpora which together form a monitor corpus of the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus covers the period from 2017 to the present. It was created at the CNC with the help of data kindly provided by the [[https://www.dataweps.com|Dataweps]] company.
-
-The two corpora differ in their extent and frequency of updates:
-  * **ONLINE_NOW** -- contains daily updates from the current month plus 6 preceding months; updated daily
-  * **ONLINE_ARCHIVE** -- contains data since Feb 2017 until the date when ONLINE_NOW begins; updated every month
+The monitor corpus ONLINE1 (1st generation) aims to map the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus covers the period from 2017 to March 2021. It was created at the CNC with the help of data kindly provided by the [[https://www.dataweps.com|Dataweps]] company. ONLINE1 is no longer updated; the subsequent period is covered by the [[en:cnk:online:gen2|2nd generation: ONLINE2]].
  
 <WRAP right 35%>
-^ <fs medium>Name</fs> ^^ <fs medium>ONLINE</fs> ^
-^ Size (as of Nov 2020) ^ Number of [[en:pojmy:token|tokens]] |  6.274 billion |
-^ ::: ^ Number of sentences <s> |  506.6 million |
-^ Additional information ^ [[en:pojmy:referencni|Reference]] |  NO |
-^ ::: ^ [[en:pojmy:reprezentativnost|Representative]] |  NO |
+^ <fs medium>Name</fs> ^^ <fs medium>ONLINE1</fs> ^
+^ Size ^ Number of tokens |  7.053 billion |
+^ ::: ^ Number of sentences <s> |  563 million |
+^ Additional information ^ Reference |  NO |
+^ ::: ^ Representative |  NO |
+^ ::: ^ Period covered |  1/2017 – 3/2021 |
 ^ ::: ^ Year of publication |  2020 |
 </WRAP>
  
-The ONLINE_NOW and ONLINE_ARCHIVE corpora are disjoint, i.e. they do not overlap. Therefore, to search the whole period since 2017, the results of queries on both corpora can simply be joined together; no manual corrections are needed. As both corpora are identical in their structure and annotation, the following description does not distinguish between them.
-
-==== Updates ====
-
-The key feature of the ONLINE corpora is regular updating. This means that their contents **change continually**, and it is thus not possible to return to previous versions of the corpora. Given that the input data (sources) can change, there is no guarantee that the structure and annotation of the ONLINE corpus will remain the same. If you need an invariable reference corpus for research into the specifics of internet communication, you can use the [[en:cnk:net|NET]] corpus.
-
-Updates of the ONLINE_NOW corpus take place **daily around 9:00 (CET)**, when the data from the previous day is added and published. The size of the updates varies (depending on the size of the downloaded material) from 4 to 8 million tokens. On the first day of every month, the oldest month of the ONLINE_NOW corpus is moved to ONLINE_ARCHIVE.
-
-Updates of the ONLINE_ARCHIVE corpus thus take place **every month**, when a whole month is removed from ONLINE_NOW and added to ONLINE_ARCHIVE (it is always the month that is half a year old).
-
-<fs smaller>
-For instance, on Aug 25, ONLINE_NOW contains data from Feb 1 until Aug 24 (inclusive), i.e. all the days of the current month except for the current day + 6 whole preceding months. ONLINE_ARCHIVE contains all the older data up until Jan 31, i.e. up to the date when ONLINE_NOW begins. A change will come on Sep 2, when the data from the whole of February will be moved from ONLINE_NOW to ONLINE_ARCHIVE, and subsequently, the updated ONLINE_NOW corpus will contain data from Mar 1 until Sep 1 (inclusive).
-</fs>
-
-
-===== Corpus structure =====
-
-Compared to the [[en:cnk:syn|SYN-series]] corpora of written Czech, the ONLINE corpus has several specific features. The data come from several sources (cf. the ''source'' attribute):
+===== Corpus structure =====
+
+Compared to the [[en:cnk:syn|SYN-series]] corpora of written Czech, the ONLINE1 corpus has several specific features. The data come from several sources (cf. the ''source'' attribute):
   * **news** -- internet news
   * **facebook** -- posts, including comments (the collection of Facebook data has been discontinued since December 2020)
Line 90: Line 75:
  
 <WRAP round tip 70%>
-Cvrček, V. – Procházka, P.: //ONLINE_NOW: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [accessed YYYY-MM-DD((Specific date in the year-month-day format, e.g. 2020-10-02.))]. Available from: http://www.korpus.cz
-
-Cvrček, V. – Procházka, P.: //ONLINE_ARCHIVE: monitorovací korpus internetové češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020 [accessed YYYY-MM-DD]. Available from: http://www.korpus.cz
+Cvrček, V. – Procházka, P.: //ONLINE1: monitoring corpus of online Czech//. Ústav Českého národního korpusu FF UK, Praha 2020. Available from: http://www.korpus.cz
 </WRAP>