Differences
This shows you the differences between two versions of the page.
en:cnk:online:gen2 [2022/12/22 15:29] – [media_type] vaclavcvrcek | en:cnk:online:gen2 [2023/01/12 09:46] (current) – vaclavcvrcek |
---|
====== ONLINE2 (2nd generation) ====== | ====== ONLINE2 (2nd generation) ====== |
| |
ONLINE2_NOW and ONLINE2_ARCHIVE are two corpora that together form a monitor corpus ([[en:cnk:online|ONLINE]]) of the dynamic content of the Czech web, i.e. internet journalism. The corpus covers the period from April 2021 to the present. It was created at the CNC using data kindly provided by the [[https://monitora.cz/|Mopnitora]] company. | ONLINE2_NOW and ONLINE2_ARCHIVE are two corpora that together form a monitor corpus ([[en:cnk:online|ONLINE]]) of the dynamic content of the Czech web, i.e. internet journalism. The corpus covers the period from April 2021 to the present. It was created at the CNC using data kindly provided by the [[https://monitora.cz/|Monitora]] company. |
| |
The two corpora differ in their extent and in how often they are updated: | The two corpora differ in their extent and in how often they are updated: |
==== duplicate ==== | ==== duplicate ==== |
| |
The ''text.duplicate'' attribute (available only in the 2nd generation) indicates whether a text is a duplicate of another text in the corpus. This happens fairly often with data of this type because news items are passed between news agencies and individual titles. To avoid the distortion caused by such duplicate texts, we can use a query with a [[pojmy:within|within]] condition, which guarantees that duplicate texts appear in the result only once. | The ''text.duplicate'' attribute (available only in Generation 2) indicates whether a text is a duplicate of another text in the corpus. This happens quite often with online media because news items are taken over between news agencies and individual portals. To avoid the bias introduced by such duplicates, we can use a ''within'' condition (e.g., ''%%[word="round"] within <text duplicate!="no" />%%''), which ensures that duplicate texts appear in the result only once. |
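When the same restriction has to be applied to many search words, the query string can be assembled programmatically. The following is only a minimal sketch: the helper name ''within_duplicate_filter'' is hypothetical and not part of any CNC tool, and the structural condition is copied verbatim from the example query on this page rather than derived from the corpus documentation. Words containing a double quote are rejected because no assumption is made here about how quotes are escaped in the query language.

```python
def within_duplicate_filter(word: str) -> str:
    """Build a query for `word` restricted by the text.duplicate attribute.

    Hypothetical helper for illustration only. The `within` condition is
    copied verbatim from the example on this page.
    """
    # Refuse input that would break out of the quoted attribute value;
    # escaping rules are deliberately not assumed here.
    if '"' in word:
        raise ValueError("word must not contain a double quote")
    return f'[word="{word}"] within <text duplicate!="no" />'


if __name__ == "__main__":
    # Build the same query as the example on this page, plus one more word.
    for w in ("round", "square"):
        print(within_duplicate_filter(w))
```

The resulting string can then be pasted into the query field of whichever corpus interface is in use; the helper itself does not contact any service.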
| |
| |