ONLINE corpora

Together, the ONLINE corpora form a monitor corpus of the dynamic content of the Czech web, i.e. predominantly internet journalism and, to some extent, discussions, forums and social networks. The corpus covers the period from 2017 to the present.

The key feature of the ONLINE corpora is that they are updated regularly. Their contents therefore change continually, and it is not possible to return to previous versions of the corpora. Because the input data (sources) can change, there is no guarantee that the structure or the annotation of the ONLINE corpora will remain the same. If you need a fixed reference corpus for research on the specifics of internet communication, you can use the NET corpus.

The corpora are annotated with the standard tools for morphological analysis and lemmatization used for the SYN-series corpora. The annotation is thus comparable with, e.g., the SYN2015 corpus.

Generations of ONLINE corpora

There are two generations of ONLINE corpora:

| Generation | Corpus name | Period covered | Composition | Year of publication |
|---|---|---|---|---|
| 1. | ONLINE1 | January 2017 – March 2021 | online journalism, social media, discussions, forums | 2020 |
| 2. | ONLINE2_NOW, ONLINE2_ARCHIVE | April 2021 – present | online journalism | 2022 |

The ONLINE corpora are disjoint, i.e. they share no texts. To search the whole period since 2017, the results of queries on both generations can therefore simply be combined; no manual corrections are needed. As both corpora are identical in structure and annotation, the following description does not distinguish between them.
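Because the corpora do not overlap, combining results amounts to adding up absolute frequencies. The following is a minimal sketch of that step, assuming the frequency lists have already been exported from each corpus as simple word-to-count mappings. The function names, the example words and all numbers are hypothetical and are not part of any corpus tool. Note that relative frequencies (instances per million) should be recomputed from the summed counts and the summed corpus sizes rather than added together.

```python
from collections import Counter


def merge_frequencies(freq_a, freq_b):
    """Sum absolute frequencies from two non-overlapping corpora."""
    merged = Counter(freq_a)
    merged.update(freq_b)  # Counter.update adds counts instead of replacing them
    return dict(merged)


def combined_ipm(abs_freq, size_a, size_b):
    """Instances per million tokens over the combined size of both corpora."""
    return abs_freq / (size_a + size_b) * 1_000_000


# Hypothetical exported frequency lists and corpus sizes (illustration only)
online1_freq = {"covid": 1200, "rouška": 300}
online2_freq = {"covid": 800, "inflace": 500}
online1_size = 3_000_000
online2_size = 1_000_000

merged = merge_frequencies(online1_freq, online2_freq)
covid_ipm = combined_ipm(merged["covid"], online1_size, online2_size)
```

With these made-up inputs, the merged count for "covid" is 2000 and its combined relative frequency is 500 instances per million.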

Note on backwards compatibility:

Saved queries on the 1st generation ONLINE corpora (i.e. ONLINE_NOW and ONLINE_ARCHIVE) may not work after the 2nd generation is published (among other things, because the corpus names have changed). However, the ONLINE1 corpus contains all the texts of this previous generation, so repeating the queries on it should yield the same results.