


NET Corpus

Name: NET

Positions
    Number of positions (tokens): 51 733 873
    Number of word forms: 1 245 717
    Number of lemmas: 750 650

Structures
    Number of documents <doc>: 1 279
    Number of texts <text>: 267 026
    Number of paragraphs <p>: 267 026
    Number of sentences <s>: 2 622 636

Further information
    Reference: NO
    Representative: NO
    Year of publication: 2020

The NET corpus is the first version of a synchronic corpus of Czech semi-official internet communication. The corpus is not representative in any way and currently consists of two parts: discussion forums and blogs. Data coverage will increase in future versions of NET. One of the aims of NET is to map selected areas of internet communication, so NET tries to capture each selected domain from its beginning. It will also follow the domain's later content, which will be included in future versions of the corpus, so that NET can capture change over time.

Discussion forums

This part of the corpus focuses on discussion forums running on the phpBB platform. For the time being, NET includes neither comments and discussions under published news articles nor social network data. The phpBB forums were sampled randomly, and the sample size is planned to increase in the future.

Personal blogs

Personal blogs were downloaded mostly from news servers and web magazines, where they often form a supplementary part of the main website. The NET corpus includes no corporate or other formal blogs.

How to cite

Jeziorský, T.: NET: korpus polooficiální internetové komunikace. Ústav Českého národního korpusu FF UK, Praha 2019. Available from: https://www.korpus.cz.