| en:cnk:uvod [2025/03/17 16:56] – [Corpora of the Czech National Corpus project] michalkren | en:cnk:uvod [2025/10/03 18:18] (current) – [Corpora of the Czech National Corpus project] michalkren |
|---|
| ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| | **General corpora** |||||| | | **General corpora** |||||| |
| | [[en:cnk:orator|ORATOR]] (version 2) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | | | [[en:cnk:orator|ORATOR]] (version 3) | 1.2M | ✓ | ✓ | 2019 | reference corpus of monologues with one-layer transcription | |
| | [[en:cnk:ortofon|ORTOFON]] (version 3) | 2.4M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:ortofon|ORTOFON]] (version 3) | 2.4M | ✓ | ✓ | 2017 | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) | |
| | [[en:cnk:oral|ORAL]] (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | | | [[en:cnk:oral|ORAL]] (version 1) | 5.4M | ✓ | ✓ | 2017 | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) | |
| ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| | [[en:cnk:diakorp|DIAKORP]] (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC | | | [[en:cnk:diakorp|DIAKORP]] (version 6) | 3.4M | ✗ | ✗ | 2005 | versioned corpus of the diachronic section of the CNC | |
| | [[en:cnk:onomos|OnomOs]] | 200k | ✓ | ✓ | 2023 | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation | | | [[en:cnk:onomos|OnomOs]] (version 2) | 400k | ✓ | ✓ | 2023 | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation | |
| ^ <fs large>Foreign language corpora</fs> ^^^^^^ | ^ <fs large>Foreign language corpora</fs> ^^^^^^ |
| ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| | [[en:cnk:nkjp|NKJP_1M]] | 1M | ✓ | ✓ | 2018 | manually annotated one-million subcorpus of the National Corpus of Polish | | | [[en:cnk:nkjp|NKJP_1M]] | 1M | ✓ | ✓ | 2018 | manually annotated one-million subcorpus of the National Corpus of Polish | |
| | [[en:cnk:obc|OBC]] | 24M | ✗ | ✓ | 2021 | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 | | | [[en:cnk:obc|OBC]] | 24M | ✗ | ✓ | 2021 | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 | |
| | ^ <fs large>Corpora generated by large language models (LLMs)</fs> ^^^^^^ |
| | ^ corpus ^ size (word count) ^ lemmas ^ morphological tags ^ year ^ characteristic features ^ |
| | | [[en:cnk:aibrown|AI Brown]] | 27M | ✓ | ✓ | 2025 | multi-genre corpus of English texts produced by LLMs | |
| | | [[en:cnk:aikoditex|AI Koditex]] | 21M | ✓ | ✓ | 2025 | multi-genre corpus of Czech texts produced by LLMs | |