Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2025/03/17 16:56] – [Corpora of the Czech National Corpus project] michalkren
en:cnk:uvod [2025/10/03 18:18] (current) – [Corpora of the Czech National Corpus project] michalkren
Line 48:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:orator|ORATOR]] (version 2) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
+| [[en:cnk:orator|ORATOR]] (version 3) |  1.2M |  ✓  |  ✓  |  2019  | reference corpus of monologues with one-layer transcription |
 | [[en:cnk:ortofon|ORTOFON]] (version 3) |  2.4M |  ✓  |  ✓  |  2017  | reference representative corpus of informal spoken Czech with two-layer transcription (covers Bohemia, Moravia and Silesia) |
 | [[en:cnk:oral|ORAL]] (version 1) |  5,4M |  ✓  |  ✓  |  2017  | reference corpus of informal spoken Czech (covers Bohemia, Moravia and Silesia) |
Line 65:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
 | [[en:cnk:diakorp|DIAKORP]] (version 6) |  3.4M |  ✗  |  ✗  |  2005  | versioned corpus of the diachronic section of the CNC |
-| [[en:cnk:onomos|OnomOs]] |  200k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
+| [[en:cnk:onomos|OnomOs]] (version 2) |  400k |  ✓  |  ✓  |  2023  | corpus of selected issues of the (Rudé) Právo newspaper with named entity annotation |
 ^ <fs large>Foreign language corpora</fs> ^^^^^^
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
Line 86:
 | [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
 | [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |
+^ <fs large>Corpora generated by large language models (LLMs)</fs> ^^^^^^
+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
+| [[en:cnk:aibrown|AI Brown]] |  27M |  ✓  |  ✓  |  2025  | multi-genre corpus of English texts produced by LLMs |
+| [[en:cnk:aikoditex|AI Koditex]] |  21M |  ✓  |  ✓  |  2025  | multi-genre corpus of Czech texts produced by LLMs |
+