Differences

This shows you the differences between two versions of the page.

en:cnk:uvod [2025/07/16 21:56] – [Corpora of the Czech National Corpus project] michalkren
en:cnk:uvod [2026/01/23 10:19] (current) michalkren
Line 7: Line 7:
 ^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  released((For versioned corpora (e.g. [[en:cnk:syn|SYN]] or [[en:cnk:intercorp|InterCorp]]), the year when the first version was released is also stated.))  ^ characteristic features ^
 | **General corpora** ||||||
-| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze13|version 13]]) |  5.3G |  ✓  |  ✓  |  2010–2024  | versioned corpus, unification of all the SYN-series synchronic written corpora |
+| [[en:cnk:syn|SYN]] ([[en:cnk:syn:verze14|version 14]]) |  5.5G |  ✓  |  ✓  |  2010–2025  | versioned corpus, unification of all the SYN-series synchronic written corpora |
-| [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2014--2019 |
+| [[en:cnk:syn2025|SYN2025]] |  100M |  ✓  |  ✓  |  2025  | reference representative corpus, most of the texts are from 2020--2024 |
+| [[en:cnk:syn2020|SYN2020]] |  100M |  ✓  |  ✓  |  2020  | reference representative corpus, most of the texts are from 2015--2019 |
 | [[en:cnk:syn2015|SYN2015]] |  100M |  ✓  |  ✓  |  2015  | reference representative corpus, most of the texts are from 2010--2014, with new classification of texts |
 | [[en:cnk:syn2013PUB|SYN2013PUB]] |  935M |  ✓  |  ✓  |  2013  | reference corpus of newspapers and magazines from 2005--2009 |
Line 86: Line 87:
 | [[en:cnk:nkjp|NKJP_1M]] |  1M |  ✓  |  ✓  |  2018  | manually annotated one-million subcorpus of the National Corpus of Polish |
 | [[en:cnk:obc|OBC]] |  24M |  ✗  |  ✓  |  2021  | [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]], trial proceedings from 1720--1913 |
+^ <fs large>Corpora generated by large language models (LLMs)</fs> ^^^^^^
+^ corpus ^ size (word count) ^  lemmas  ^ morphological tags ^  year  ^ characteristic features ^
+| [[en:cnk:aibrown|AI Brown]] |  27M |  ✓  |  ✓  |  2025  | multi-genre corpus of English texts produced by LLMs |
+| [[en:cnk:aikoditex|AI Koditex]] |  21M |  ✓  |  ✓  |  2025  | multi-genre corpus of Czech texts produced by LLMs |
+