cnk:aikoditex [2025/06/27 12:14] – [Corpus preparation] annamarklova → cnk:aikoditex [2025/10/13 14:08] (current) – [How to cite AI-Koditex] jirimilicka
~~NOTOC~~
====== AI-Koditex ======

AI-Koditex is a generated, annotated, multi-genre corpus of Czech texts produced by large language models (LLMs).

<WRAP right 35%>
^ <fs medium>Name</fs> ^^ <fs medium>AI-Koditex v1</fs> ^
^ Positions ^ Number of positions (tokens) |  24 030 795 |
^ ::: ^ Number of positions (excl. punctuation) |  20 180 737 |
^ ::: ^ Number of word forms (excl. punctuation) |  371 655 |
^ ::: ^ Number of lemmas (excl. punctuation) |  223 122 |
^ Further information ^ Number of subcorpora |  30 |
^ ::: ^ Number of models |  15 |
^ ::: ^ Publication year |  2025 |
</WRAP>
Modeled on the original [[en:cnk:koditex|Koditex Corpus]], AI-Koditex was created to mirror its structure, genre balance, and linguistic diversity, enabling direct comparisons between human- and machine-generated Czech texts. The corpus includes texts generated by 13 frontier LLMs from major developers (OpenAI, Anthropic, Meta, Alphabet, DeepSeek), each prompted using material from the original Koditex corpus to preserve consistency in genre and topic distribution. Like its human-authored counterpart, AI-Koditex is a mixed corpus encompassing a broad genre range. All texts are tokenized, lemmatized, and morphologically and syntactically annotated following the Universal Dependencies standard, and are released in both plain-text and CoNLL-U formats. AI-Koditex is a large-scale Czech LLM-generated resource, developed to support cross-linguistic and cross-model research in AI-generated language.
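The CoNLL-U format mentioned above stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). As a minimal illustration, the following sketch parses one such token line; the sample sentence //Pes spí.// ("The dog sleeps.") is invented for illustration and is not taken from the corpus itself:

```python
# An invented CoNLL-U token line for the sentence "Pes spí." ("The dog sleeps.");
# it is illustrative only and does not come from AI-Koditex.
SAMPLE = "\t".join([
    "1", "Pes", "pes", "NOUN", "_",
    "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing",
    "2", "nsubj", "_", "_",
])

# The ten standard CoNLL-U columns, in order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into a column-name -> value mapping."""
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == len(COLUMNS), "a token line has exactly 10 fields"
    return dict(zip(COLUMNS, fields))

token = parse_token_line(SAMPLE)
print(token["LEMMA"], token["UPOS"], token["DEPREL"])  # pes NOUN nsubj
```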
  
  
  * Prompt portion: The first 500 words (including punctuation) served as generation prompts

  * Reference portion: The remaining text (approximately 1,500 words) can provide human-authored comparison material

This segmentation strategy ensured that models received sufficient context for generation while maintaining substantial reference text for comparative analysis. The 500-word context also left sufficient space in the context window even for older models (davinci-002 has a maximum context of 2049 tokens, while 500 English words take about 670 tokens).
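The split described above can be sketched as follows; this is a hypothetical illustration, not the corpus's actual tooling, and the plain whitespace-based word count is an assumption (the corpus's own counting, which includes punctuation among the 500 words, may differ in detail):

```python
def split_source_text(text: str, prompt_words: int = 500) -> tuple[str, str]:
    """Split a source text into the prompt portion (first 500 words)
    and the remaining human-authored reference portion.

    A plain whitespace split is used here as an approximation; the
    corpus counts punctuation marks among the 500 "words".
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:])
    return prompt, reference

# Example: a 2,000-word dummy text yields a 500-word prompt portion
# and a 1,500-word reference portion.
dummy = " ".join(f"slovo{i}" for i in range(2000))
prompt, reference = split_source_text(dummy)
print(len(prompt.split()), len(reference.split()))  # 500 1500
```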
For base models operating in completion mode (davinci-002, GPT-3.5-turbo, Meta-Llama-3.1-405B), we used only the first portion of each source text as input, allowing the models to function as traditional language models for text prediction.
  
For instruction-tuned models, we employed minimal system prompts requesting a long continuation of the given text. Without such prompts, the models' default //helpful assistant// persona emerged, which typically attempted to analyze, summarize, or answer questions about the source text rather than continuing it. Language-specific challenges emerged during Czech text generation. Some models refused to cooperate when given Czech system prompts, necessitating the English system prompt quoted below. Despite explicit instructions to generate Czech text, several models sometimes produced English or mixed-language outputs.
  
We used the following system prompt: //Please continue the Czech text in the same language, manner and style, ensuring it contains at least five thousand words. The text does not need to be factually correct, but please make sure it fits stylistically.//
  
To ensure reproducibility, we used random seed 42 for all OpenAI API calls. Unfortunately, other providers do not offer comparable deterministic generation options. Llama generations used 16-bit floating-point quantization (the highest available quality).
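As an illustrative sketch only, the two request shapes described above might be assembled as follows. The helper functions and default model names are assumptions, not the project's actual code; the ''seed'' parameter follows the OpenAI chat completions API, and the system prompt text is quoted from above:

```python
# System prompt quoted from the corpus documentation above.
SYSTEM_PROMPT = (
    "Please continue the Czech text in the same language, manner and style, "
    "ensuring it contains at least five thousand words. The text does not "
    "need to be factually correct, but please make sure it fits stylistically."
)

def build_chat_request(prompt_text: str, model: str = "gpt-4o") -> dict:
    """Assemble a request body for an instruction-tuned (chat) model.

    seed=42 matches the reproducibility setting described in the text;
    it applies to OpenAI calls, as other providers lack an equivalent.
    """
    return {
        "model": model,
        "seed": 42,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt_text},
        ],
    }

def build_completion_request(prompt_text: str, model: str = "davinci-002") -> dict:
    """Assemble a request body for a base model in completion mode:
    the prompt portion is passed as-is, with no system prompt."""
    return {"model": model, "prompt": prompt_text}
```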
  
==== How to cite AI-Koditex ====
<WRAP round tip 70%>
Milička, J. – Marklová, A. – Cvrček, V. (2025): //AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts//. arXiv preprint: [[https://arxiv.org/abs/2509.22996]]

Milička, J. – Marklová, A. – Cvrček, V.: //AI-Koditex, version 1, 1. 7. 2025//. Department of Linguistics, Faculty of Arts, Charles University, Prague 2025. Available at WWW: www.korpus.cz
</WRAP>