Differences
Here you can see the differences between the selected revision and the current version of this page.
| cnk:aikoditex [2025/06/27 12:34] – [AI-Koditex] annamarklova | cnk:aikoditex [2025/10/13 14:08] (current) – [How to cite AI-Koditex] jirimilicka | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| AI-Koditex is a generated, annotated, multi-genre corpus of Czech texts produced by large language models (LLMs). | AI-Koditex is a generated, annotated, multi-genre corpus of Czech texts produced by large language models (LLMs). | ||
| - | |||
| <WRAP right 35%> | <WRAP right 35%> | ||
| - | ^ <fs medium> | + | ^ <fs medium> |
| - | ^ Positions ^ Number of positions (tokens) | 24 586 730 | | + | ^ Positions ^ Number of positions (tokens) | 24 030 795 | |
| - | ^ ::: ^ Number of positions (excl. punctuation) | 20 658 912 | | + | ^ ::: ^ Number of positions (excl. punctuation) | 20 180 737 | |
| - | ^ ::: ^ Number of word forms (excl. punctuation) | | + | ^ ::: ^ Number of word forms (excl. punctuation) | |
| - | ^ ::: ^ Number of lemmas (excl. punctuation) | 223 927 | | + | ^ ::: ^ Number of lemmas (excl. punctuation) | 223 122 |
| - | ^ Further information ^ Number of subcorpora | | + | ^ Further information ^ Number of subcorpora | |
| ^ ::: ^ Number of models | 15 | | ^ ::: ^ Number of models | 15 | | ||
| ^ ::: ^ Publication year | 2025 | | ^ ::: ^ Publication year | 2025 | | ||
| Line 26: | Line 25: | ||
| * Prompt portion: The first 500 words (including punctuation) served as generation prompts | * Prompt portion: The first 500 words (including punctuation) served as generation prompts | ||
| - | * Reference portion: The remaining text (approximately 1,500 words) | + | * Reference portion: The remaining text (approximately 1,500 words) |
| This segmentation strategy ensured that models received sufficient context for generation while leaving substantial reference text for comparative analysis. The 500-word prompt also left sufficient room in the context window even for older models (davinci-002 has a maximum context of 2,049 tokens, while 500 English words take about 670 tokens). | This segmentation strategy ensured that models received sufficient context for generation while leaving substantial reference text for comparative analysis. The 500-word prompt also left sufficient room in the context window even for older models (davinci-002 has a maximum context of 2,049 tokens, while 500 English words take about 670 tokens). | ||
| Line 38: | Line 37: | ||
| For base models operating in completion mode (davinci-002, | For base models operating in completion mode (davinci-002, | ||
| - | For instruction-tuned models, we employed minimal system prompts requesting a long continuation of the given text. Without such prompts, models' | + | For instruction-tuned models, we employed minimal system prompts requesting a long continuation of the given text. Without such prompts, models' |
| - | We used the following system prompt: Please continue the Czech text in the same language, manner and style, ensuring it contains at least five thousand words. The text does not need to be factually correct, but please make sure it fits stylistically. | + | We used the following system prompt: |
| To ensure reproducibility, | To ensure reproducibility, | ||
| Line 56: | Line 55: | ||
| ==== How to cite AI-Koditex ==== | ==== How to cite AI-Koditex ==== | ||
| - | |||
| <WRAP round tip 70%> | <WRAP round tip 70%> | ||
| - | Milička, J. – Marklová, A. – Cvrček, V.// AI-Koditex //. Department of Linguistics, | + | Milička, J. – Marklová, A. – Cvrček, V. (2025): //AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts//. arXiv preprint: [[https:// |
| + | |||
| + | |||
| + | Milička, J. – Marklová, A. – Cvrček, V.: // | ||
| </ | </ | ||
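
The prompt/reference segmentation described in the page text (first 500 words, punctuation included, as the prompt; the remainder as reference) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tokenization rule and the names `TOKEN_RE`, `split_prompt_reference` and `remaining_context` are assumptions made only for this example. The second helper simply restates the context-window arithmetic quoted above (2,049 tokens minus roughly 670 prompt tokens).

```python
import re

# Assumption: a "word" is a run of word characters or one punctuation mark.
# The tokenizer actually used to build the corpus may differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_prompt_reference(text: str, prompt_len: int = 500) -> tuple[str, str]:
    """Cut `text` after its first `prompt_len` tokens (punctuation counts)."""
    matches = list(TOKEN_RE.finditer(text))
    if len(matches) <= prompt_len:
        # Text is too short to leave any reference portion.
        return text, ""
    cut = matches[prompt_len - 1].end()  # character offset after the last prompt token
    return text[:cut], text[cut:].lstrip()


def remaining_context(max_context: int, prompt_tokens: int) -> int:
    """Model tokens left for the generated continuation."""
    return max_context - prompt_tokens


if __name__ == "__main__":
    sample = "Ahoj, světe! " * 300  # hypothetical stand-in for a source text
    prompt, reference = split_prompt_reference(sample)
    print(len(TOKEN_RE.findall(prompt)), len(TOKEN_RE.findall(reference)))
    print(remaining_context(2049, 670))
```

Cutting at a character offset rather than re-joining tokens keeps the original spacing of both portions intact, which matters if the reference portion is later compared with generated text.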