~~NOTOC~~
====== AI-Brown ======
<WRAP right 40%>
^ <fs medium>AI-Brown</fs> ^^^
^ Positions ^ Number of positions (tokens) | 27 661 454 |
^ ::: ^ Number of positions (excl. punctuation) | 23 975 982 |
===== Corpus preparation =====

The preprocessing pipeline for the original reference BE21 corpus included several steps to prepare the data for prompt-based generation. Clean texts and metadata were extracted from the verticals, and structural tags were aligned with the Czech corpus format to ensure cross-linguistic consistency.
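
For illustration, extraction of plain text from the verticals can be sketched as follows, assuming a standard vertical format (one token per line, tab-separated columns, structural tags on their own lines); the column layout of the actual BE21 verticals may differ:

<code python>
import re

def vertical_to_text(path):
    """Extract plain text from a corpus vertical file.

    Assumes the word form is in the first tab-separated column and that
    structural tags (<doc>, <s>, ...) stand on their own lines.
    """
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or re.match(r"^</?\w", line):  # skip structural tags
                continue
            words.append(line.split("\t")[0])          # word form column
    return " ".join(words)
</code>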
| + | |||
| + | Each BE21 text sample was split into two parts to support controlled generation: | ||
| + | |||
| + | * Prompt portion: The first 500 words (including punctuation) served as generation prompts | ||
| + | |||
| + | * Reference portion: The remaining text (approximately 1,500 words) provided human-authored comparison material | ||
| + | |||
| + | This segmentation ensured that models received sufficient context for meaningful text generation while preserving a substantial portion of reference text for evaluation. A 500-word prompt fits comfortably within the input limits of older models (e.g., davinci-002, | ||
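
A simple sketch of this split (word counting by whitespace is an assumption here; punctuation tokens may have been counted differently in practice):

<code python>
def split_sample(text, prompt_words=500):
    """Split a text into a prompt portion and a reference portion."""
    tokens = text.split()                        # crude whitespace tokenization
    prompt = " ".join(tokens[:prompt_words])     # first 500 words -> prompt
    reference = " ".join(tokens[prompt_words:])  # remainder -> human reference
    return prompt, reference
</code>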
| + | |||
| + | To maintain comparability with the Czech AI corpus [[cnk: | ||
| + | |||
| + | ===== Generating corpora ===== | ||
| + | |||
| + | For each model, we generated two versions of each corpus: one using temperature 0 (deterministic generation) and one using temperature 1 (stochastic generation). However, we encountered mode collapse with the oldest model (davinci-002) at zero temperature, | ||
| + | |||
| + | For base models operating in completion mode (davinci-002, | ||
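
A rough sketch of the two generation modes, assuming the OpenAI Python client (v1); the model names, token limit, and the wording of the instruction for chat models are illustrative assumptions, not the exact settings used for AI-Brown:

<code python>
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_completion(prompt, temperature):
    """Base models (e.g. davinci-002): the prompt is continued directly."""
    return client.completions.create(
        model="davinci-002",          # illustrative model name
        prompt=prompt,
        max_tokens=1500,              # illustrative limit
        temperature=temperature,      # 0 = deterministic, 1 = stochastic
        logprobs=5,                   # keep alternative tokens when available
    )

def generate_chat(prompt, temperature):
    """Instruction-tuned models: the prompt is wrapped in an instruction."""
    return client.chat.completions.create(
        model="gpt-4o",               # illustrative model name
        messages=[{"role": "user",
                   "content": "Continue the following text:\n\n" + prompt}],
        max_tokens=1500,
        temperature=temperature,
        logprobs=True,
        top_logprobs=5,
    )
</code>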
| + | |||
| + | To ensure reproducibility, | ||
| + | |||
| + | API responses were preserved in their entirety, including token probabilities and alternative tokens when available, to enable future analysis of generation uncertainty and model confidence. | ||
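
As an illustration of how such responses can be preserved, assuming the OpenAI Python client's response objects (pydantic models); the file naming and directory layout are assumptions, not the project's actual storage scheme:

<code python>
import json
from pathlib import Path

def save_response(resp, sample_id, out_dir="responses"):
    """Store the complete API response (including logprobs) as JSON."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_file = Path(out_dir) / f"{sample_id}.json"
    with open(out_file, "w", encoding="utf-8") as f:
        # openai-python v1 response objects support model_dump()
        json.dump(resp.model_dump(), f, ensure_ascii=False, indent=2)
</code>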
| + | |||
| + | ===== Post-processing ===== | ||
| + | |||
| + | Texts that were too short were removed. For instruction-tuned models, the original phrases such as " | ||
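
A minimal sketch of this filtering step; the word-count threshold and the list of introductory phrases are illustrative placeholders rather than the actual values used:

<code python>
MIN_WORDS = 200                          # illustrative threshold
INTRO_PHRASES = [                        # illustrative boilerplate openers
    "Sure, here is the continuation",
    "Here is the continuation of the text",
]

def clean_generation(text):
    """Strip boilerplate openers and drop too-short outputs (returns None)."""
    for phrase in INTRO_PHRASES:
        if text.lstrip().startswith(phrase):
            # remove the opener up to the end of its line
            text = text.split("\n", 1)[1] if "\n" in text else ""
            break
    if len(text.split()) < MIN_WORDS:
        return None
    return text.strip()
</code>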
| + | |||
| + | ===== Annotation ===== | ||
| + | |||
| + | We used Universal Dependencies for annotation, as UDPipe represents the state of the art for multi-level linguistic processing (including tokenization, | ||
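
A minimal sketch of the annotation step using the ufal.udpipe Python bindings; the model file name is an assumption and should be replaced with the English UD model actually used:

<code python>
from ufal.udpipe import Model, Pipeline, ProcessingError

# Load a UD model for English (file name is illustrative).
model = Model.load("english-ewt-ud-2.12.udpipe")
pipeline = Pipeline(model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()

def annotate(text):
    """Tokenize, tag, lemmatize and parse a text; return CoNLL-U output."""
    conllu = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return conllu
</code>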
| + | |||
| + | |||
| + | ==== How to cite AI-Brown ==== | ||
| + | |||
| + | <WRAP round tip 70%> | ||
Milička, J. – Marklová, A. – Cvrček, V. (2025): //AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts//. arXiv preprint: [[https://
| + | |||
| + | Milička, J. – Marklová, A. – Cvrček, V.: //AI-Brown, version 1, 1. 7. 2025//. Department of Linguistics, | ||
| + | </ | ||