~~NOTOC~~
====== AI-Brown ======
  
  
<WRAP right 40%>
^ <fs medium>Name</fs> ^^ <fs medium>AI-Brown v1</fs> ^
^ Positions ^ Number of positions (tokens) |  27 661 454 |
^ ::: ^ Number of positions (excl. punctuation) |  23 975 982 |
  
  
===== Corpus preparation =====


The preprocessing pipeline for the original reference BE21 corpus included several steps to prepare the data for prompt-based generation. Clean texts and metadata were extracted from the verticals, and structural tags were aligned with the Czech corpus format to ensure cross-linguistic consistency.
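For illustration, the extraction step might look as follows. This is only a minimal sketch: it assumes a typical vertical format with one token per line (word form in the first tab-separated column) and structural tags for documents and sentences on separate lines; the actual column layout and attributes of BE21 may differ.

<code python>
import re

def vertical_to_plain_text(path):
    """Extract plain text and per-document metadata from a corpus vertical file."""
    texts, metadata, tokens = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("<doc"):
                # New document: keep attributes such as id="..." as metadata.
                metadata.append(dict(re.findall(r'(\w+)="([^"]*)"', line)))
                tokens = []
            elif line == "</doc>":
                texts.append(" ".join(tokens))
                tokens = []
            elif line.startswith("<"):
                continue  # other structural tags (sentence, paragraph boundaries)
            elif line:
                tokens.append(line.split("\t")[0])  # word form is the first column
    return texts, metadata
</code>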

Each BE21 text sample was split into two parts to support controlled generation:

  * Prompt portion: The first 500 words (including punctuation) served as generation prompts

  * Reference portion: The remaining text (approximately 1,500 words) provided human-authored comparison material

This segmentation ensured that models received sufficient context for meaningful text generation while preserving a substantial portion of reference text for evaluation. A 500-word prompt fits comfortably within the input limits of older models (e.g., davinci-002, which has a 2,049-token context window), using roughly 670 tokens.
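A minimal sketch of this segmentation, assuming whitespace-based word counting and the ''tiktoken'' library with the ''cl100k_base'' encoding as a stand-in for the actual tokenizers (both are simplifying assumptions):

<code python>
import tiktoken

PROMPT_WORDS = 500

def split_sample(text):
    """Split one BE21 sample into a 500-word prompt and the human-written remainder."""
    words = text.split()
    prompt = " ".join(words[:PROMPT_WORDS])
    reference = " ".join(words[PROMPT_WORDS:])
    return prompt, reference

def count_tokens(text, encoding_name="cl100k_base"):
    """Rough token count of the prompt portion."""
    return len(tiktoken.get_encoding(encoding_name).encode(text))
</code>

For a typical 500-word prompt, such a count comes out at roughly the 670 tokens mentioned above, well below the 2,049-token window of davinci-002.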

To maintain comparability with the Czech AI corpus [[cnk:aikoditex|AI-Koditex]] and avoid over-representation, we selected written texts only, limiting the sample to one excerpt per source text. The final AI-Brown dataset contains 500 samples, matching the original structure of BE21.

===== Generating corpora =====

For each model, we generated two versions of each corpus: one using temperature 0 (deterministic generation) and one using temperature 1 (stochastic generation). However, we encountered mode collapse with the oldest model (davinci-002) at zero temperature, resulting in constant repetition of identical sentences.

For base models operating in completion mode (davinci-002, GPT-3.5-turbo, Meta-Llama-3.1-405B), we used only the first portion of each source text as input, allowing the models to function as traditional language models for text prediction. For instruction-tuned models, we employed a minimal system prompt requesting a long continuation of the given text. Without such a prompt, the models' default //helpful assistant// persona emerged, which typically attempted to analyze, summarize, or answer questions posed within the source text rather than continuing it. We used the following system prompt: //Please continue the text in the same manner and style, ensuring it contains at least five thousand words. The text does not need to be factually correct, but please make sure it fits stylistically.//

To ensure reproducibility, we used random seed 42 for all OpenAI API calls. Unfortunately, other providers do not offer comparable deterministic generation options. Llama generations used 16-bit floating-point quantization (the highest available quality).
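The two generation modes can be sketched with the ''openai'' Python client as follows. Only the system prompt, the two temperature settings, and seed 42 come from the description above; the model identifiers, token limit, and helper names are illustrative assumptions.

<code python>
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Please continue the text in the same manner and style, ensuring it contains "
    "at least five thousand words. The text does not need to be factually correct, "
    "but please make sure it fits stylistically."
)

def generate_completion(prompt, model="davinci-002", temperature=1.0):
    """Base models in completion mode: the prompt portion is continued directly."""
    return client.completions.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=1024,  # illustrative limit, not the value used for AI-Brown
        seed=42,          # seed 42 was used for all OpenAI API calls
    )

def generate_chat(prompt, model="gpt-4o", temperature=1.0):
    """Instruction-tuned models: minimal system prompt asking for a long continuation."""
    return client.chat.completions.create(
        model=model,      # illustrative model id
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        seed=42,
    )

# Two corpus versions per model: deterministic (temperature 0) and stochastic (temperature 1).
for temperature in (0.0, 1.0):
    response = generate_chat("(prompt portion of a BE21 sample)", temperature=temperature)
</code>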

API responses were preserved in their entirety, including token probabilities and alternative tokens when available, to enable future analysis of generation uncertainty and model confidence.
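For example, alternative tokens can be requested on the chat endpoint via ''logprobs'' and the raw response stored as JSON; in this sketch, the model id, file name, and number of alternatives are arbitrary illustrative choices.

<code python>
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",                                    # illustrative model id
    messages=[{"role": "user", "content": "(prompt)"}],
    logprobs=True,
    top_logprobs=5,                                    # up to five alternatives per token
    seed=42,
)

# Keep the complete response object, including token probabilities, for later analysis.
with open("sample_0001_temp1.json", "w", encoding="utf-8") as f:
    f.write(response.model_dump_json(indent=2))
</code>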

===== Post-processing =====

Texts that were too short were removed. For instruction-tuned models, introductory phrases such as "I'd be happy to continue" were also removed.
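A sketch of such a filter; the minimum length and the list of introductory phrases are illustrative assumptions, not the exact criteria applied to AI-Brown.

<code python>
MIN_WORDS = 200                      # illustrative threshold for "too short"
BOILERPLATE_OPENERS = (              # illustrative assistant-style preambles
    "I'd be happy to continue",
    "Sure, here is the continuation",
    "Certainly!",
)

def postprocess(text):
    """Drop too-short generations and strip assistant-style introductory phrases."""
    for opener in BOILERPLATE_OPENERS:
        if text.startswith(opener):
            text = text[len(opener):].lstrip(" ,.:;\n")
            break
    if len(text.split()) < MIN_WORDS:
        return None                  # too short: exclude from the corpus
    return text
</code>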

===== Annotation =====

We used Universal Dependencies for annotation, as UDPipe represents the state of the art for multi-level linguistic processing (including tokenization, lemmatization, syntax, and morphology). The resulting annotation is in the CoNLL-U format, which is a widely adopted standard compatible with most modern NLP tools.
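CoNLL-U output can be obtained, for instance, from the public UDPipe web service; in the sketch below, the LINDAT endpoint and the model selector are assumptions (the actual model name should be taken from the service's model list), not necessarily the setup used for AI-Brown.

<code python>
import requests

UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

def annotate_conllu(text, model="english"):
    """Tokenize, tag, and parse raw text into CoNLL-U via the UDPipe web service."""
    response = requests.post(
        UDPIPE_URL,
        data={
            "model": model,   # assumed selector; check the service's model list
            "tokenizer": "",
            "tagger": "",
            "parser": "",
            "data": text,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["result"]  # CoNLL-U as a single string
</code>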


==== How to cite AI-Brown ====

<WRAP round tip 70%>
Milička, J. – Marklová, A. – Cvrček, V. (2025): //AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts//. arXiv preprint: [[https://arxiv.org/abs/2509.22996]]

Milička, J. – Marklová, A. – Cvrček, V.: //AI-Brown, version 1, 1. 7. 2025//. Department of Linguistics, Faculty of Arts, Charles University, Prague 2025. Available at WWW: www.korpus.cz
</WRAP>