Toto je starší verze dokumentu!

AI-Brown

AI-Brown is a generated, annotated, multi-genre corpus of English texts produced by large language models (LLMs).

Name		AI-Brown
Positions	Number of positions (tokens)	27 661 454
	Number of positions (excl. punctuation)	23 975 982
	Number of word forms (excl. punctuation)	125 896
	Number of lemmas (excl. punctuation)	110 835
Further information	Number of sub-corpora	32
	Number of models	16
	Publication year	2025

Modeled on the BE21 Corpus—a modern implementation of the original Brown Corpus—AI-Brown was created to replicate its structure, genre diversity, and linguistic richness, enabling systematic comparisons between human and machine-generated English texts. The corpus comprises outputs from 13 frontier LLMs developed by OpenAI, Anthropic, Meta, Alphabet, and DeepSeek. Each model was prompted using the first 500 words of BE21 text samples, with the remaining portion reserved as human-authored reference material, ensuring genre-aligned and topically consistent comparisons. Like BE21, AI-Brown spans a wide range of contemporary English genres. All generated texts are tokenized, lemmatized, and annotated morphologically and syntactically using the Universal Dependencies framework, and are provided in both plain text and CoNLL-U formats. AI-Brown is the first large-scale English LLM-generated corpus explicitly designed for cross-model and human-machine linguistic analysis.

Corpus preparation

The original reference Koditex Corpus was originally available in vertical format from the Czech National Corpus infrastructure. The pre-processing pipeline involved several steps to prepare prompts suitable for generation. Clean texts and metadata were extracted from the verticals, and structural tags were standardized. Each text sample was divided into two portions to enable prompt-based generation:

Prompt portion: The first 500 words (including punctuation) served as generation prompts

Reference portion: The remaining text (approximately 1,500 words) provided human-authored comparison material

This segmentation strategy ensured that models received sufficient context for generation while maintaining substantial reference text for comparative analysis. Also, the context of 500 words left sufficient space in the context window even for older models (davinci-002 has maximum context of 2049 tokens, while 500 English words takes about 670 tokens).

Importantly, unlike the Koditex corpus, AI-Koditex contains written texts only, one sample per source text to avoid over-representation. The final dataset contains 676 text samples.

Generating corpora

For each model, we generated two versions of each corpus: one using temperature 0 (deterministic generation) and one using temperature 1 (stochastic generation). However, we encountered mode collapse with the oldest model (davinci-002) at zero temperature, resulting in constant repetition of identical sentences. Additionally, this early model failed to produce coherent Czech text.

For base models operating in completion mode (davinci-002, GPT-3.5-turbo, Meta-Llama-3.1-405B), we used only the first portion of each source text as input, allowing the models to function as traditional language models for text prediction.

For instruction-tuned models, we employed minimal system prompts requesting long continuation of given text. Without such prompts, models' default \emph{helpful assistant} persona emerged who typically attempted to analyze, summarize, or answer questions within the source text rather than continuing it. Language-specific challenges emerged during Czech texts generation. Some models refused to cooperate when given Czech system prompts, necessitating English system prompt as quoted below. Despite explicit instructions to generate Czech text, several models sometimes produced English or mixed-language outputs.

We used the following system prompt: Please continue the Czech text in the same language, manner and style, ensuring it contains at least five thousand words. The text does not need to be factually correct, but please make sure it fits stylistically.

To ensure reproducibility, we used random seed 42 for all OpenAI API calls. Unfortunately, other providers do not offer comparable deterministic generation options. Llama generations used 16-bit floating-point quantization (the highest available quality).

API responses were preserved in their entirety, including token probabilities and alternative tokens when available, to enable future analysis of generation uncertainty and model confidence.

Post-processing

Texts that were too short were removed. For instruction-tuned models, the original phrases such as „I'd be happy to continue“ were also removed.

Annotation

We used Universal Dependencies for annotation, as UDPipe represents the state of the art for multi-level linguistic processing (including tokenization, lemmatization, syntax, and morphology). The resulting annotation is in the CoNLL-U format, which is a widely adopted standard compatible with most modern NLP tools.

How to cite AI-Koditex

Milička, J. – Marklová, A. – Cvrček, V. AI-Koditex . Department of Linguistics, Faculty of Arts, Charles University, Prague 2025. Available at WWW: www.korpus.cz

Historie: • veda • aibrown

AI-Brown

Corpus preparation

Generating corpora

Post-processing

Annotation

How to cite AI-Koditex

Hledat

Navigace

Tisk/export

Nástroje

Jazyky

Licence