AI-Brown

AI-Brown is a generated, annotated, multi-genre corpus of English texts produced by large language models (LLMs).

Name		AI-Brown
Positions	Number of positions (tokens)	27 661 454
	Number of positions (excl. punctuation)	23 975 982
	Number of word forms (excl. punctuation)	125 896
	Number of lemmas (excl. punctuation)	110 835
Further information	Number of sub-corpora	32
	Number of models	16
	Publication year	2025
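
The table separates counts with and without punctuation. As a rough illustration only, figures of this kind could be derived from a CoNLL-U file along the following lines. This is a minimal sketch, not the procedure used to build AI-Brown: it assumes punctuation is identified by the UPOS tag `PUNCT`, that word forms are counted case-sensitively, and that multiword-token ranges and empty nodes are skipped. The function name `count_positions` and the `SAMPLE` data are hypothetical.

```python
def count_positions(conllu_text):
    """Count tokens, non-punctuation tokens, distinct forms and lemmas.

    Assumption: punctuation is any token whose UPOS column is "PUNCT".
    """
    tokens = 0
    non_punct = 0
    forms = set()
    lemmas = set()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in tok_id or "." in tok_id:
            continue
        tokens += 1
        if upos != "PUNCT":
            non_punct += 1
            forms.add(form)
            lemmas.add(lemma)
    return {
        "tokens": tokens,
        "tokens_excl_punct": non_punct,
        "word_forms": len(forms),
        "lemmas": len(lemmas),
    }


# Hypothetical one-sentence sample in CoNLL-U format.
SAMPLE = (
    "# text = The cat sat.\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tsat\tsit\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    print(count_positions(SAMPLE))
```

Counting forms case-insensitively, or treating symbols as punctuation, would give different totals, so results from a sketch like this need not match the published figures.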

Modeled on the BE21 Corpus¹⁾, a modern implementation of the original Brown Corpus, AI-Brown was created to replicate its structure, genre diversity, and linguistic richness, enabling systematic comparisons between human-written and machine-generated English texts. The corpus comprises outputs from 13 frontier LLMs developed by OpenAI, Anthropic, Meta, Alphabet, and DeepSeek. Each model was prompted with the first 500 words of a BE21 text sample, and the remainder of the sample was reserved as human-authored reference material, so that comparisons are genre-aligned and topically consistent. Like BE21, AI-Brown spans a wide range of contemporary English genres. All generated texts are tokenized, lemmatized, and annotated morphologically and syntactically using the Universal Dependencies framework, and are provided in both plain text and CoNLL-U formats. AI-Brown is a large-scale corpus of LLM-generated English explicitly designed for cross-model and human-machine linguistic analysis.
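
The prompt/reference split described above can be sketched as follows. This is an illustrative assumption rather than the corpus authors' code: it treats a "word" as a whitespace-delimited string, which the description does not specify, and the function name `split_sample` is hypothetical.

```python
def split_sample(text, prompt_words=500):
    """Split a text sample into a prompt and a held-out reference.

    Assumption: words are whitespace-delimited; the actual word
    definition used for AI-Brown is not stated in the description.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])      # shown to the model
    reference = " ".join(words[prompt_words:])   # kept as human-authored reference
    return prompt, reference


if __name__ == "__main__":
    # Hypothetical 1,200-word sample made of placeholder words.
    sample = " ".join(f"w{i}" for i in range(1200))
    prompt, reference = split_sample(sample)
    print(len(prompt.split()), len(reference.split()))
```

A sample shorter than the prompt length yields an empty reference, so such samples would need separate handling in practice.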

¹⁾ Baker, P. (2023). A year to remember? Introducing the BE21 corpus and exploring recent part of speech tag change in British English. International Journal of Corpus Linguistics.
