Differences

This page shows the differences between the selected revision and the current version of the page.

cnk:aibrown [2025/06/27 14:03] – [How to cite AI-Koditex] annamarklova
cnk:aibrown [2025/10/13 14:10] (current) – [How to cite AI-Brown] jirimilicka
Line 5: Line 5:
  
  
-<WRAP right 35%>
+<WRAP right 40%>
-^ <fs medium>Name</fs> ^^ <fs medium>AI-Brown</fs> ^
+^ <fs medium>Name</fs> ^^ <fs medium>AI-Brown v1</fs> ^
 ^ Positions ^ Number of positions (tokens) |  27 661 454 |   ^ Positions ^ Number of positions (tokens) |  27 661 454 |  
 ^ ::: ^ Number of positions (excl. punctuation) |  23 975 982 | ^ ::: ^ Number of positions (excl. punctuation) |  23 975 982 |
Line 16: Line 16:
 </WRAP> </WRAP>
  
-Modeled on the BE21 Corpus—a modern implementation of the original Brown Corpus—AI-Brown was created to replicate its structure, genre diversity, and linguistic richness, enabling systematic comparisons between human and machine-generated English texts. The corpus comprises outputs from 13 frontier LLMs developed by OpenAI, Anthropic, Meta, Alphabet, and DeepSeek. Each model was prompted using the first 500 words of BE21 text samples, with the remaining portion reserved as human-authored reference material, ensuring genre-aligned and topically consistent comparisons. Like BE21, AI-Brown spans a wide range of contemporary English genres. All generated texts are tokenized, lemmatized, and annotated morphologically and syntactically using the Universal Dependencies framework, and are provided in both plain text and CoNLL-U formats. AI-Brown is a large-scale English LLM-generated corpus explicitly designed for cross-model and human-machine linguistic analysis.+Modeled on the BE21 Corpus((Baker, P. (2023) A year to remember? Introducing the BE21 corpus and exploring recent part of speech tag change in British English. International Journal of Corpus Linguistics.))  — a modern implementation of the original Brown Corpus—AI-Brown was created to replicate its structure, genre diversity, and linguistic richness, enabling systematic comparisons between human and machine-generated English texts. The corpus comprises outputs from 13 frontier LLMs developed by OpenAI, Anthropic, Meta, Alphabet, and DeepSeek. Each model was prompted using the first 500 words of BE21 text samples, with the remaining portion reserved as human-authored reference material, ensuring genre-aligned and topically consistent comparisons. Like BE21, AI-Brown spans a wide range of contemporary English genres. All generated texts are tokenized, lemmatized, and annotated morphologically and syntactically using the Universal Dependencies framework, and are provided in both plain text and CoNLL-U formats. 
AI-Brown is a large-scale English LLM-generated corpus explicitly designed for cross-model and human-machine linguistic analysis.
  
  
Line 22: Line 22:
  
  
-The original reference BE21 Corpus was available in vertical format via the Czech National Corpus infrastructure. The preprocessing pipeline included several steps to prepare the data for prompt-based generation. Clean texts and metadata were extracted from the verticals, and structural tags were aligned with the Czech corpus format to ensure cross-linguistic consistency.
+The preprocessing pipeline for the original reference BE21 corpus included several steps to prepare the data for prompt-based generation. Clean texts and metadata were extracted from the verticals, and structural tags were aligned with the Czech corpus format to ensure cross-linguistic consistency.
  
 Each BE21 text sample was split into two parts to support controlled generation: Each BE21 text sample was split into two parts to support controlled generation:
Line 53: Line 53:
  
  
-==== How to cite AI-Koditex ====
+==== How to cite AI-Brown ====
  
 <WRAP round tip 70%> <WRAP round tip 70%>
-Milička, J. – Marklová, A. – Cvrček, V.// AI-Brown //. Department of Linguistics, Faculty of Arts, Charles University, Prague 2025. Available at WWW: www.korpus.cz
+Milička, J. – Marklová, A. – Cvrček, V. (2025): //AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts//. Arxiv preprint: [[https://arxiv.org/abs/2509.22996]]
 + 
 +Milička, J. – Marklová, A. – Cvrček, V.: //AI-Brown, version 1, 1. 7. 2025//. Department of Linguistics, Faculty of Arts, Charles University, Prague 2025. Available at WWW: www.korpus.cz
 </WRAP> </WRAP>