OBC: The Old Bailey Corpus 2.0

The Old Bailey Corpus is a sociolinguistically, pragmatically and textually annotated corpus based on a selection of the Proceedings of the Old Bailey. It consists of 637 texts recording trial proceedings that took place between 1720 and 1913 at the Old Bailey, London. The corpus contains more than 24 million words; its overall size is over 35 million tokens (words, punctuation, etc.). More detailed information about the corpus is available here, as well as in the official OBC Manual.

The corpus is licensed under CC BY-NC-SA 4.0.

Front matter of the Proceedings of the Old Bailey, 18th February 1830, page 2

The digitalization process

The original pages of the Proceedings were scanned, and the scans are now available at Old Bailey Online; you can access individual scans by clicking on the “see original” link to the right of the text of any trial (e.g. here). The texts were then manually transcribed by multiple typists, and optical character recognition (OCR) software was used to create transcriptions for comparison, so that any differences or inaccuracies could be resolved. However, as the original pages are often faded or otherwise damaged (see, for example, here), 100% accuracy of the transcriptions cannot be guaranteed. Users are therefore advised to consult the scanned pages when a very precise reading is required. More on the digitalization process here.

The texts were marked up in XML (Extensible Markup Language) according to the TEI (Text Encoding Initiative) guidelines (https://tei-c.org/).

Each doc structure represents one proceeding and consists of multiple text structures. The first of these is usually the front matter (the type attribute identifies what each text structure contains), and the following ones contain the trial accounts themselves.
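The nesting of doc and text structures can be pictured with a small Python sketch. The sample markup is invented for illustration: the id attribute and the type values frontMatter and trialAccount are assumptions, not a copy of the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Invented sample: one proceeding (doc) holding three text structures.
# The id attribute and the type values are assumptions made for this sketch.
SAMPLE = """<doc id="example">
  <text type="frontMatter">Title page of the proceeding</text>
  <text type="trialAccount">Account of the first trial</text>
  <text type="trialAccount">Account of the second trial</text>
</doc>"""


def split_proceeding(doc_el):
    """Separate the front matter from the trial accounts by the type attribute."""
    texts = doc_el.findall("text")
    front = [t for t in texts if t.get("type") == "frontMatter"]
    trials = [t for t in texts if t.get("type") == "trialAccount"]
    return front, trials


doc = ET.fromstring(SAMPLE)
front, trials = split_proceeding(doc)
print(len(front), "front matter text(s),", len(trials), "trial account(s)")
```

Selecting by the type attribute rather than by position avoids relying on the front matter always coming first.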

Each text of the OBC is annotated with metainformation, including the date of the trial, the year of publication, the categories and subcategories of the offences, the verdicts, and the punishments.

In the trial account, direct speech is divided into individual utterances; each utterance in the text is also tagged for various metadata, such as the gender, age, occupation (see HISCO), and social class (see HISCLASS) of the speaker of the utterance, the speaker’s role in the court, the scribe, the printer, and the publisher of the individual proceedings. More information about the metadata can be found in Lesson 5.

Individual words are assigned part-of-speech (POS) tags according to the CLAWS 7 tagset; more information on the POS tagging process is available here.
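As a rough illustration of how the utterance metadata and the POS tags combine, the following Python sketch filters a handful of invented tokens. The Token class and the attribute names sex and role are hypothetical stand-ins for the utterance-level attributes; the POS tags themselves are genuine CLAWS 7 tags, in which the adjective tags JJ, JJR and JJT all begin with JJ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pos: str   # CLAWS 7 tag
    sex: str   # hypothetical attribute inherited from the enclosing utterance
    role: str  # hypothetical attribute inherited from the enclosing utterance


# Invented tokens from two imaginary utterances.
TOKENS = [
    Token("I", "PPIS1", "f", "defendant"),
    Token("am", "VBM", "f", "defendant"),
    Token("innocent", "JJ", "f", "defendant"),
    Token("a", "AT1", "m", "witness"),
    Token("dark", "JJ", "m", "witness"),
    Token("night", "NNT1", "m", "witness"),
]


def adjectives_by(tokens, sex, role):
    """Return adjectives (tags starting with JJ) spoken by the given speaker type."""
    return [
        t.word
        for t in tokens
        if t.pos.startswith("JJ") and t.sex == sex and t.role == role
    ]


print(adjectives_by(TOKENS, "f", "defendant"))
```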

Please note that we have changed some of the tagging of the original corpus by Huber, Nissel and Puga. In the original data, some attributes, such as offences and verdicts, were attached to the parts of the proceedings that spelled them out. For example, when the text noted that a particular defendant was charged with murder, the word murder or the sentence containing it would be tagged as an offence with the attribute murder. We copied these attributes to the text structure, which makes it much easier to form queries such as “find all adjectives spoken by female defendants in trials concerned with murder and ending in acquittal”. However, it also causes certain problems when multiple different offences, verdicts or punishments are mentioned in the same trial (see Lessons 5 and 6).
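The idea of copying attributes from inline tags up to the enclosing text can be sketched in Python as follows. This is not the conversion actually used for the corpus: the element names, the category attribute, its values, and the choice to join multiple values with "|" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: inline tags mark where an offence or verdict is spelled out.
# Element names, the category attribute and its values are assumptions.
SAMPLE = """<doc>
  <text type="trialAccount">
    <offence category="murder">wilful murder</offence>
    <verdict category="notGuilty">acquitted</verdict>
  </text>
  <text type="trialAccount">
    <offence category="theft">stealing a watch</offence>
    <offence category="fraud">obtaining goods by deception</offence>
    <verdict category="guilty">guilty</verdict>
  </text>
</doc>"""


def lift_attributes(text_el, tags=("offence", "verdict", "punishment"), sep="|"):
    """Copy the categories of inline tags onto the enclosing text element.

    Joining several values with a separator is an assumption of this sketch;
    it shows why trials with multiple offences are harder to query.
    """
    for tag in tags:
        values = []
        for el in text_el.iter(tag):
            value = el.get("category")
            if value and value not in values:
                values.append(value)
        if values:
            text_el.set(tag, sep.join(values))


root = ET.fromstring(SAMPLE)
texts = root.findall("text")
for text_el in texts:
    lift_attributes(text_el)
    print(text_el.attrib)
```

In the second sample trial the text ends up with two offence values, so a query for a single offence has to decide whether a partial match counts. This is the kind of ambiguity the paragraph above refers to.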

Trials 652-5 in Proceedings of the Old Bailey, 18th February 1830, page 73

Wiki course

How to cite

OBC: The Old Bailey Corpus 2.0. Ústav Českého národního korpusu FF UK, Prague 2021. Available from WWW: http://www.korpus.cz

The original Old Bailey Corpus: Huber, M. - Nissel, M. - Puga, K. (2016): Old Bailey Corpus 2.0. hdl:11858/00-246C-0000-0023-8CFB-2

The Old Bailey Proceedings Online: Hitchcock, T. - Shoemaker, R. - Emsley, C. - Howard, S. - McLaughlin, J. et al. (2012): The Old Bailey Proceedings Online, 1674-1913. www.oldbaileyonline.org, version 7.0, 24 March 2012.