Differences

This shows you the differences between two versions of the page.

en:cnk:obc [2020/02/14 11:08] – [OBC: The Old Bailey Corpus 2.0] Michal Křen
en:cnk:obc [2021/02/10 15:39] (current) – [How to cite] Michal Křen
Line 1: Line 1:
 ====== OBC: The Old Bailey Corpus 2.0 ====== ====== OBC: The Old Bailey Corpus 2.0 ======
  
-The [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]] is an annotated corpus based on the [[http://www.oldbaileyonline.org|Proceedings of Old Bailey]]. These speech-related texts document Late Modern English as used in London’s Central Criminal Court from 1674 to 1913. OBC has been adapted by CNC for the use in KonText.
+The [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|Old Bailey Corpus]] is a sociolinguistically, pragmatically and textually annotated corpus based on a selection of the [[http://www.oldbaileyonline.org|Proceedings of Old Bailey]]. It consists of 637 texts recording trial proceedings which took place between 1720 and 1913 at the Old Bailey, London. The corpus contains more than 24 million words; its overall size is over 35 million tokens (including words, punctuation, etc.). More detailed information about the corpus is available [[http://fedora.clarin-d.uni-saarland.de/oldbailey/index.html|here]], as well as in the official [[https://fedora.clarin-d.uni-saarland.de/oldbailey/downloads/OBC_2.0_Manual%202016-07-13.pdf|OBC Manual]].
  
 The corpus is licensed under [[https://creativecommons.org/licenses/by-nc-sa/4.0/|CC BY-NC-SA 4.0]]. The corpus is licensed under [[https://creativecommons.org/licenses/by-nc-sa/4.0/|CC BY-NC-SA 4.0]].
 +
 +{{:en:obc01.png?400|}}
 +
 +//Front matter of the Proceedings of the Old Bailey, 18th February 1830, page 2//
 +===== The digitalization process =====
 +
 +The original pages of the Proceedings were scanned, and the scans are now available at [[https://www.oldbaileyonline.org/index.jsp|Old Bailey Online]]; you can access individual scans by clicking on the “see original” link to the right of the text of any trial (e.g. [[https://www.oldbaileyonline.org/browse.jsp?div=t18020217-3|here]]). The texts were then manually transcribed by multiple typists, and optical character recognition (OCR) software was used to create transcriptions for comparison, so that any differences or inaccuracies could be resolved. However, as the original pages are often faded or otherwise damaged (see, for example, [[https://www.oldbaileyonline.org/images.jsp?doc=OA174005070012|here]]), 100% accuracy of the transcriptions cannot always be guaranteed. Users are therefore advised to consult the scanned pages when a very precise reading is required. More on the digitalization process [[https://www.oldbaileyonline.org/static/Project.jsp#methods|here]].
 +
 +The texts were marked up in XML (Extensible Markup Language) according to the [[https://tei-c.org/|TEI]] (Text Encoding Initiative) guidelines.
 +
 +Each //doc// structure represents one proceeding and consists of multiple //text// structures. The first of these is usually the front matter (the //type// attribute indicates what kind of //text// it is), and the following ones contain the trial accounts themselves.
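The //doc// / //text// nesting described above can be illustrated with a minimal Python sketch. It is not the corpus's actual file format: the sample markup, the ''id'' attribute, the ''corpus'' wrapper element and the ''type'' values ''frontMatter'' and ''trialAccount'' are all assumptions made for the example. Only the names //doc//, //text// and //type// come from the description. The sketch separates front matter from trial accounts by the //type// attribute rather than by position, since the front matter is only //usually// first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the wrapper element, the id attribute and the
# type values are illustrative assumptions, not the real corpus encoding.
SAMPLE = """
<corpus>
  <doc id="example-proceeding">
    <text type="frontMatter">THE PROCEEDINGS ...</text>
    <text type="trialAccount">First trial ...</text>
    <text type="trialAccount">Second trial ...</text>
  </doc>
</corpus>
"""


def split_proceeding(doc):
    """Split one doc into (front matter texts, other texts) using the
    type attribute instead of relying on the order of the texts."""
    front, trials = [], []
    for text in doc.findall("text"):
        if text.get("type") == "frontMatter":
            front.append(text)
        else:
            trials.append(text)
    return front, trials


root = ET.fromstring(SAMPLE)
for doc in root.iter("doc"):
    front, trials = split_proceeding(doc)
    # Report how many structures of each kind the proceeding contains.
    print(doc.get("id"), len(front), len(trials))
```

Checking the attribute instead of assuming that the first //text// is the front matter keeps the sketch correct for proceedings where the front matter is missing or not in first position.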
 +
 +Each text of the OBC is annotated with metadata, including the date of the trial, the year of publication, the categories and subcategories of the offences, the verdicts, and the punishments.
 +
 +In the trial account, direct speech is tagged as individual //utterance//s; each utterance in the text is also tagged with various metadata, such as the gender, age, occupation (see [[https://iisg.amsterdam/en/data/data-websites/history-of-work|HISCO]]), and social class (see HISCLASS FIXME!) of the speaker of the utterance, the speaker’s role in the court, and the scribe, the printer, and the publisher of the individual proceedings. More information about the metadata can be found in [[en:obc:intro_to_metadata|Lesson 5]].
 + 
 +Individual words are assigned part-of-speech (POS) tags according to the [[http://ucrel.lancs.ac.uk/claws7tags.html|CLAWS 7]] tagset; more information on the POS tagging process is available [[http://ucrel.lancs.ac.uk/claws/|here]].
 +
 +Please note that we have changed some of the tagging of the original corpus by Huber, Nissel and Puga. In the original data, some attributes, such as offences and verdicts, marked only those parts of the proceedings that spelled them out. For example, when the text noted that a particular defendant was charged with murder, the word "murder" or the sentence containing it would be tagged as an offence with the attribute value "murder". We copied these attributes to the enclosing //text// structure. This makes it much easier to form queries such as “find all adjectives spoken by female defendants in trials concerned with murder and ending in acquittal”, although it also causes certain problems when multiple different offences, verdicts or punishments are mentioned in the same trial (see lessons [[en:obc:intro_to_metadata|5]] and [[en:obc:specific_query|6]]).
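To make the effect of copying the attributes to //text// more concrete, here is a small Python sketch of the example query over a toy in-memory token list. It is an illustration only, not how KonText evaluates queries. The field names and the values ''female'', ''defendant'', ''murder'', ''theft'', ''guilty'' and ''notGuilty'' are hypothetical, and the sketch assumes the adjective tags of the CLAWS 7 tagset are JJ, JJR, JJT and JK.

```python
from dataclasses import dataclass

# Adjective tags in the CLAWS 7 tagset (general, comparative,
# superlative and catenative adjectives).
ADJECTIVE_TAGS = {"JJ", "JJR", "JJT", "JK"}


@dataclass
class Token:
    """One corpus position with hypothetical metadata fields."""
    word: str
    pos: str                # POS tag of the word
    speaker_gender: str     # utterance-level metadata
    speaker_role: str       # utterance-level metadata
    offences: frozenset     # text-level: every offence mentioned in the trial
    verdicts: frozenset     # text-level: every verdict mentioned in the trial


def adjectives_by(tokens, gender, role, offence, verdict):
    """Return adjectives spoken by the given kind of speaker in trials
    whose text-level attributes include the given offence and verdict."""
    return [
        t.word
        for t in tokens
        if t.pos in ADJECTIVE_TAGS
        and t.speaker_gender == gender
        and t.speaker_role == role
        and offence in t.offences
        and verdict in t.verdicts
    ]


# Toy data; all attribute values are invented for the illustration.
single = dict(offences=frozenset({"murder"}), verdicts=frozenset({"notGuilty"}))
mixed = dict(offences=frozenset({"murder", "theft"}),
             verdicts=frozenset({"guilty", "notGuilty"}))
tokens = [
    Token("innocent", "JJ", "female", "defendant", **single),
    Token("am", "VBM", "female", "defendant", **single),
    Token("honest", "JJ", "male", "witness", **single),
    Token("poor", "JJ", "female", "defendant", **mixed),
]

result = adjectives_by(tokens, "female", "defendant", "murder", "notGuilty")
print(result)
```

The last token shows the problem mentioned above: its trial lists two offences and two verdicts, so it matches the query even though the text-level attributes alone do not say whether the acquittal applied to the murder charge or to the theft charge.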
 +
 +{{:en:obc02.png?400|}}
 +
 +//Trials 652-5 in Proceedings of the Old Bailey, 18th February 1830, page 73//
 +
 +
 +
 +
  
 ===== Wiki course ===== ===== Wiki course =====
Line 9: Line 36:
 For a basic overview of how to use the OBC corpus and how to input the data into the search interface check our wiki-course in eight lessons: For a basic overview of how to use the OBC corpus and how to input the data into the search interface check our wiki-course in eight lessons:
  
-  * [[en:eebo:first_query|Lesson 1 (First query)]] +  * [[en:obc:query_types|Lesson 1 (Query types)]] 
-  * [[en:eebo:orthography_spelling|Lesson 2 (Orthography and Spelling)]] +  * [[en:obc:spelling|Lesson 2 (Spelling)]] 
-  * [[en:eebo:competing_forms|Lesson 3 (Competing forms)]] +  * [[en:obc:spell2|Lesson 3 (Spelling variation continued)]] 
-  * [[en:eebo:specify_query|Lesson 4 (Specify query)]] +  * [[en:obc:spell3|Lesson 4 (Spelling III: Searching with tags)]] 
-  * [[en:eebo:collocations|Lesson 5 (Collocations)]] +  * [[en:obc:intro_to_metadata|Lesson 5 (Introduction to metadata)]] 
-  * [[en:eebo:morphology1|Lesson 6 (Morphology I)]] +  * [[en:obc:specific_query|Lesson 6 (Specify query: Metadata continued)]] 
-  * [[en:eebo:morphology2|Lesson 7 (Morphology II)]] +  * [[en:obc:frequency_distribution|Lesson 7 (Two-attribute interrelationship frequency distribution)]] 
-  * [[en:eebo:multiword|Lesson 8 (Multiword expressions)]] +  * [[en:obc:collocations|Lesson 8 (Collocations)]]
  
 ===== How to cite ===== ===== How to cite =====
  
 <WRAP round tip 70%> <WRAP round tip 70%>
-//OBC: The Old Bailey Corpus 2.0//. Ústav Českého národního korpusu FF UK, Prague 2020. Available from WWW: http://www.korpus.cz
+//OBC: The Old Bailey Corpus 2.0//. Ústav Českého národního korpusu FF UK, Prague 2021. Available from WWW: http://www.korpus.cz
  
 **The original Old Bailey Corpus**: Huber, M. - Nissel, M. - Puga, K. (2016): //Old Bailey Corpus 2.0//. [[http://hdl.handle.net/11858/00-246C-0000-0023-8CFB-2|hdl:11858/00-246C-0000-0023-8CFB-2]] **The original Old Bailey Corpus**: Huber, M. - Nissel, M. - Puga, K. (2016): //Old Bailey Corpus 2.0//. [[http://hdl.handle.net/11858/00-246C-0000-0023-8CFB-2|hdl:11858/00-246C-0000-0023-8CFB-2]]