Differences

This shows you the differences between two versions of the page.

en:obc:intro_to_metadata [2020/02/17 12:51] Jan Kocek
en:obc:intro_to_metadata [2021/02/10 18:26] (current) Michal Křen
Line 10: Line 10:
 |//collection//                |text collection                                       | |//collection//                |text collection                                       |
 |//date//                      |date of publishing                                    | |//date//                      |date of publishing                                    |
 +|//decade//                      |decade of publishing                                    |
 |//id//                        |identification number                                 | |//id//                        |identification number                                 |
 |//offenceCategory //           |type of offence(s) committed by the defendant(s)      | |//offenceCategory //           |type of offence(s) committed by the defendant(s)      |
 |//offenceSubcategory //        |subtype of offence(s) committed by the defendant(s)   | |//offenceSubcategory //        |subtype of offence(s) committed by the defendant(s)   |
 +|//period//                      |period of publishing                                    |
 |//punishmentCategory //        |type of punishment(s) inflicted on the defendant(s)   | |//punishmentCategory //        |type of punishment(s) inflicted on the defendant(s)   |
 |//punishmentSubcategory //     |subtype of punishment(s) inflicted on the defendant(s)| |//punishmentSubcategory //     |subtype of punishment(s) inflicted on the defendant(s)|
Line 24: Line 26:
 There are two issues to be addressed: firstly, it is the fact that not every variable is always known, therefore some information may occasionally be missing. There are two issues to be addressed: firstly, it is the fact that not every variable is always known, therefore some information may occasionally be missing.
  
-Secondly, oftentimes the trials involved multiple defendants (and hence multiple offences, punishments, victims etc.), and the way the trials were recorded and later tagged makes it often impossible (without reading the full trial accounts and sometimes not even after that) to distinguish which offence, verdict, punishment etc. are assigned to which defendant or which defendant is speaking at a given moment (more on this topic in Lesson 6).+Secondly, oftentimes the trials involved multiple defendants (and hence multiple offences, punishments, victims etc.), and the way the trials were recorded and later tagged makes it often impossible (without reading the full trial accounts and sometimes not even after that) to distinguish which offence, verdict, punishment etc. are assigned to which defendant or which defendant is speaking at a given moment (more on this topic in [[en:obc:specific_query|Lesson 6]]).
  
 The direct speech in the text is tagged as individual utterances, which are assigned the following parameters: The direct speech in the text is tagged as individual utterances, which are assigned the following parameters:
-  * Sociobiographical: gender, age, occupation (see [[https://iisg.amsterdam/en/data/data-websites/history-of-work|HISCO]]), and social class (see [[file:///T:/data/_wiki_grafika/korpus_OBC/hisclass-brief.doc|HISCLASS]]) of the speaker of the utterance+  * Sociobiographical: gender, age, occupation (see [[https://iisg.amsterdam/en/data/data-websites/history-of-work|HISCO]]), and social class (see [[https://iisg.amsterdam/en/detail?id=https%3A%2F%2Fiisg.amsterdam%2Fid%2Fdataset%2F364|HISCLASS]]) of the speaker of the utterance
   * Pragmatic: speaker’s role in the court (defendant, lawyer, judge, witness etc.)   * Pragmatic: speaker’s role in the court (defendant, lawyer, judge, witness etc.)
   * Textual: scribe, printer, publisher of the individual proceedings (these are already provided in the metadata of the text, but providing these parameters at the utterance level makes some type of queries much simpler)   * Textual: scribe, printer, publisher of the individual proceedings (these are already provided in the metadata of the text, but providing these parameters at the utterance level makes some type of queries much simpler)
Line 34: Line 36:
  
 |**“utterance” structure attributes**|**Description**                           |**“utterance” structure attributes**|**Description**                                                                                                                                                                                  | |**“utterance” structure attributes**|**Description**                           |**“utterance” structure attributes**|**Description**                                                                                                                                                                                  |
-|//editor //                         |editor of the text                        |//speaker_hisclass //               |social class of the speaker (according to [[file:///C:/Users/terez/Documents/VŠ/OBC/hisclass-brief.doc|HISCLASS]]<html><span style="color:#0563c1;"></html>)<html></u></html><html></span></html>|+|//editor //                         |editor of the text                        |//speaker_hisclass //               |social class of the speaker (according to [[https://iisg.amsterdam/en/detail?id=https%3A%2F%2Fiisg.amsterdam%2Fid%2Fdataset%2F364|HISCLASS]]<html><span style="color:#0563c1;"></html>)<html></u></html><html></span></html>|
 |//id //                             |identifier of the utterance               |//speaker_hiscoapprentice //        |is the speaker an apprentice?                                                                                                                                                                    | |//id //                             |identifier of the utterance               |//speaker_hiscoapprentice //        |is the speaker an apprentice?                                                                                                                                                                    |
 |//n //                              |number of the utterance in the proceedings|//speaker_hiscocode //              |code of the occupation of the speaker (according to [[https://iisg.amsterdam/en/data/data-websites/history-of-work|HISCO]])                                                                      | |//n //                              |number of the utterance in the proceedings|//speaker_hiscocode //              |code of the occupation of the speaker (according to [[https://iisg.amsterdam/en/data/data-websites/history-of-work|HISCO]])                                                                      |
Line 41: Line 43:
 |//printer //                        |printer of the text                       |//speaker_role //                   |the speaker’s role at the trial                                                                                                                                                                  | |//printer //                        |printer of the text                       |//speaker_role //                   |the speaker’s role at the trial                                                                                                                                                                  |
 |//publisher //                      |publisher of the text                     |//speaker_sex //                    |sex of the speaker                                                                                                                                                                               | |//publisher //                      |publisher of the text                     |//speaker_sex //                    |sex of the speaker                                                                                                                                                                               |
-|//scribe //                         |scribe                                    |//text_decade //                    |decade containing the year of publication of the text                                                                                                                                            |+|//scribe //                         |scribe                                    |//decade //                    |decade containing the year of publication of the text                                                                                                                                            |
 |//speaker_age //                    |age of the speaker                        |//trial //                          |trial identifier                                                                                                                                                                                 | |//speaker_age //                    |age of the speaker                        |//trial //                          |trial identifier                                                                                                                                                                                 |
 |//speaker_class //                  |social class of the speaker (high/low)    |//wc //                             |word count of the utterance                                                                                                                                                                      | |//speaker_class //                  |social class of the speaker (high/low)    |//wc //                             |word count of the utterance                                                                                                                                                                      |
Line 52: Line 54:
 **Searching the corpus** **Searching the corpus**
  
-Verbs in the progressive passive tense are formed by the auxiliary verb //be// followed by the present participle form //being// plus the past participle of a full verb, e.g. //I am being watched//, //the house was being built.// Searching for such constructions is done best by the use of tags (see Lesson 4).+Verbs in the progressive passive tense are formed by the auxiliary verb //be// followed by the present participle form //being// plus the past participle of a full verb, e.g. //I am being watched//, //the house was being built.// Searching for such constructions is done best by the use of tags (see [[en:obc:spell3|Lesson 4]]).
  
-For the auxiliary verb, we need to search for //am//, //are//, //is//, ''was ''and //were// (if we wish to include both present and past progressive passive) – tagged as VBM, VBR, VBZ, VBDZ and VBDR respectively. The tags should be used in the query instead of the full forms of the verbs, as the tags encompass the contracted forms as well as any unusual spellings which you would not be able to find just by searching for the full forms. In this case, it is not advisable to use all tags starting with V (using e.g. “V.*” expression), as the concordance would then include other verb forms as well. Rather, it is necessary to type out all the tags and separate them with the vertical bar |, which can be used inside the token:+For the auxiliary verb, we need to search for //am//, //are//, //is//, ''was ''and //were// (if we wish to include both present and past progressive passive) – tagged as VBM, VBR, VBZ, VBDZ and VBDR respectively. The tags should be used in the query instead of the full forms of the verbs, as the tags encompass the contracted forms as well as any unusual spellings which you would not be able to find just by searching for the full forms. In this case, it is not advisable to use all tags starting with V (using e.g. ''“V.*”'' expression), as the concordance would then include other verb forms as well. Rather, it is necessary to type out all the tags and separate them with the vertical bar |, which can be used inside the token:
  
-[tag="VBM|VBR|VBZ|VBDZ|VBDR"]+''[tag="VBM|VBR|VBZ|VBDZ|VBDR"]''
  
 The following element is //being//, which is invariable: The following element is //being//, which is invariable:
  
-[tag="VBM|VBR|VBZ|VBDZ|VBDR"] [word="being"]+''[tag="VBM|VBR|VBZ|VBDZ|VBDR"] [word="being"]''
  
-Alternatively, you can use the tag VBG ([tag="VBG"]) instead of the word being.+Alternatively, you can use the tag VBG (''[tag="VBG"]'') instead of the word //being//.
  
 For the lexical verb, we are looking for all past participles. According to the tagset, this verb form is tagged either as VVN or VVNK. Hence, we can use the shortened version VVN.*. The resulting query should look like this: For the lexical verb, we are looking for all past participles. According to the tagset, this verb form is tagged either as VVN or VVNK. Hence, we can use the shortened version VVN.*. The resulting query should look like this:
  
-[tag="VBM|VBR|VBZ|VBDZ|VBDR"] [word="being"] [tag="VVN.*"]+''[tag="VBM|VBR|VBZ|VBDZ|VBDR"] [word="being"] [tag="VVN.*"]''
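To see why the alternation and the ''VVN.*'' shorthand behave as described, the query can be imitated outside the corpus tool. The following is only a minimal sketch, not how KonText evaluates CQL internally: it assumes that each attribute value is treated as a regular expression that must match the whole tag or word, and the sample tokens (including the tags ''AT'' and ''NN1'') and the helper ''find_matches'' are made up for illustration.

```python
import re

# Hypothetical tagged tokens as (word, tag) pairs; only VBDZ, VBG and VVN
# are tags named in the lesson, the rest are illustrative assumptions.
tokens = [
    ("the", "AT"),
    ("house", "NN1"),
    ("was", "VBDZ"),
    ("being", "VBG"),
    ("built", "VVN"),
]

# One (attribute, regex) pair per query position, mirroring
# [tag="VBM|VBR|VBZ|VBDZ|VBDR"] [word="being"] [tag="VVN.*"]
QUERY = [
    ("tag", "VBM|VBR|VBZ|VBDZ|VBDR"),
    ("word", "being"),
    ("tag", "VVN.*"),
]


def find_matches(tokens, query):
    """Return every token sequence whose positions all satisfy the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        # Each pattern must match the whole attribute value (assumption).
        if all(
            re.fullmatch(pattern, word if attr == "word" else tag)
            for (attr, pattern), (word, tag) in zip(query, window)
        ):
            hits.append(" ".join(word for word, _ in window))
    return hits


print(find_matches(tokens, QUERY))
```

Replacing the first pattern with ''V.*'' in this sketch would also accept forms of //have// or //do// in the first position, which is the over-matching the lesson warns about.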
  
-If you wish to see an overview of the structural attributes of the whole concordance along with their frequencies, click on Frequency → Text Types.+If you wish to see an overview of the structural attributes of the whole concordance along with their frequencies, click on //Frequency → Text Types//.
  
 {{:en:obc:l5_1.png?direct&600|}} {{:en:obc:l5_1.png?direct&600|}}
  
-This will provide you with lists of metainformation with their frequencies. For example, under the utterance.text_decade column, you can see in which decades the progressive passive was used most frequently:+This will provide you with lists of metainformation with their frequencies. For example, under the //utterance.text_decade// column, you can see in which decades the progressive passive was used most frequently:
  
 {{:en:obc:l5_2.png?direct&400|}} {{:en:obc:l5_2.png?direct&400|}}
  
-It is important to note here, that some of the utterances are not tagged fully; in this case, there are 48 utterances that are missing the information about the decade in which they were written. You can use the negative filter (p / **n**) to discard them and work only with the fully annotated data.+It is important to note here, that some of the utterances are not tagged fully; in this case, there are 48 utterances that are missing the information about the decade in which they were written. You can use the negative filter (p///n//) to discard them and work only with the fully annotated data.
  
 By clicking on the header of each column, you can change the sorting – alphabetically according to the labels of that attribute (here decades), according to the frequency or i.p.m. Here i.p.m. (Items Per Million) indicates the relative frequency of the given form in relation to the overall size of the part of the corpus tagged with the respective value of the structural attribute (e.g. in this case the number of occurrences per million tokens in each decade). The relative frequency allows for comparison of the number of occurrences in differently-sized parts of the corpus. By clicking on the header of each column, you can change the sorting – alphabetically according to the labels of that attribute (here decades), according to the frequency or i.p.m. Here i.p.m. (Items Per Million) indicates the relative frequency of the given form in relation to the overall size of the part of the corpus tagged with the respective value of the structural attribute (e.g. in this case the number of occurrences per million tokens in each decade). The relative frequency allows for comparison of the number of occurrences in differently-sized parts of the corpus.
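The i.p.m. value can be reproduced by hand. The sketch below only restates the definition given above (occurrences per million tokens of the relevant part of the corpus); the function name ''ipm'' and the figures in the example are invented for illustration and are not taken from the corpus.

```python
def ipm(hits: int, subcorpus_tokens: int) -> float:
    """Relative frequency: occurrences per million tokens of the subcorpus."""
    if subcorpus_tokens <= 0:
        raise ValueError("subcorpus size must be positive")
    return hits * 1_000_000 / subcorpus_tokens


# Invented figures: the larger subcorpus has more hits in absolute terms,
# yet the smaller one shows the higher relative frequency.
print(ipm(50, 2_000_000))  # 50 hits in 2,000,000 tokens
print(ipm(30, 500_000))    # 30 hits in 500,000 tokens
```

With these made-up numbers, the second subcorpus ranks higher once the sorting is switched from absolute frequency to i.p.m., which is why the relative figure is the one to use when the decades differ in size.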
- 
-{{:en:obc:l5_3.png?direct&400|}} 
  
 By changing the sorting to according to i.p.m. (marked by the little blue arrow), we can prove that the passive progressive tense was an innovation indeed, as the decades are in an almost perfect chronological order: By changing the sorting to according to i.p.m. (marked by the little blue arrow), we can prove that the passive progressive tense was an innovation indeed, as the decades are in an almost perfect chronological order:
Line 86: Line 86:
 If you wish to see the metadata of individual occurrences in the corpus, click on the blue ID number at the beginning of the line when viewing the concordance. If you wish to see the metadata of individual occurrences in the corpus, click on the blue ID number at the beginning of the line when viewing the concordance.
  
-{{Obrázek_4.png|Obrázek_4.png Obrázek_4.png}} +{{:en:obc:l5_3.png?direct&400|}}
- +
-Here you can see all the information available for the given utterance. As was mentioned above, some information may be missing. You can access the whole text of the proceeding including the scan of the original publication by clicking on the link under **text.url**. +
- +
-<html><u></html>Task<html></u></html>* Try searching for all occurrences of the '''split infinitive **(e.g. //to immediately follow//)** '''and '''double comparative '''(e.g. //more commoner//+
- +
-<HTML><ul></HTML> +
-<HTML><li></HTML><HTML><ul></HTML> +
-<HTML><li></HTML>Make sure the query type is set to CQL<HTML></li></HTML> +
-<HTML><li></HTML>Make use of the tags from the tagset<HTML></li></HTML> +
-<HTML><li></HTML>Look at the text types list and find when, in which contexts (e.g. type of offence) and by whom these structures were most frequently used<HTML></li></HTML><HTML></ul></HTML> +
-<HTML></li></HTML><HTML></ul></HTML> +
- +
-Solution: * [[https://kontext.korpus.cz/view?q=~jIj4JkJbEVZs|Split infinitive]]:+
  
-<HTML><ul></HTML> +Here you can see all the information available for the given utterance. As was mentioned above, some information may be missing. You can access the whole text of the proceeding including the scan of the original publication by clicking on the link under //text.url//.
-<HTML><li></HTML><HTML><ul></HTML> +
-<HTML><li></HTML>Query: [word="to"] [tag="RR"] [tag="VVI"]<HTML></li></HTML> +
-<HTML><li></HTML>Frequency → Text Types<HTML></li></HTML><HTML></ul></HTML> +
-<HTML></li></HTML><HTML></ul></HTML>+
  
-<HTML+<WRAP round help 40%
-<div style="margin-left:1.905cm;margin-right:0cm;"> +**Task:**
-</HTML> +
-{{Obrázek_2.png|fig:Obrázek_2.png}}{{Obrázek_5.png|fig:Obrázek_5.png}}+
  
-<HTML> +    * Try searching for all occurrences of the **split infinitive** (e.g. //to immediately follow//) and **double comparative** (e.g. //more commoner//) 
-</div> +    * Make sure the query type is set to CQL 
-</HTML> +    * Make use of the tags from the tagset 
-  * [[https://kontext.korpus.cz/view?q=~tAru1r2aLncK|Double comparative]]: +    * Look at the text types list and find when, in which contexts (e.g. type of offence) and by whom these structures were most frequently used 
-    * Query: [word="more"] [tag="JJR"] +</WRAP>
-    * Frequency → Text Types+
  
-<HTML> +You will find the solution [[en:obc:solution#lesson_5|here]]
-<div style="margin-left:1.905cm;margin-right:0cm;"> +
-</HTML> +
-{{Obrázek_7.png|fig:Obrázek_7.png}}{{Obrázek_6.png|fig:Obrázek_6.png}}+
  
-<HTML> +----
-</div> +
-</HTML>+
  
 +**If you are ready, you can continue to [[en:obc:specific_query|Lesson 6]].**
  
 +----