| en:cnk:orator [2025/06/06 13:34] – [Corpus composition and data acquisition] martinawaclawicova | en:cnk:orator [2025/06/06 13:40] (current) – [Morphological tagging of the ORATOR corpus] martinawaclawicova |
|---|
| The ORATOR v3 corpus is automatically [[en:pojmy:tag|annotated]] with [[en:cnk:syn2020#morphological_tagging|a new morphological tag]] according to the SYN2020 standard. It recognizes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|aggregates]] (e.g., //vidělas//, //zač//), uses [[en:cnk:syn2020|double-level lemmatization]], and has a verb tag ([[en:cnk:syn2020#verb_tagging_verbtag|verbtag]]). | The ORATOR v3 corpus is automatically [[en:pojmy:tag|annotated]] with [[en:cnk:syn2020#morphological_tagging|a new morphological tag]] according to the SYN2020 standard. It recognizes [[en:cnk:syn2020#multiple_lemmatization_and_tagging_aggregate|aggregates]] (e.g., //vidělas//, //zač//), uses [[en:cnk:syn2020|double-level lemmatization]], and has a verb tag ([[en:cnk:syn2020#verb_tagging_verbtag|verbtag]]). |
| |
| Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus. Special variants of words are distinguished by their own sublemma (e.g., //poslúchat// under the lemma //poslouchat//); special forms tagged only in the spoken corpus have the number 9 in the last tag position (e.g., the form //jezdijó// has the tag ''%%VB-P---3P-AAI-9%%''). | Substandard variants and forms typical of dialects and spontaneous speech are also tagged in the corpus, in the same way as in the ORTOFON corpus (see [[en:cnk:ortofon#morphological_tagging_of_the_ortofon_corpus|Morphological tagging of the ORTOFON corpus]]). |
| |
| The following specific tags are used in the first tag position (word type): | The following specific tags are used in the first tag position (word type): |
| Note: The anonymised sections are specified on a basic level ''%%word%%'': NP – surname, NJ – first name, NN – nickname, NM – place name, NO – other proper names, NT – last two digits of the telephone number. | Note: The anonymised sections are specified on a basic level ''%%word%%'': NP – surname, NJ – first name, NN – nickname, NM – place name, NO – other proper names, NT – last two digits of the telephone number. |
| |
| The ORAL v1, ORTOFON v1 and ORTOFON v2 corpora are tagged with the prior morphological tagset used until 2020. Detailed information on the annotation of these previously published corpora can be found on a [[en:cnk:lemtag_mluv|separate page]]. | The ORATOR v2 corpus is tagged with the prior morphological tagset used until 2020. Detailed information on the annotation of this previously published corpus can be found on a [[en:cnk:lemtag_mluv|separate page]]. |
| |
| ====== ORATOR v1 (2019) ====== | ====== ORATOR v1 (2019) ====== |