en:cnk:czesl-sgt-basic [2019/10/31 19:15] – created alexandrrosen | en:cnk:czesl-sgt-basic [2019/10/31 19:40] (current) – alexandrrosen |
====== CzeSL-SGT-basic – a corpus of non-native Czech with simplified search options ======
| |
The CzeSL-SGT-basic corpus is based on the CzeSL-SGT corpus (//**Cze**ch as a **S**econd **L**anguage with **S**pelling, **G**rammar and **T**ags//), which includes transcriptions of essays written by non-native speakers of Czech and extends the “foreign” (ciz) part of the [[cnk:CzeSL-plain]] corpus with texts collected in 2013. The two corpora differ in the options available in the search interface: CzeSL-SGT-basic offers a reduced set of metadata items.
| |
Word forms are tagged with word class, morphological categories and base forms (lemmas). Some forms are corrected, and the corrected texts are tagged again. Original and corrected forms are then compared and error labels are assigned. The annotation is assigned automatically, so it inevitably contains some errors.
| |
Most texts come with metadata about the author and the text. The reduced search interface includes five author-related items – sex (''doc.s_pohlavi''), age category (''doc.s_vek_kat''), first language (''doc.s_jazyk1''), proficiency level (''doc.s_cj_SERR'') and knowledge of Czech among family members (''doc.s_cj_v_rodine'') – and one text-related item: medium, i.e. manuscript or machine-readable text (''doc.t_medium''). The full set of metadata items is still available through the ''within'' clause of a CQL query.
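As a rough illustration of the ''within'' clause mentioned above, the following Python sketch builds a CQL query restricted by document metadata. The helper name ''with_metadata'', the example query and the attribute values (''ru'', ''manuscript'') are hypothetical placeholders; the values actually stored in the corpus should be checked in the search interface. The sketch also assumes that several conditions can be joined with ''&'' inside a single ''<doc ... />'' element.

```python
def with_metadata(query: str, filters: dict[str, str]) -> str:
    """Append a CQL `within` clause that restricts `query` by document metadata.

    Keys use the dotted names shown in the interface (e.g. "doc.s_jazyk1").
    Values are passed through verbatim; they are placeholders here, not
    values confirmed for this corpus.
    """
    conditions = []
    for name, value in filters.items():
        # Split "doc.s_jazyk1" into the structure ("doc") and the attribute.
        structure, _, attribute = name.partition(".")
        if structure != "doc" or not attribute:
            raise ValueError(f"expected a doc.<attribute> name, got {name!r}")
        conditions.append(f'{attribute}="{value}"')
    if not conditions:
        # Nothing to restrict: return the query unchanged.
        return query
    # Assumption: conditions on one structure are combined with "&".
    return f'{query} within <doc {" & ".join(conditions)} />'


if __name__ == "__main__":
    # Hypothetical query and metadata values, for illustration only.
    print(with_metadata('[lemma="škola"]',
                        {"doc.s_jazyk1": "ru", "doc.t_medium": "manuscript"}))
```

The resulting string can be pasted into the CQL query field of the search interface. The same pattern should in principle work for metadata items that the reduced interface does not offer, provided the attribute name is known.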
| |
The corpus can be searched online through the [[http://www.korpus.cz/kontext|search interface]] of the Czech National Corpus.