CzeSL-SGT-basic – a corpus of non-native Czech with simplified search options

The CzeSL-SGT-basic corpus is based on the CzeSL-SGT corpus (Czech as a Second Language with Spelling, Grammar and Tags), which includes transcriptions of essays written by non-native speakers of Czech, extending the “foreign” (ciz) part of the CzeSL-plain corpus with texts collected in 2013. The two corpora differ in the options available in the search interface: CzeSL-SGT-basic offers a reduced set of metadata items.

Word forms are tagged with word class, morphological categories and base forms (lemmas). Some forms are corrected and the resulting texts are tagged again. Original and corrected forms are then compared and error labels are assigned. All annotation is assigned automatically, so some inaccuracy is inevitable.

Most texts come with metadata about the author and the text. The reduced search interface includes five author-related items – sex (doc.s_pohlavi), age category (doc.s_vek_kat), first language (doc.s_jazyk1), proficiency level (doc.s_cj_SERR) and knowledge of Czech among members of the family (doc.s_cj_v_rodine) – and one text-related item: medium, i.e. manuscript or machine-readable text (doc.t_medium). The full set of metadata items remains available through the within clause of a CQL query.
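As an illustration of the within clause, the following minimal Python sketch builds a CQL query string restricted by document metadata. It is not part of the corpus tooling: the helper name cql_within_doc is hypothetical, the attribute values "ru" and "manuscript" are placeholders rather than confirmed value codes, and the `<doc attr="value" & ... />` form is assumed to be accepted by the search interface.

```python
def cql_within_doc(token_query: str, **doc_attrs: str) -> str:
    """Append a `within <doc ... />` clause to a CQL token query.

    Keyword arguments are document-level attribute names without the
    `doc.` prefix (e.g. s_jazyk1 for doc.s_jazyk1) mapped to the values
    they must equal. The values used below are placeholders only.
    """
    if not doc_attrs:
        # Nothing to restrict: return the query unchanged.
        return token_query
    # Sort attributes so that the generated query is deterministic.
    conditions = " & ".join(
        f'{name}="{value}"' for name, value in sorted(doc_attrs.items())
    )
    return f"{token_query} within <doc {conditions} />"


# Hypothetical example: the word form "pes" in texts whose author has a
# given first language and which were written in a given medium.
query = cql_within_doc('[word="pes"]', s_jazyk1="ru", t_medium="manuscript")
print(query)
```

The resulting string can be pasted into the CQL query field of the search interface; the actual value codes for each metadata item should be checked in the interface before use.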

The corpus is available for on-line searching using the search interface of the Czech National Corpus.

For more about the CzeSL-SGT corpus, see http://utkl.ff.cuni.cz/%7Erosen/public/2014-czesl-sgt-en.pdf.
