
Differences

This shows you the differences between two versions of the page.


en:cnk:czesl-plain [2018/08/07 12:47]
Alexandr Rosen metadata missing alert
en:cnk:czesl-plain [2018/08/07 12:52] (current)
Alexandr Rosen
Line 21: Line 21:
  
  
-Texts of non-native speakers (the **ciz** part), extended by some newer texts, are available as the CzeSL-sgt corpus, together with metadata and automatically performed morphosyntactic and error annotation, including the identification of incorrect forms. The CzeSL-plain corpus is also available from the LINDAT-Clarin repository as AKCES 3 AKCES4. See also CzeSL – a Learner Corpus of Czech
+Texts of non-native speakers (the **ciz** part), extended by some newer texts, are available as the [[cnk:czesl-sgt]] corpus, together with metadata and automatically performed morphosyntactic and error annotation, including the identification of incorrect forms. The **CzeSL-plain** corpus is also available from the LINDAT-Clarin repository as [[https://lindat.mff.cuni.cz|LINDAT-Clarin]] as [[http://hdl.handle.net/11858/00-097C-0000-000C-2112-B|AKCES 3]] and [[http://hdl.handle.net/11858/00-097C-0000-000C-2293-0|AKCES4]]. See also [[http://utkl.ff.cuni.cz/learncorp/|CzeSL – a Learner Corpus of Czech]].
-
+
-Although the CzeSL-plain corpus does not contain any linguistic annotation at the moment, its next release will include more texts (the corpus is thus non-reference) and provide automatic identification of incorrect forms and morphosyntactic tags. Some of the texts included in the CzeSL-plain corpus are annotated by correct forms, error labels, morphosyntactic tags and lemmas and are due for release under a different purpose-built search interface.
+
  
 The number of characters in the corpus is somewhat higher than in the original texts because of codes used in the transcription of the manuscripts and the encoding of some foreign and non-standard characters. For example, the string … represents omission (...), //&priv;// indicates anonymization (of a proper name), //&img;// marks the place where there was a picture in the manuscript, //&unclear;// stands for an unrecognized word or passage, //&rdot;// is a string indicating the character r with a dot above, etc.
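
The effect of these codes on character counts can be illustrated with a minimal Python sketch. It is not part of the corpus tooling: the function names and the bracketed replacement strings are hypothetical, only the four codes named above are covered, and mapping //&rdot;// to U+1E59 is an assumption based on the description "r with a dot above".

```python
import re
from collections import Counter

# Codes named in the text above. The replacement strings are illustrative
# assumptions, not an official mapping for the corpus.
TRANSCRIPTION_CODES = {
    "&priv;": "[anonymized]",   # anonymized proper name
    "&img;": "[picture]",       # place of a picture in the manuscript
    "&unclear;": "[unclear]",   # unrecognized word or passage
    "&rdot;": "\u1e59",         # assumed: LATIN SMALL LETTER R WITH DOT ABOVE
}

# Matches entity-like codes such as "&priv;".
CODE_PATTERN = re.compile(r"&[A-Za-z]+;")


def count_codes(text):
    """Count how often each entity-like code occurs in the text."""
    return Counter(CODE_PATTERN.findall(text))


def decode_codes(text, table=TRANSCRIPTION_CODES):
    """Replace known codes; codes missing from the table are left unchanged."""
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


sample = "Jmenuji se &priv; a &unclear; &foo;"
print(count_codes(sample))
print(decode_codes(sample))
```

Leaving unknown codes untouched is a deliberate choice here, since the list above ends with "etc." and the full inventory of codes is not given.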