en:cnk:syn2020 [2020/12/27 12:18] – [SYN2020 Corpus] michalkren | en:cnk:syn2020 [2022/06/09 13:36] (current) – [How to cite SYN2020] jankrivan |
---|
==== Multiple lemmatization and tagging (aggregate) ==== | ==== Multiple lemmatization and tagging (aggregate) ==== |
| |
In the SYN2020 corpus, **multiple lemmas and tags** are newly introduced for a special group of words, so-called **aggregates**. Aggregates are words that are written as one orthographic word in Czech but behave as two (exceptionally three) words from the point of view of syntax or the specification of grammatical categories. Aggregates include conditional conjunctions (//aby//, //kdyby//), combinations of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), combinations of prepositions with some pronouns (//nač//, //očpak//, //zaň//), and combinations of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time, one for each part. For detailed information on aggregates, see the aggregate page. | In the SYN2020 corpus, **multiple lemmas and tags** are newly introduced for a special group of words, so-called **aggregates** ("multiword tokens" in the [[https://universaldependencies.org/|Universal Dependencies]] terminology). Aggregates are words that are written as one orthographic word in Czech but behave as two (exceptionally three) words from the point of view of syntax or the specification of grammatical categories. Aggregates include conditional conjunctions (//aby//, //kdyby//), combinations of words with the enclitic form //s// (//dělalas//, //viděls//, //komus//, //vždyťs//), combinations of prepositions with some pronouns (//nač//, //očpak//, //zaň//), and combinations of the last two types (//načs//). For each of these words, two (or three) lemmas, sublemmas, tags and verbtags are specified at the same time, one for each part. For detailed information on aggregates, see the aggregate page. |
| |
| ==== Automatic corpus annotation ==== |
| For SYN2020, the entire annotation process is automatic. A detailed description, including the annotation accuracy and a rich bibliography on both the tools and the data, can be found on a [[cnk:syn2020:automaticka_anotace|dedicated page]] (Czech only). |
| |
====== How to cite SYN2020 ====== | ====== How to cite SYN2020 ====== |
<WRAP round tip 70%> | <WRAP round tip 70%> |
Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //SYN2020: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020. Available at: http://www.korpus.cz | Křen, M. – Cvrček, V. – Henyš, J. – Hnátková, M. – Jelínek, T. – Kocek, J. – Kováříková, D. – Křivan, J. – Milička, J. – Petkevič, V. – Procházka, P. – Skoumalová, H. – Šindlerová, J. – Škrabal, M.: //SYN2020: reprezentativní korpus psané češtiny//. Ústav Českého národního korpusu FF UK, Praha 2020. Available at: http://www.korpus.cz |
| |
| Jelínek, T. – Křivan, J. – Petkevič, V. – Skoumalová, H. – Šindlerová, J. (2021): [[https://doi.org/10.1007/978-3-030-83527-9_4|SYN2020: A new corpus of Czech with an innovated annotation]]. In: K. Ekštein – F. Pártl – M. Konopík (eds.), //Text, Speech, and Dialogue.// TSD 2021. Lecture Notes in Computer Science, vol. 12848. Cham: Springer, 48–59. |
| |
| Křivan, J. – Šindlerová, J. (2022): [[http://sas.ujc.cas.cz/archiv.php?lang=en&art=4508|Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu]]. //Slovo a slovesnost//, 83, 2/2022, 122–145. |
| |
</WRAP> | </WRAP> |
| |