Annotation of Multiword Expressions
Specialized tools are being developed for the automatic identification of multiword expressions (phrasemes and collocations) in corpora.
MWE lemmatization and tagging
Starting with the SYNv14 corpus, multiword expressions in corpora are annotated with new lemmas and tags linked to the MWE database LEMUR (Lexicon of Multiword Expressions). The tagging is currently a pilot version and builds on the older phraseme annotation method, which used the FRANTA tool (in Czech).
Automatic tagging of MWEs has several shortcomings. First, it does not claim to be exhaustive: many expressions are not included in the database. Second, some expressions may not be found at all (for example, because a non-standard realization was not detected). Conversely, a literal use may be marked as a phraseme, e.g. Kocour si líže rány, které mu způsobil sousedův pes. ('The tomcat is licking the wounds that the neighbour's dog inflicted on him.').
Two attributes are used for the annotation, mwe_lemma and mwe_tag:
- mwe_lemma (multiword expression lemma): lemma of an MWE in the form of a dictionary entry in its basic form (nominative singular, infinitive, etc.); individual word forms are separated by an underscore, so a specific value of the mwe_lemma attribute is, for example, bít_se_jako_lev. The entry may cover multiple lexical variants of the same MWE, e.g. the mwe_lemma bít_se_jako_lev includes the variants bít se jako lev, rvát se jako lev and bránit se jako lev.
- mwe_tag (multiword expression tag): positional tag of an MWE consisting of 10 positions. For details see the list of mwe_tag values (in Czech).
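The format of the two attributes can be illustrated with a minimal Python sketch. It only assumes what is stated above: word forms in mwe_lemma are joined by underscores, one lemma may cover several variants, and mwe_tag has 10 positions. The helper names and the layout of the variant table are hypothetical and do not reflect how LEMUR itself stores its data.

```python
from typing import Dict, List, Optional

# Number of positions in an mwe_tag, as stated in the description above.
MWE_TAG_POSITIONS = 10

# Variants named in the text for one lemma; the dict layout is an assumption
# made for illustration, not the actual LEMUR format.
VARIANTS: Dict[str, List[str]] = {
    "bít_se_jako_lev": [
        "bít se jako lev",
        "rvát se jako lev",
        "bránit se jako lev",
    ],
}


def split_mwe_lemma(mwe_lemma: str) -> List[str]:
    """Split an underscore-separated mwe_lemma into its word forms."""
    return mwe_lemma.split("_")


def lemma_for_variant(variant: str) -> Optional[str]:
    """Return the mwe_lemma covering a variant, or None if it is not listed."""
    for lemma, variants in VARIANTS.items():
        if variant in variants:
            return lemma
    return None


def has_expected_tag_length(mwe_tag: str) -> bool:
    """Check only the length of a tag; the position values are not validated."""
    return len(mwe_tag) == MWE_TAG_POSITIONS


print(split_mwe_lemma("bít_se_jako_lev"))     # ['bít', 'se', 'jako', 'lev']
print(lemma_for_variant("rvát se jako lev"))  # bít_se_jako_lev
```

The length check deliberately says nothing about which values are valid at each position; those are defined in the list of mwe_tag values.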
Older method of MWE lemmatization and tagging
The FRANTA tool (FRazémová ANotace a Textová Analýza, 'Phraseme annotation and text analysis') was used for MWE annotation in the SYN corpora (versions 4-13). More detailed information is available on the specialized page (in Czech).