Annotation of Multiword Expressions

Specialized tools are being developed for the automatic identification of multiword expressions (phrasemes and collocations) in corpora.

MWE lemmatization and tagging

Starting with the SYNv14 corpus, multiword expressions are annotated in corpora with new lemmas and tags linked to the MWE database LEMUR (Lexicon of Multiword Expressions). The tagging is currently a pilot version and builds on the older phraseme annotation method that used the FRANTA tool (in Czech).

Automatic tagging of MWEs has some shortcomings. First, the annotation does not claim to be exhaustive: many expressions are not included in the database and are therefore never tagged. Second, an expression that is in the database may still be missed (for example, because its non-standard realization has not been detected). Conversely, a literal use of the same words may be wrongly marked as a phraseme (e.g. Kocour si líže rány, které mu způsobil sousedův pes. ‘The tomcat is licking the wounds that the neighbour's dog inflicted on him.’).

Two attributes are used for the annotation: mwe_lemma and mwe_tag.
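As a rough illustration of how these two attributes might be used downstream, the sketch below groups the tokens of one sentence by their MWE annotation. Everything about the data layout is an assumption made for the example: the token records, the empty string for tokens outside any MWE, the lemma value "lízat si rány" and the tag value "PLACEHOLDER" are hypothetical and do not reproduce the actual format used in SYNv14 or LEMUR.

```python
from collections import defaultdict

# Hypothetical token records for illustration only; the real corpus format,
# lemma values, and tag values may differ.
# Assumption: tokens outside any MWE carry empty mwe_lemma and mwe_tag.
sentence = [
    {"word": "Kocour", "mwe_lemma": "", "mwe_tag": ""},
    {"word": "si", "mwe_lemma": "lízat si rány", "mwe_tag": "PLACEHOLDER"},
    {"word": "líže", "mwe_lemma": "lízat si rány", "mwe_tag": "PLACEHOLDER"},
    {"word": "rány", "mwe_lemma": "lízat si rány", "mwe_tag": "PLACEHOLDER"},
]


def group_mwe_tokens(tokens):
    """Collect word forms that share the same (mwe_lemma, mwe_tag) pair."""
    groups = defaultdict(list)
    for tok in tokens:
        # Skip tokens that are not part of any annotated MWE.
        if tok["mwe_lemma"]:
            groups[(tok["mwe_lemma"], tok["mwe_tag"])].append(tok["word"])
    return dict(groups)


if __name__ == "__main__":
    for (lemma, tag), words in group_mwe_tokens(sentence).items():
        print(lemma, tag, words)
```

Grouping by the attribute pair only recovers which tokens were tagged together; it cannot tell an idiomatic use from a literal one, so manual checking of the results is still needed.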

Older method of MWE lemmatization and tagging

The FRANTA tool (FRazémová ANotace a Textová Analýza, ‘Phraseme annotation and text analysis’) was used for MWE annotation in the SYN corpora (versions 4–13). More detailed information is available on a dedicated page (in Czech).