Specialized tools are being developed for the automatic identification of multiword expressions (phrasemes and collocations) in corpora.
Starting with the SYNv14 corpus, multiword expressions are annotated in corpora using new lemmas and tags linked to LEMUR (Lexicon of Multiword Expressions), a database of MWEs. The tagging is currently in a pilot version and builds on the older phraseme annotation method based on the FRANTA tool (in Czech).
Automatic tagging of MWEs has some shortcomings. First, it does not claim to be exhaustive: many expressions are not included in the database. Second, some expressions that are in the database may still not be found (for example, because a non-standard realization has not been detected). Conversely, a literal use of a word combination may be wrongly marked as a phraseme, e.g. Kocour si líže rány, které mu způsobil sousedův pes. ('The tomcat is licking the wounds that the neighbour's dog inflicted on him.')
Two attributes are used for the annotation: mwe_lemma and mwe_tag.
mwe_lemma: the lemma of the multiword expression, e.g. bít_se_jako_lev. An entry may include multiple lexical variants of the same MWE, e.g. the mwe_lemma bít_se_jako_lev covers the variants bít se jako lev ('fight like a lion'), rvát se jako lev and bránit se jako lev.
The FRANTA tool (FRazémová ANotace a Textová Analýza, 'Phraseme annotation and text analysis') was used for MWE annotation in the SYN corpora (versions 4-13). More detailed information is available on the specialized page (in Czech).
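The way several lexical variants share a single mwe_lemma can be illustrated with a minimal Python sketch. The dictionary layout and the names MWE_LEXICON, VARIANT_TO_LEMMA and lookup_mwe_lemma are assumptions made for illustration only; they do not reflect the actual format of the LEMUR database or of the corpus annotation. Only the example lemma and its three variants come from the text above.

```python
# Hypothetical in-memory lexicon: one mwe_lemma groups several lexical variants.
# This structure is an assumption for illustration, not the LEMUR format.
MWE_LEXICON = {
    "bít_se_jako_lev": [
        "bít se jako lev",
        "rvát se jako lev",
        "bránit se jako lev",
    ],
}

# Invert the lexicon so that each variant points to its shared mwe_lemma.
VARIANT_TO_LEMMA = {
    variant: mwe_lemma
    for mwe_lemma, variants in MWE_LEXICON.items()
    for variant in variants
}


def lookup_mwe_lemma(lemmatized_phrase):
    """Return the mwe_lemma for a lemmatized phrase, or None if it is not listed."""
    return VARIANT_TO_LEMMA.get(lemmatized_phrase)


print(lookup_mwe_lemma("rvát se jako lev"))  # bít_se_jako_lev
print(lookup_mwe_lemma("spát jako lev"))     # None (not in this toy lexicon)
```

A lookup that returns None mirrors the first shortcoming mentioned earlier: an expression missing from the lexicon simply receives no MWE annotation.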