Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings

Language Modelling Makes Sense (LMMS)

This repository includes the code related to the "LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond" paper.

If you're interested in the code for the original LMMS paper from ACL 2019, see the LMMS_ACL19 branch.

This code is designed to use the transformers package (v3.0.2) and the fairseq package (v0.9.0, needed only for RoBERTa models; see the paper for details).

Installation

Prepare Environment

This project was developed on Python 3.6.5 from Anaconda distribution v4.6.2. As such, the pip requirements assume you already have packages that are included with Anaconda (numpy, etc.). After cloning the repository, we recommend creating and activating a new environment to avoid any conflicts with existing installations in your system:

$ git clone https://github.com/danlou/LMMS.git
$ cd LMMS
$ conda create -n LMMS python=3.6.5
$ conda activate LMMS
# $ conda deactivate  # to exit environment when done with project

Additional Packages

To install additional packages used by this project run:

$ pip install -r requirements.txt

The WordNet package for NLTK isn't installed by pip, but we can install it easily with:

$ python -c "import nltk; nltk.download('wordnet')"

External Data

If you want to evaluate the sense embeddings on WSD or USM, you need the WSD Evaluation Framework.

$ cd external/wsd_eval  # from repo home
$ wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
$ unzip WSD_Evaluation_Framework.zip

For evaluation on the WiC dataset:

$ cd external/wic  # from repo home
$ wget https://pilehvar.github.io/wic/package/WiC_dataset.zip
$ unzip WiC_dataset.zip

Details about downloading GWCS and our WordNet subset of SID will be added soon.

If you want to represent embeddings using annotations from UWA, you must download SemCor+UWA10 from this link, extract the .zip, and place the folder in external/uwa/.

Download Sense Embeddings

You can download the main LMMS-SP embeddings we produced for the paper from figshare.

These sense embeddings should be used with the Transformer models of the same model name.

Tasks comparing or combining LMMS-SP embeddings with contextual embeddings need to also use the corresponding sets of layer weights in data/weights/ (specific to each Sense Profile).

We distribute sense embeddings as '.txt' files, in the standard GloVe format.

Place downloaded sense embeddings in data/vectors/<model_name>/.
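
As an illustration of the GloVe text format mentioned above, here is a minimal sketch of a loader. It assumes each line holds a label (a sensekey or synset name, without spaces) followed by space-separated floats. The function name `load_txt_vectors` and the toy two-dimensional values are hypothetical and not part of this repository.

```python
import io


def load_txt_vectors(lines):
    """Parse GloVe-style lines of the form '<label> <v1> <v2> ... <vn>'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        label, values = parts[0], parts[1:]
        vectors[label] = [float(v) for v in values]
    return vectors


# Toy in-memory "file"; real LMMS-SP vectors have many more dimensions.
toy_file = io.StringIO("earth%1:17:00:: 0.1 0.2\nglobe%1:17:00:: 0.3 0.4\n")
toy_vectors = load_txt_vectors(toy_file)
print(sorted(toy_vectors))
```

For a downloaded file, the same function can be given an open file handle, e.g. `with open(path, encoding='utf-8') as f: vectors = load_txt_vectors(f)`.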

Create Sense Embeddings

The creation of LMMS-SP sense embeddings involves a series of steps that have corresponding scripts.

Below you'll find usage descriptions for all the scripts along with the exact command to run in order to replicate the results in the paper (for albert-xxlarge-v2, as an example).

These steps assume that layer weights have already been determined for each sense profile. The create_sense_weights.py script can be used to convert layer performance to weights.

1. embed_annotations.py - Bootstrap sense embeddings from annotated corpora

Usage description.

$ python scripts/embed_annotations.py -h
usage: embed_annotations.py [-h] [-nlm_id NLM_ID]
                            [-sense_level {synset,sensekey}]
                            [-weights_path WEIGHTS_PATH]
                            [-eval_fw_path EVAL_FW_PATH] -dataset
                            {semcor,semcor_uwa10} [-batch_size BATCH_SIZE]
                            [-max_seq_len MAX_SEQ_LEN]
                            [-subword_op {mean,first,sum}] [-layers LAYERS]
                            [-layer_op {mean,max,sum,concat,ws}]
                            [-max_instances MAX_INSTANCES] -out_path OUT_PATH

Create sense embeddings from annotated corpora.

optional arguments:
  -h, --help            show this help message and exit
  -nlm_id NLM_ID        HF Transfomers model name (default: bert-large-cased)
  -sense_level {synset,sensekey}
                        Representation Level (default: sensekey)
  -weights_path WEIGHTS_PATH
                        Path to layer weights (default: )
  -eval_fw_path EVAL_FW_PATH
                        Path to WSD Evaluation Framework (default:
                        external/wsd_eval/WSD_Evaluation_Framework/)
  -dataset {semcor,semcor_uwa10}
                        Name of dataset (default: semcor)
  -batch_size BATCH_SIZE
                        Batch size (default: 16)
  -max_seq_len MAX_SEQ_LEN
                        Maximum sequence length (default: 512)
  -subword_op {mean,first,sum}
                        Subword Reconstruction Strategy (default: mean)
  -layers LAYERS        Relevant NLM layers (default: -1 -2 -3 -4)
  -layer_op {mean,max,sum,concat,ws}
                        Operation to combine layers (default: sum)
  -max_instances MAX_INSTANCES
                        Maximum number of examples for each sense (default:
                        inf)
  -out_path OUT_PATH    Path to resulting vector set (default: None)

Example usage:

$ python scripts/embed_annotations.py -nlm_id albert-xxlarge-v2 -sense_level sensekey -dataset semcor_uwa10 -weights_path data/weights/lmms-sp-wsd.albert-xxlarge-v2.weights.txt -layer_op ws -out_path data/vectors/sc_uwa10-sp-wsd.albert-xxlarge-v2.vectors.txt

To represent synsets instead of sensekeys, you may use the option '-sense_level synset'.
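
To make the `-subword_op` and `-layer_op` options more concrete, the sketch below shows one plausible reading of them on plain Python lists. It is not the code from embed_annotations.py: the function names are hypothetical, and we assume that `ws` stands for a weighted sum using the layer weights given with `-weights_path`.

```python
def pool_subwords(subword_vecs, op="mean"):
    """Combine the vectors of a word's subword tokens into one vector."""
    if op == "first":
        return list(subword_vecs[0])
    summed = [sum(dims) for dims in zip(*subword_vecs)]
    if op == "sum":
        return summed
    if op == "mean":
        return [x / len(subword_vecs) for x in summed]
    raise ValueError("unknown subword_op: %s" % op)


def combine_layers(layer_vecs, op="sum", weights=None):
    """Combine one word's vectors from several layers into one vector."""
    if op == "concat":
        return [x for vec in layer_vecs for x in vec]
    if op == "max":
        return [max(dims) for dims in zip(*layer_vecs)]
    if op == "ws":  # assumed meaning: weighted sum, one weight per layer
        if weights is None or len(weights) != len(layer_vecs):
            raise ValueError("ws needs one weight per layer")
        return [sum(w * x for w, x in zip(weights, dims))
                for dims in zip(*layer_vecs)]
    summed = [sum(dims) for dims in zip(*layer_vecs)]
    if op == "sum":
        return summed
    if op == "mean":
        return [x / len(layer_vecs) for x in summed]
    raise ValueError("unknown layer_op: %s" % op)


# Toy example: a word split into two subwords, seen at two layers.
layer_a = pool_subwords([[1.0, 2.0], [3.0, 4.0]], op="mean")
layer_b = pool_subwords([[0.0, 2.0], [2.0, 2.0]], op="mean")
word_vec = combine_layers([layer_a, layer_b], op="ws", weights=[0.25, 0.75])
print(word_vec)
```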

2. extend_sensekeys.py - Propagate supervised representations (from annotations) through WordNet

Usage description.

$ python scripts/extend_sensekeys.py -h
usage: extend_sensekeys.py [-h] -sup_sv_path SUP_SV_PATH
                           [-ext_mode {synset,hypernym,lexname}] -out_path
                           OUT_PATH

Propagates supervised sense embeddings through WordNet.

optional arguments:
  -h, --help            show this help message and exit
  -sup_sv_path SUP_SV_PATH
                        Path to supervised sense vectors
  -ext_mode {synset,hypernym,lexname}
                        Max abstraction level
  -out_path OUT_PATH    Path to resulting extended vector set

Example usage:

$ python scripts/extend_sensekeys.py -sup_sv_path data/vectors/sc_uwa10-sp-wsd.albert-xxlarge-v2.vectors.txt -ext_mode lexname -out_path data/vectors/sc_uwa10-extended-sp-wsd.albert-xxlarge-v2.vectors.txt

To extend synsets instead of sensekeys, use the extend_synsets.py script in a similar fashion.
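
The idea behind `-ext_mode` can be sketched as a fallback chain: a sensekey without annotations borrows a vector from its synset, then from its hypernym, then from its lexname, stopping at the chosen maximum abstraction level. The code below is a rough sketch under that assumption, using a toy inventory in place of WordNet. The function names, the `inventory` structure, and all `toy_`, `syn_`, and `hyp_` labels are hypothetical; see the papers and extend_sensekeys.py for the actual procedure.

```python
from collections import defaultdict


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def extend_vectors(sup_vecs, inventory, ext_mode="lexname"):
    """Give unannotated sensekeys a vector from a more abstract level.

    sup_vecs:  {sensekey: vector} obtained from annotated corpora.
    inventory: {sensekey: {"synset": ..., "hypernym": ..., "lexname": ...}},
               a toy stand-in for WordNet lookups.
    ext_mode:  most abstract level allowed as a fallback.
    """
    levels = ["synset", "hypernym", "lexname"]
    levels = levels[:levels.index(ext_mode) + 1]

    # Synset vectors: average of the annotated sensekeys in each synset.
    grouped = defaultdict(list)
    for sk, vec in sup_vecs.items():
        grouped[inventory[sk]["synset"]].append(vec)
    level_vecs = {"synset": {s: average(v) for s, v in grouped.items()}}

    # Hypernym and lexname vectors: average of the synset vectors below them.
    synset_info = {info["synset"]: info for info in inventory.values()}
    for level in ("hypernym", "lexname"):
        grouped = defaultdict(list)
        for syn, vec in level_vecs["synset"].items():
            grouped[synset_info[syn][level]].append(vec)
        level_vecs[level] = {k: average(v) for k, v in grouped.items()}

    # Fallback chain for sensekeys that have no supervised vector.
    extended = dict(sup_vecs)
    for sk, info in inventory.items():
        if sk in extended:
            continue
        for level in levels:
            vec = level_vecs[level].get(info[level])
            if vec is not None:
                extended[sk] = vec
                break
    return extended


def entry(synset, hypernym):
    return {"synset": synset, "hypernym": hypernym, "lexname": "noun.object"}


toy_inventory = {
    "earth%1:17:00::": entry("syn_earth", "hyp_planet"),
    "globe%1:17:00::": entry("syn_earth", "hyp_planet"),
    "toy_mars": entry("syn_mars", "hyp_planet"),
    "toy_rock": entry("syn_rock", "hyp_material"),
    "toy_sand": entry("syn_sand", "hyp_grain"),
}
toy_sup = {"earth%1:17:00::": [1.0, 0.0], "toy_rock": [0.0, 1.0]}
toy_extended = extend_vectors(toy_sup, toy_inventory, ext_mode="lexname")
print(toy_extended["toy_sand"])
```

In this toy run, `globe%1:17:00::` is covered at the synset level, `toy_mars` at the hypernym level, and `toy_sand` only at the lexname level.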

3. embed_glosses.py - Create sense embeddings based on WordNet's glosses and lemmas

Usage description.

$ python scripts/embed_glosses.py -h
usage: embed_glosses.py [-h] [-nlm_id NLM_ID] [-sense_level {synset,sensekey}]
                        [-subword_op {mean,first,sum}] [-layers LAYERS]
                        [-layer_op {mean,sum,concat,ws}]
                        [-weights_path WEIGHTS_PATH] [-batch_size BATCH_SIZE]
                        [-max_seq_len MAX_SEQ_LEN] -out_path OUT_PATH

Creates sense embeddings based on glosses and lemmas.

optional arguments:
  -h, --help            show this help message and exit
  -nlm_id NLM_ID        HF Transfomers model name
  -sense_level {synset,sensekey}
                        Representation Level
  -subword_op {mean,first,sum}
                        Subword Reconstruction Strategy
  -layers LAYERS        Relevant NLM layers
  -layer_op {mean,sum,concat,ws}
                        Operation to combine layers
  -weights_path WEIGHTS_PATH
                        Path to layer weights
  -batch_size BATCH_SIZE
                        Batch size
  -max_seq_len MAX_SEQ_LEN
                        Maximum sequence length
  -out_path OUT_PATH    Path to resulting vector set

Example usage:

$ python scripts/embed_glosses.py -nlm_id albert-xxlarge-v2 -sense_level sensekey -weights_path data/weights/lmms-sp-wsd.albert-xxlarge-v2.weights.txt -layer_op ws -out_path data/vectors/glosses-sp-wsd.albert-xxlarge-v2.vectors.txt

To represent synsets instead of sensekeys, you may use the option '-sense_level synset'.

For a better understanding of what strings we're actually composing to generate these sense embeddings, here are a few examples:

Sensekey (sk) Embedded String (sk's lemma, all lemmas, tokenized gloss)
earth%1:17:00:: earth - Earth , earth , world , globe - the 3rd planet from the sun ; the planet we live on
globe%1:17:00:: globe - Earth , earth , world , globe - the 3rd planet from the sun ; the planet we live on
disturb%2:37:00:: disturb - disturb , upset , trouble - move deeply
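
The pattern in these examples can be reproduced with a small helper. This is only a sketch of the string layout shown above (sensekey's lemma, all lemmas of the synset, tokenized gloss); the function name `compose_gloss_string` is hypothetical, and the lemmas and gloss tokens are passed in directly rather than looked up in WordNet.

```python
def compose_gloss_string(sk_lemma, all_lemmas, gloss_tokens):
    """Join the three parts with ' - ', lemmas with ' , ', tokens with ' '."""
    return "%s - %s - %s" % (
        sk_lemma,
        " , ".join(all_lemmas),
        " ".join(gloss_tokens),
    )


example = compose_gloss_string(
    "disturb",
    ["disturb", "upset", "trouble"],
    ["move", "deeply"],
)
print(example)
```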

4. merge_avg.py - Merging gloss and extended representations

Usage description.

$ python scripts/merge_avg.py -h
usage: merge_avg.py [-h] -v1_path V1_PATH -v2_path V2_PATH [-v3_path V3_PATH]
                    -out_path OUT_PATH

Averages and normalizes vector .txt files.

optional arguments:
  -h, --help          show this help message and exit
  -v1_path V1_PATH    Path to vector set 1
  -v2_path V2_PATH    Path to vector set 2
  -v3_path V3_PATH    Path to vector set 3. Missing vectors are imputated from
                      v2 (optional)
  -out_path OUT_PATH  Path to resulting vector set

Example usage:

$ python scripts/merge_avg.py -v1_path data/vectors/sc_uwa10-extended-sp-wsd.albert-xxlarge-v2.vectors.txt -v2_path data/vectors/glosses-sp-wsd.albert-xxlarge-v2.vectors.txt -out_path data/vectors/lmms-sp-wsd.albert-xxlarge-v2.vectors.txt
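
For intuition about what "averages and normalizes" can mean for two vector sets, here is a minimal sketch on plain Python dictionaries. The order of the normalization steps and the handling of labels missing from one set are assumptions made for this example (labels not present in both sets are simply dropped); check scripts/merge_avg.py for the actual behaviour. The names `l2_normalize` and `merge_two` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def merge_two(v1, v2):
    """Average the unit-length vectors of labels found in both sets."""
    merged = {}
    for label in v1.keys() & v2.keys():
        a, b = l2_normalize(v1[label]), l2_normalize(v2[label])
        merged[label] = l2_normalize([(x + y) / 2 for x, y in zip(a, b)])
    return merged


# Toy sets; 'toy_other' is absent from the first set and is dropped here.
toy_v1 = {"toy_sense": [3.0, 0.0]}
toy_v2 = {"toy_sense": [0.0, 4.0], "toy_other": [1.0, 0.0]}
toy_merged = merge_two(toy_v1, toy_v2)
print(toy_merged)
```

Normalizing each input first keeps one set from dominating the average just because its vectors have larger magnitudes.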

Evaluation

Each of the 5 tasks tackled in the paper has its own evaluation script in evaluation/.

We refer to the start of each evaluation script for example usage and more details.

Demos

For easier application on downstream tasks, we also prepared demonstration files showcasing barebones applications of LMMS-SP for disambiguation and matching using WordNet.

  • demo_disambiguation.py: Loads a Transformer model, LMMS SP-WSD sense embeddings, and spaCy (for lemmatization and POS-tagging), and applies them to disambiguate a particular word in an example sentence.
  • demo_matching.py: Loads a Transformer model and LMMS SP-USM sense embeddings, and applies them to match sensekeys and synsets to a particular word/span in an example sentence.
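
Both demos rest on comparing a contextual embedding with sense embeddings by similarity. The sketch below shows that matching step in isolation, using cosine similarity on toy two-dimensional vectors. It is not taken from the demo files: the names `cosine` and `rank_senses`, the toy labels, and the optional `candidates` filter (for example, to restrict matches to the senses of one lemma and part of speech) are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_senses(context_vec, sense_vecs, candidates=None):
    """Return (label, similarity) pairs sorted from best to worst match."""
    labels = sense_vecs.keys() if candidates is None else candidates
    scored = [(label, cosine(context_vec, sense_vecs[label]))
              for label in labels if label in sense_vecs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy sense inventory and a toy contextual embedding for the target word.
toy_senses = {
    "toy_sense_a": [1.0, 0.0],
    "toy_sense_b": [0.0, 1.0],
    "toy_sense_c": [-1.0, 0.0],
}
toy_context = [0.9, 0.1]
ranking = rank_senses(toy_context, toy_senses)
best_label = ranking[0][0]
print(best_label)
```

For disambiguation only the top match is needed, while matching tasks can use the full ranking.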

References

Artificial Intelligence Journal (AIJ)

Current version featuring Sense Profiles, probing analysis, and extensive evaluation (ScienceDirect, arXiv (preprint)).

@article{LOUREIRO2022103661,
title = {LMMS reloaded: Transformer-based sense embeddings for disambiguation and beyond},
journal = {Artificial Intelligence},
volume = {305},
pages = {103661},
year = {2022},
issn = {0004-3702},
doi = {https://doi.org/10.1016/j.artint.2022.103661},
url = {https://www.sciencedirect.com/science/article/pii/S0004370222000017},
author = {Daniel Loureiro and Alípio {Mário Jorge} and Jose Camacho-Collados},
keywords = {Semantic representations, Neural language models},
abstract = {Distributional semantics based on neural approaches is a cornerstone of Natural Language Processing, with surprising connections to human meaning representation as well. Recent Transformer-based Language Models have proven capable of producing contextual word representations that reliably convey sense-specific information, simply as a product of self-supervision. Prior work has shown that these contextual representations can be used to accurately represent large sense inventories as sense embeddings, to the extent that a distance-based solution to Word Sense Disambiguation (WSD) tasks outperforms models trained specifically for the task. Still, there remains much to understand on how to use these Neural Language Models (NLMs) to produce sense embeddings that can better harness each NLM's meaning representation abilities. In this work we introduce a more principled approach to leverage information from all layers of NLMs, informed by a probing analysis on 14 NLM variants. We also emphasize the versatility of these sense embeddings in contrast to task-specific models, applying them on several sense-related tasks, besides WSD, while demonstrating improved performance using our proposed approach over prior work focused on sense embeddings. Finally, we discuss unexpected findings regarding layer and model performance variations, and potential applications for downstream tasks.}
}

ACL 2019

The original LMMS paper (ACL Anthology, arXiv).

@inproceedings{loureiro-jorge-2019-language,
    title = "Language Modelling Makes Sense: Propagating Representations through {W}ord{N}et for Full-Coverage Word Sense Disambiguation",
    author = "Loureiro, Daniel  and
      Jorge, Al{\'\i}pio",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1569",
    doi = "10.18653/v1/P19-1569",
    pages = "5682--5691"
}

EMNLP 2020

Where we improve LMMS sense embeddings using automatic annotations for unambiguous words (UWA corpus) (ACL Anthology, arXiv).

@inproceedings{loureiro-camacho-collados-2020-dont,
    title = "Don{'}t Neglect the Obvious: On the Role of Unambiguous Words in Word Sense Disambiguation",
    author = "Loureiro, Daniel  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.283",
    doi = "10.18653/v1/2020.emnlp-main.283",
    pages = "3514--3520"
}

SemDeep-5 at IJCAI 2019

Application of LMMS for the Word-in-Context (WiC) Challenge (ACL Anthology, arXiv).

@inproceedings{loureiro-jorge-2019-liaad,
    title = "{LIAAD} at {S}em{D}eep-5 Challenge: Word-in-Context ({W}i{C})",
    author = "Loureiro, Daniel  and
      Jorge, Al{\'\i}pio",
    booktitle = "Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)",
    month = aug,
    year = "2019",
    address = "Macau, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5801",
    pages = "1--5",
}