LAMA ia a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
The dataset for the LAMA probe is available at https://dl.fbaipublicfiles.com/LAMA/data.zip
LAMA contains a set of connectors to pretrained language models.
LAMA exposes a transparent and unique interface to use:
- Transformer-XL (Dai et al., 2019)
- BERT (Devlin et al., 2018)
- ELMo (Peters et al., 2018)
- GPT (Radford et al., 2018)
- RoBERTa (Liu et al., 2019)
Actually, LAMA is also a beautiful animal.
The LAMA probe is described in the following paper:
@inproceedings{petroni2019language,
title={Language Models as Knowledge Bases?},
author={F. Petroni, T. Rockt{\"{a}}schel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu and S. Riedel},
booktitle={To Appear in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019},
year={2019}
}
Preprint version: https://arxiv.org/abs/1909.01066
To reproduce our results:
(optional) It might be a good idea to use a separate conda environment. It can be created by running:
conda create -n lama37 -y python=3.7 && conda activate lama37
pip install -r requirements.txt
wget https://dl.fbaipublicfiles.com/LAMA/data.zip
unzip data.zip
rm data.zip
Install spacy model
python3 -m spacy download en
Download the models
chmod +x download_models.sh
./download_models.sh
The script will create and populate a pre-trained_language_models folder. If you are interested in a particular model please edit the script.
python scripts/run_experiments.py
results will be logged in output/ and last_results.csv.
and use the vectors in your downstream task!
pip install -e git+https://github.com/facebookresearch/LAMA#egg=LAMA
import argparse
from lama.build_encoded_dataset import encode, load_encoded_dataset
PARAMETERS= {
"lm": "bert",
"bert_model_name": "bert-large-cased",
"bert_model_dir":
"pre-trained_language_models/bert/cased_L-24_H-1024_A-16",
"bert_vocab_name": "vocab.txt",
"batch_size": 32
}
args = argparse.Namespace(**PARAMETERS)
sentences = [
["The cat is on the table ."], # single-sentence instance
["The dog is sleeping on the sofa .", "He makes happy noises ."], # two-sentence
]
encoded_dataset = encode(args, sentences)
print("Embedding shape: %s" % str(encoded_dataset[0].embedding.shape))
print("Tokens: %r" % encoded_dataset[0].tokens)
# save on disk the encoded dataset
encoded_dataset.save("test.pkl")
# load from disk the encoded dataset
new_encoded_dataset = load_encoded_dataset("test.pkl")
print("Embedding shape: %s" % str(new_encoded_dataset[0].embedding.shape))
print("Tokens: %r" % new_encoded_dataset[0].tokens)
You should use the symbol [MASK]
to specify the gap.
Only single-token gap supported - i.e., a single [MASK]
.
python lama/eval_generation.py \
--lm "bert" \
--t "The cat is on the [MASK]."
Note that you could use this functionality to answer cloze-style questions, such as:
python lama/eval_generation.py \
--lm "bert" \
--t "The theory of relativity was developed by [MASK] ."
Clone the repo
git clone git@github.com:facebookresearch/LAMA.git && cd LAMA
Install as an editable package:
pip install --editable .
If you get an error in mac os x, please try running this instead
CFLAGS="-Wno-deprecated-declarations -std=c++11 -stdlib=libc++" pip install --editable .
Option to indicate which language model(s) to use:
- --language-models/--lm : comma separated list of language models (REQUIRED)
BERT pretrained models can be loaded both: (i) passing the name of the model and using huggingface cached versions or (ii) passing the folder containing the vocabulary and the PyTorch pretrained model (look at convert_tf_checkpoint_to_pytorch in here to convert the TensorFlow model to PyTorch).
- --bert-model-dir/--bmd : directory that contains the BERT pre-trained model and the vocabulary
- --bert-model-name/--bmn : name of the huggingface cached versions of the BERT pre-trained model (default = 'bert-base-cased')
- --bert-vocab-name/--bvn : name of vocabulary used to pre-train the BERT model (default = 'vocab.txt')
- --roberta-model-dir/--rmd : directory that contains the RoBERTa pre-trained model and the vocabulary (REQUIRED)
- --roberta-model-name/--rmn : name of the RoBERTa pre-trained model (default = 'model.pt')
- --roberta-vocab-name/--rvn : name of vocabulary used to pre-train the RoBERTa model (default = 'dict.txt')
- --elmo-model-dir/--emd : directory that contains the ELMo pre-trained model and the vocabulary (REQUIRED)
- --elmo-model-name/--emn : name of the ELMo pre-trained model (default = 'elmo_2x4096_512_2048cnn_2xhighway')
- --elmo-vocab-name/--evn : name of vocabulary used to pre-train the ELMo model (default = 'vocab-2016-09-10.txt')
- --transformerxl-model-dir/--tmd : directory that contains the pre-trained model and the vocabulary (REQUIRED)
- --transformerxl-model-name/--tmn : name of the pre-trained model (default = 'transfo-xl-wt103')
- --gpt-model-dir/--gmd : directory that contains the gpt pre-trained model and the vocabulary (REQUIRED)
- --gpt-model-name/--gmn : name of the gpt pre-trained model (default = 'openai-gpt')
options:
- --text/--t : text to compute the generation for
- --i : interactive mode
one of the two is required
example considering both BERT and ELMo:
python lama/eval_generation.py \
--lm "bert,elmo" \
--bmd "pre-trained_language_models/bert/cased_L-24_H-1024_A-16/" \
--emd "pre-trained_language_models/elmo/original/" \
--t "The cat is on the [MASK]."
example considering only BERT with the default pre-trained model, in an interactive fashion:
python lamas/eval_generation.py \
--lm "bert" \
--i
python lama/get_contextual_embeddings.py \
--lm "bert,elmo" \
--bmn bert-base-cased \
--emd "pre-trained_language_models/elmo/original/"
If the module cannot be found, preface the python command with PYTHONPATH=.
If the experiments fail on GPU memory allocation, try reducing batch size.
- https://github.com/huggingface/pytorch-pretrained-BERT
- https://github.com/allenai/allennlp
- https://github.com/pytorch/fairseq
-
(Dai et al., 2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdi. Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.
-
(Peters et al., 2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL-HLT 2018
-
(Devlin et al., 2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
-
(Radford et al., 2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
-
(Liu et al., 2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
LAMA is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.