| 📖 paper | 🤗 PEARL-small | 🤗 PEARL-base | 💾 data |
PEARL is a framework for learning phrase-level representations.
If you need to compute semantic similarity between strings, the PEARL models may be a helpful tool.
They offer powerful embeddings for tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.
Model | Size | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
FastText | - | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 | 40.3 |
Sentence-BERT | 110M | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 | 50.1 |
Phrase-BERT | 110M | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 | 54.5 |
E5-small | 34M | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 | 57.0 |
E5-base | 110M | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 | 61.1 |
PEARL-small | 34M | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 | 62.5 |
PEARL-base | 110M | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 | 64.8 |
Cost comparison of FastText and PEARL. The estimated memory is computed from the number of parameters stored in float16. Inference speed is reported in ms per 512 samples. The FastText model used here is `crawl-300d-2M-subword.bin`.
Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
---|---|---|---|---|
FastText | 40.3 | 1200MB | - | 57ms |
PEARL-small | 62.5 | 68MB | 42ms | 446ms |
PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
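As a rough illustration (not the exact procedure used for the table), the memory estimate can be approximated by multiplying the parameter count by two bytes per float16 value. Here is a minimal sketch using the Hugging Face model object; the reported figures may differ slightly.

```python
from transformers import AutoModel

# Load PEARL-small and count its parameters.
model = AutoModel.from_pretrained("Lihuchen/pearl_small")
num_params = sum(p.numel() for p in model.parameters())

# Rough memory estimate: 2 bytes per parameter when stored in float16.
estimated_mb = num_params * 2 / (1024 ** 2)
print(f"{num_params / 1e6:.1f}M parameters ~ {estimated_mb:.0f} MB in float16")
```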
Check out our models on Hugging Face: 🤗 PEARL-small 🤗 PEARL-base
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```
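Building on the snippet above, here is a small illustration (not from the paper or the repository) of how the same embeddings could drive a simple fuzzy-join / entity-matching step: encode both sides, then pick the right-hand string with the highest cosine similarity for each left-hand string. The `left` and `right` lists are made-up examples; `encode_text`, `tokenizer`, and `model` are reused from the snippet above.

```python
import torch.nn.functional as F

# Hypothetical tables to join on noisy name strings.
left = ["Intl. Business Machines", "NY Times"]
right = ["IBM", "The New York Times", "Microsoft"]

# Encode everything in one batch and L2-normalize so dot products are cosine similarities.
emb = F.normalize(encode_text(model, left + right), p=2, dim=1)
left_emb, right_emb = emb[:len(left)], emb[len(left):]

# Similarity matrix: one row per left string, one column per right string.
sim = left_emb @ right_emb.T
best_score, best_idx = sim.max(dim=1)

for name, idx, score in zip(left, best_idx.tolist(), best_score.tolist()):
    print(f"{name} -> {right[idx]} (cosine={score:.2f})")
```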
We evaluate phrase embeddings on a benchmark that contains 9 datasets covering 5 different tasks. 📥 Download Benchmark
- | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
---|---|---|---|---|---|---|---|---|---|
Task | Paraphrase Classification | Paraphrase Classification | Phrase Similarity | Phrase Similarity | Entity Retrieval | Entity Retrieval | Entity Clustering | Entity Clustering | Fuzzy Join |
Metric | Acc | Acc | Acc | Pearson | Top-1 Acc | Top-1 Acc | NMI | NMI | Acc |
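As a rough illustration of the metrics (not the benchmark script itself), the retrieval and clustering scores could be computed with scikit-learn along these lines; the function names and arguments below are placeholders, not APIs from this repository.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score


def top1_accuracy(query_emb, cand_emb, gold_idx):
    # Top-1 accuracy for entity retrieval: the nearest candidate must be the gold entity.
    # query_emb is (n_queries, d), cand_emb is (n_candidates, d), rows L2-normalized.
    pred = (query_emb @ cand_emb.T).argmax(axis=1)
    return float((pred == np.asarray(gold_idx)).mean())


def clustering_nmi(gold_labels, pred_labels):
    # NMI for entity clustering: compare predicted cluster ids with gold labels.
    return normalized_mutual_info_score(gold_labels, pred_labels)
```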
Put the downloaded `eval_data/` into the `evaluation/` directory and run the script `evaluation/eval.py` to reproduce the scores reported in our paper.
```bash
python eval.py -batch_size 8
```
Evaluate your custom model
You need to implement a `Module` class that generates embeddings for a given list of texts, and then reuse `eval.py`.
```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


class PearlSmallModel(nn.Module):
    def __init__(self):
        super().__init__()
        model_name = "Lihuchen/pearl_small"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def average_pool(self, last_hidden_states, attention_mask):
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def forward(self, x, device):
        # Tokenize the input texts
        batch_dict = self.tokenizer(x, max_length=128, padding=True, truncation=True, return_tensors='pt')
        batch_dict = batch_dict.to(device)
        outputs = self.model(**batch_dict)
        phrase_vec = self.average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return phrase_vec
```
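For instance, the wrapper above can be exercised on its own as a quick sanity check before plugging it into `eval.py`; the phrases and device choice here are arbitrary.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = PearlSmallModel().to(device)
encoder.eval()

with torch.no_grad():
    vecs = encoder(["The New York Times", "NYTimes"], device)
print(vecs.shape)  # expected (2, 384) for PEARL-small
```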
Download all needed training files: 📥 Download Training Files
There are five files in total:
- `freq_phrase.txt` has more than 3M phrases
- `phrase_with_etype.txt` has the entity labels for Phrase Type Classification
- `token_aug.jsonl` has token-level augmentations
- `phrase_aug.jsonl` has phrase-level augmentations
- `hard_negative.txt` has pre-defined hard negatives
Put the downloaded files into `source/train_data`.
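The exact schema of each file is defined by the data release itself; as a generic illustration only, a JSONL augmentation file can be streamed record by record to inspect its fields before training (the path below assumes the layout described above).

```python
import json

# Peek at the first few records of a JSONL augmentation file (schema depends on the release).
with open("source/train_data/phrase_aug.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.keys())  # inspect the available fields
        if i >= 2:
            break
```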
To list all available arguments:

```bash
python main.py -help
```

Once data preparation and environment setup are complete, we can train the model via `main.py`:

```bash
python main.py -target_model intfloat/e5-small-v2 -dim 384
```
If you find our paper and code useful, please give us a citation 😊
```bibtex
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```