
Learning High-Quality and General-Purpose Phrase Representations. Findings of EACL 2024


PEARL (Learning High-Quality and General-Purpose Phrase Representations)

| 📖 paper | 🤗 PEARL-small | 🤗 PEARL-base | 💾 data |

PEARL is a framework for learning phrase-level representations.
If you need to compute semantic similarity between strings, a PEARL model may be a helpful tool.
It offers powerful embeddings suited to tasks such as string matching, entity retrieval, entity clustering, and fuzzy join.

| Model | Size | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FastText | - | 94.4 | 61.2 | 59.6 | 58.9 | 16.9 | 14.5 | 3.0 | 0.2 | 53.6 | 40.3 |
| Sentence-BERT | 110M | 94.6 | 66.8 | 50.4 | 62.6 | 21.6 | 23.6 | 25.5 | 48.4 | 57.2 | 50.1 |
| Phrase-BERT | 110M | 96.8 | 68.7 | 57.2 | 68.8 | 23.7 | 26.1 | 35.4 | 59.5 | 66.9 | 54.5 |
| E5-small | 34M | 96.0 | 56.8 | 55.9 | 63.1 | 43.3 | 42.0 | 27.6 | 53.7 | 74.8 | 57.0 |
| E5-base | 110M | 95.4 | 65.6 | 59.4 | 66.3 | 47.3 | 44.0 | 32.0 | 69.3 | 76.1 | 61.1 |
| PEARL-small | 34M | 97.0 | 70.2 | 57.9 | 68.1 | 48.1 | 44.5 | 42.4 | 59.3 | 75.2 | 62.5 |
| PEARL-base | 110M | 97.3 | 72.2 | 59.7 | 72.6 | 50.7 | 45.8 | 39.3 | 69.4 | 77.1 | 64.8 |

Cost comparison of FastText and PEARL. Estimated memory is computed from the number of parameters (float16, 2 bytes per parameter). Inference speed is reported in ms per 512 samples. The FastText model used here is crawl-300d-2M-subword.bin.

| Model | Avg Score | Estimated Memory | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|
| FastText | 40.3 | 1200MB | - | 57ms |
| PEARL-small | 62.5 | 68MB | 42ms | 446ms |
| PEARL-base | 64.8 | 220MB | 89ms | 1394ms |
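The memory figures follow directly from the parameter counts at float16 (2 bytes per parameter); a quick sanity check:

```python
def estimated_memory_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Estimated model memory in MB, assuming float16 (2 bytes) per parameter."""
    return num_params * bytes_per_param / 1e6

# PEARL-small (34M params) and PEARL-base (110M params)
print(estimated_memory_mb(34_000_000))   # 68.0  -> ~68MB
print(estimated_memory_mb(110_000_000))  # 220.0 -> ~220MB
```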

Usage

Check out our model on Huggingface: 🤗 PEARL-small 🤗 PEARL-base

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean-pool over non-padding tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    return embeddings


query_texts = ["The New York Times"]
doc_texts = ["NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected output
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```
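The scoring step above is plain cosine similarity: L2-normalize the embeddings, then take dot products. A model-free numpy sketch of the same computation, using toy vectors rather than real PEARL embeddings:

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    # L2-normalize rows, then dot product == cosine similarity
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return query @ docs.T

# Toy 4-dim embeddings standing in for PEARL outputs
query = np.array([[1.0, 0.0, 1.0, 0.0]])
docs = np.array([[1.0, 0.0, 0.9, 0.0],   # close to the query
                 [0.0, 1.0, 0.0, 1.0]])  # orthogonal to the query
scores = cosine_scores(query, docs)
print(scores.round(3))  # first doc scores near 1, second scores 0
```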

Evaluation

We evaluate phrase embeddings on a benchmark containing 9 datasets across 5 different tasks. 📥 Download Benchmark

| Dataset | PPDB | PPDB filtered | Turney | BIRD | YAGO | UMLS | CoNLL | BC5CDR | AutoFJ |
|---|---|---|---|---|---|---|---|---|---|
| Task | Paraphrase Classification | Paraphrase Classification | Phrase Similarity | Phrase Similarity | Entity Retrieval | Entity Retrieval | Entity Clustering | Entity Clustering | Fuzzy Join |
| Metric | Acc | Acc | Acc | Pearson | Top-1 Acc | Top-1 Acc | NMI | NMI | Acc |
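As an illustration of the retrieval metric, Top-1 Acc checks whether each query's nearest embedding is its gold entity. A minimal sketch with toy vectors (the real benchmark uses the datasets above, not these hand-made examples):

```python
import numpy as np

def top1_accuracy(query_emb: np.ndarray, entity_emb: np.ndarray, gold: list) -> float:
    # Cosine similarity between every query and every candidate entity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    e = entity_emb / np.linalg.norm(entity_emb, axis=1, keepdims=True)
    nearest = (q @ e.T).argmax(axis=1)  # index of the most similar entity
    return float(np.mean(nearest == np.array(gold)))

# Two toy queries, three candidate entities; gold labels are entity indices
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
entities = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
acc = top1_accuracy(queries, entities, gold=[0, 1])
print(acc)  # 1.0: both queries retrieved their gold entity
```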

Put the downloaded eval_data/ into the evaluation/ directory and run the script evaluation/eval.py to reproduce the scores in our paper.

```shell
python eval.py -batch_size 8
```

Evaluate your custom model
To evaluate a custom model, implement an nn.Module subclass that generates embeddings for a list of texts, then reuse eval.py.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


class PearlSmallModel(nn.Module):
    def __init__(self):
        super().__init__()
        model_name = "Lihuchen/pearl_small"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def average_pool(self, last_hidden_states, attention_mask):
        # Mean-pool over non-padding tokens
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def forward(self, x, device):
        # Tokenize the input texts
        batch_dict = self.tokenizer(x, max_length=128, padding=True, truncation=True, return_tensors='pt')
        batch_dict = batch_dict.to(device)
        outputs = self.model(**batch_dict)
        phrase_vec = self.average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return phrase_vec
```

Training

Download all needed training files: 📥 Download Training Files
There are five files in total:

  • freq_phrase.txt has more than 3M phrases
  • phrase_with_etype.txt has the entity label for the Phrase Type Classification
  • token_aug.jsonl has token-level augmentations
  • phrase_aug.jsonl has phrase-level augmentations
  • hard_negative.txt has pre-defined hard negatives

Put the downloaded files into source/train_data.
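The exact schema of the .jsonl files is defined by the downloaded data; as a generic sketch, files with one JSON object per line can be loaded as below. The 'phrase'/'aug' field names here are purely illustrative, not the actual PEARL schema:

```python
import json
from pathlib import Path

def load_jsonl(path):
    # One JSON object per line, as in token_aug.jsonl / phrase_aug.jsonl
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a temporary file; 'phrase'/'aug' are illustrative field names only
demo = Path('demo_aug.jsonl')
demo.write_text('{"phrase": "new york", "aug": "new yrok"}\n')
records = load_jsonl(demo)
print(records[0]["aug"])  # new yrok
demo.unlink()
```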

```shell
python main.py -help
```

Once the data preparation and environment setup are complete, we can train the model via main.py.

```shell
python main.py -target_model intfloat/e5-small-v2 -dim 384
```

Citation

If you find our paper and code useful, please cite us 😊

```
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```