The official code repo for "Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations".
https://arxiv.org/pdf/2311.04335.pdf
In contrast to sentence embeddings, where the meaning of each sentence is represented by one fixed-length vector, the sub-sentence encoder produces distinct embeddings for different parts of a sentence.
The model is implemented with pytorch_lightning. See requirements.txt for the list of required packages.
We will release a native pytorch + huggingface compatible version of the model soon.
Model checkpoints for the different model variants can be found in this Google Drive folder.
To load the model and tokenizer:
from model.pl_subencoder import LitSubEncoder
from transformers import AutoTokenizer
model_path = "<your-model-dir>/subencoder_st5_base.ckpt"
model = LitSubEncoder.load_from_checkpoint(model_path)
tokenizer = AutoTokenizer.from_pretrained(model.params.model_name)
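If you are running inference, you will typically want to put the model in eval mode and move it to the same device as your inputs. This is standard PyTorch usage rather than anything specific to this repo:
import torch
model.eval()  # disable dropout etc. for inference
if torch.cuda.is_available():
    model = model.cuda()  # keep the model on the same device as the inputs below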
Here are some examples of how to use the model.
Suppose you have the following sentences, each with a few sub-sentence parts you want to encode, where each sub-sentence part is expressed in natural language form.
sentence1 = "Dracula is a novel by Bram Stoker, published in 1897."
parts1 = ["Dracula is by Bram Stoker", "Dracula is published in 1897"]
sentence2 = "Dracula – a 19th-century Gothic novel, featuring Count Dracula as the protagonist."
parts2 = ["Dracula a 19th-century novel"]
Step #1: Align the sub-sentence parts with the original sentence
from utils.label_utils import align_sub_sentence
spans1 = [align_sub_sentence(full_sent=sentence1, decomp_sent=p) for p in parts1]
spans2 = [align_sub_sentence(full_sent=sentence2, decomp_sent=p) for p in parts2]
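Each element of spans1 / spans2 is the set of character spans in the original sentence that covers the corresponding sub-sentence part (the same span format used by sent1_props_spans / sent2_props_spans in the training data described below). You can print them to sanity-check the alignment; the exact offsets depend on the aligner:
# Each entry is a list of [start, end] character offsets into the original sentence.
print(spans1)
print(spans2)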
Step #2: Tokenize the inputs (and convert the sub-sentence parts into attention mask format)
from utils.data_utils import convert_example_to_features
sent1_ex = [{
    "text": sentence1,
    "spans": spans1
}]
sent2_ex = [{
    "text": sentence2,
    "spans": spans2
}]
sent1_inputs = convert_example_to_features(batch_examples=sent1_ex, tokenizer=tokenizer, max_seq_len=128, sent1_key="text", sent1_props_key="spans")
sent2_inputs = convert_example_to_features(batch_examples=sent2_ex, tokenizer=tokenizer, max_seq_len=128, sent1_key="text", sent1_props_key="spans")
The convert_example_to_features function returns a PropPairModelInput object that will be used as input to the model. If you need to move the input to a different device, e.g. cuda, you can do so via
sent1_inputs.to("cuda")
sent2_inputs.to("cuda")
Step #3: Encode the examples
sent1_embeddings = model(
    encoder_inputs=sent1_inputs.encoder_inputs,
    num_prop_per_sentence=sent1_inputs.num_prop_per_sentence,
    all_prop_mask=sent1_inputs.all_prop_mask
)
sent2_embeddings = model(
    encoder_inputs=sent2_inputs.encoder_inputs,
    num_prop_per_sentence=sent2_inputs.num_prop_per_sentence,
    all_prop_mask=sent2_inputs.all_prop_mask
)
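The model returns one embedding per sub-sentence part, in the same order as the input spans, so sent1_embeddings[0] corresponds to parts1[0] and so on. Assuming the embeddings come back as a [num_parts, hidden_dim] tensor (consistent with how they are indexed in Step #4 below), a quick sanity check:
print(sent1_embeddings.shape)  # one row per part of sentence 1, e.g. [2, hidden_dim]
print(sent2_embeddings.shape)  # e.g. [1, hidden_dim]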
Step #4: Compute the cosine similarities between the sub-sentence parts
import torch.nn.functional as F
# Between "Dracula is by Bram Stoker" and "Dracula is published in 1897" in sentence 1
sim = F.cosine_similarity(sent1_embeddings[0], sent1_embeddings[1], dim=-1)
print(sim)
# Output: tensor(0.2909)
# Between "Dracula is published in 1897" from sentence 1 and "Dracula a 19th-century novel" from sentence 2
sim = F.cosine_similarity(sent1_embeddings[1], sent2_embeddings[0], dim=-1)
print(sim)
# Output: tensor(0.7180)
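If you want all pairwise similarities between the parts of the two sentences at once, you can broadcast the cosine similarity. This is a small sketch that assumes the embeddings are [num_parts, hidden_dim] tensors, consistent with the indexing above:
# Pairwise cosine similarities: rows = parts of sentence 1, columns = parts of sentence 2
sims = F.cosine_similarity(sent1_embeddings.unsqueeze(1), sent2_embeddings.unsqueeze(0), dim=-1)
print(sims)  # shape [2, 1] for this example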
We release a T5-large model and usage instructions for proposition segmentation on Huggingface:
https://huggingface.co/sihaochen/SegmenT5-large
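Since it is a T5 checkpoint, it can be loaded with the standard Huggingface seq2seq classes. The sketch below only illustrates the generic loading and generation pattern and uses the raw sentence as a placeholder input; refer to the model card for the exact input prefix and the delimiter used to separate propositions in the output:
from transformers import AutoTokenizer, T5ForConditionalGeneration

seg_tokenizer = AutoTokenizer.from_pretrained("sihaochen/SegmenT5-large")
seg_model = T5ForConditionalGeneration.from_pretrained("sihaochen/SegmenT5-large")

# Format the sentence with the prompt/prefix described on the model card;
# the plain sentence is used here only as a placeholder.
inputs = seg_tokenizer(sentence1, return_tensors="pt")
outputs = seg_model.generate(**inputs, max_new_tokens=128)
print(seg_tokenizer.decode(outputs[0], skip_special_tokens=True))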
The train/validation/test data we used for model training and development can be found in this Google Drive folder.
The training split comp_sents_prop_train.jsonl contains ~240k sentence pairs with NLI-model-labeled pairs of propositions. Each example follows the format below: positive_pairs contains the indices of propositions from sent1_props and sent2_props that form a positive pair with each other; sent1_props_spans and sent2_props_spans contain the sets of character spans in the original sentence that correspond to each proposition.
{
    "pair_id": "comp_sent_30",
    "sent1": "Salmond faced a total of 14 charges when he appeared at Edinburgh Sheriff Court on Thursday.",
    "sent2": "Salmond was arrested Wednesday, and faces 14 charges including nine counts of sexual assault and two of attempted rape.",
    "sent1_props": [
        "Salmond appeared in Edinburgh Sheriff Court.",
        "Salmond faced 14 charges."
    ],
    "sent2_props": [
        "Salmond was arrested on Wednesday.",
        "Salmond is facing 14 charges.",
        "Nine of the charges against Salmond are for sexual assault.",
        "Two of the charges against Salmond are for attempted rape."
    ],
    "sent1_props_spans": [[[0, 7], [44, 52], [56, 79]], [[0, 13], [25, 35]]],
    "sent2_props_spans": [[[0, 30]], [[0, 7], [36, 52]], [[0, 7], [45, 52], [63, 67], [75, 92]], [[0, 7], [45, 52], [97, 118]]],
    "positive_pairs": [[1, 1]]
}
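A minimal sketch for reading the training split and resolving positive_pairs back to proposition text (the data/ path below is only a placeholder for wherever you downloaded the file):
import json

# Placeholder path; point this at your downloaded copy of the training split.
with open("data/comp_sents_prop_train.jsonl") as f:
    example = json.loads(next(f))

# Each entry [i, j] in positive_pairs says that sent1_props[i] and sent2_props[j]
# form a positive pair.
for i, j in example["positive_pairs"]:
    print(example["sent1_props"][i], "<->", example["sent2_props"][j])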
The data formatted for the retrieval task can be found in the Google Drive folder. It is also available as a Huggingface dataset under sihaochen/propsegment-retrieval.
To evaluate a sub-sentence encoder checkpoint on the retrieval task:
python scripts/evaluation/eval_retrieval.py \
    --checkpoint_path <path to .ckpt file> \
    --batch_size 256 \
    --query_path data/propsegment_queries_all.jsonl \
    --target_path data/propsegment_targets_all.jsonl \
    --cuda
To evaluate a baseline sentence encoder, e.g. SimCSE:
python scripts/evaluation/eval_retrieval.py \
    --baseline_name princeton-nlp/unsup-simcse-bert-base-uncased \
    --batch_size 256 \
    --query_path data/propsegment_queries_all.jsonl \
    --target_path data/propsegment_targets_all.jsonl \
    --cuda
@article{chen2023subsentence,
    title={Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations},
    author={Sihao Chen and Hongming Zhang and Tong Chen and Ben Zhou and Wenhao Yu and Dian Yu and Baolin Peng and Hongwei Wang and Dan Roth and Dong Yu},
    journal={arXiv preprint arXiv:2311.04335},
    year={2023},
    url={https://arxiv.org/pdf/2311.04335.pdf}
}