Although EgoVLMs have been pretrained on millions of egocentric videos from around the world and applied to challenging downstream tasks such as video-text retrieval, we observe that they often fail to select the matching sentence for a video even when the candidates differ from it by only a single substituted word.
(1) We develop EgoHOIBench, a novel benchmark specifically designed to evaluate EgoVLMs' ability to understand variations in hand-object interaction (HOI) combinations.
(2) We propose EgoNCE++, an innovative HOI-aware asymmetric contrastive learning objective for egocentric video-language pretraining.
(3) Our experiments demonstrate the versatility and efficacy of EgoNCE++, which notably enhances performance across three EgoVLMs and improves generalization on seven downstream EgoHOI tasks.
EgoNCE++ can be implemented with ease; lines marked with ++ are the EgoNCE++-specific additions:
import torch
import torch.nn as nn
import torch.nn.functional as F

class EgoNCEpp(nn.Module):
    def __init__(self, temperature=0.05):
        super().__init__()
        self.temperature = temperature

    def forward(self, video_embeds, pos_txt, neg_txt=None, n_embeds=None):
        pos_sim = sim_matrix(pos_txt, video_embeds)
++      if neg_txt is not None and n_embeds is not None:
++          sim_n = n_embeds @ n_embeds.T                # caption-caption noun similarity, N x N
++          neg_sim = sim_matrix(neg_txt, video_embeds)  # (Neg_Number * N) x N
++          return self.get_loss(pos_sim, neg_sim, sim_n)
++      else:
            return self.get_loss(pos_sim)

    def get_loss(self, pos_sim, neg_sim=None, mask_n=None):
        '''
        inputs:
            pos_sim: N x N similarity matrix, computed as the cosine
                     similarity between normalised text and video embeddings
            neg_sim: (Neg_Number * N) x N similarity matrix between the
                     generated negative captions and the videos
            mask_n:  N x N noun-similarity matrix used for object-centric
                     positive sampling
        '''
        mask = torch.eye(pos_sim.shape[0], device=pos_sim.device)
        # text-to-video object-centric positive sampling:
        # captions sharing nouns with caption i are also treated as positives
++      if mask_n is not None:
++          mask = mask_n + mask
        i_sm = F.softmax(pos_sim / self.temperature, dim=1)
        mask_bool = mask > 0
        i_mask = torch.zeros(i_sm.shape).to(mask_bool.device) + 1e-6
        i_mask[:mask_bool.shape[0], :mask_bool.shape[1]] = mask_bool
        idiag = torch.log(torch.sum(i_sm * i_mask, dim=1))
        loss_t2v = idiag.sum() / len(idiag)
        # video-to-text HOI-aware negative generation:
        # each video additionally competes against its own generated negatives
++      if neg_sim is not None:
++          neg_num, video_num = neg_sim.shape[0] // neg_sim.shape[1], neg_sim.shape[1]
++          # gather, for video i, its similarities to its own Neg_Number negatives
++          neg_sim = torch.stack([neg_sim[i * neg_num:(i + 1) * neg_num, i]
++                                 for i in range(video_num)], dim=1)
++          pos_sim = torch.cat([pos_sim, neg_sim], dim=0)
        j_logsm = F.log_softmax(pos_sim.t() / self.temperature, dim=1)
        jdiag = torch.diag(j_logsm)
        loss_v2t = jdiag.sum() / len(jdiag)
        return -loss_t2v - loss_v2t
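The snippet assumes a sim_matrix helper that computes pairwise cosine similarity between L2-normalised embeddings, as the docstring notes. Below is a minimal sketch of such a helper, together with a toy usage example; the shapes, sizes, and variable names are illustrative only, not part of the released code.

def sim_matrix(a, b, eps=1e-8):
    # Pairwise cosine similarity: normalise rows, then take inner products.
    a_n = a / a.norm(dim=1, keepdim=True).clamp(min=eps)
    b_n = b / b.norm(dim=1, keepdim=True).clamp(min=eps)
    return a_n @ b_n.T

# Illustrative shapes: N clips, Neg_Number negatives per clip, D-dim embeddings.
N, neg_number, D = 8, 5, 256
loss_fn = EgoNCEpp(temperature=0.05)
video_embeds = torch.randn(N, D)
pos_txt = torch.randn(N, D)
neg_txt = torch.randn(neg_number * N, D)       # negatives of clip i stacked in rows i*Neg_Number onward
n_embeds = (torch.rand(N, 20) > 0.8).float()   # e.g. binary noun-occurrence vectors per caption
loss = loss_fn(video_embeds, pos_txt, neg_txt, n_embeds)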
The environment depends on the pretrained EgoVLM; see the corresponding installation docs.
The dataset preparation details are provided in DATASET.md.
Our annotations from Ego4D are outlined here.
We continue pretraining the EgoVLMs (e.g., EgoVLP, EgoVLPv2, LaViLa) on EgoHOI2.5M. The different types of negatives can be found in EgoHOI2.5M-anonymous; an illustrative sketch of their layout follows.
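As a purely hypothetical illustration (the actual lists ship with EgoHOI2.5M-anonymous): verb negatives substitute the action and noun negatives substitute the object of the ground-truth caption, and the flattened list follows the (Neg_Number * N) layout that EgoNCEpp.get_loss slices.

# Hypothetical example captions and negatives, for illustration only.
captions = ["#C C cuts the onion", "#C C opens the drawer"]
neg_texts = {
    "#C C cuts the onion":   ["#C C peels the onion",     # verb negative
                              "#C C squeezes the onion",  # verb negative
                              "#C C cuts the carrot"],    # noun negative
    "#C C opens the drawer": ["#C C closes the drawer",
                              "#C C wipes the drawer",
                              "#C C opens the fridge"],
}
# Flatten so the negatives of video i occupy rows i*Neg_Number ... (i+1)*Neg_Number - 1.
flat_negs = [t for c in captions for t in neg_texts[c]]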
EgoHOIBench includes ~29K videos and ~609K text options.
The annotations can be found in EgoHOIBench-anonymous.
We provide the training logs of the EgoVLMs under EgoVLP_train_log, LaViLa_train_log, and EgoVLPv2_train_log.
MODEL++ denotes the original MODEL after continued pretraining with EgoNCE++.
Overview of experimental results:
We are grateful to the following projects, upon which we build EgoNCE++: EgoVLP, EgoVLPv2, and LaViLa.