EgoNCEpp

[ICLR2025] Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?


[Figure: motivation]

Although EgoVLMs have been pretrained on millions of egocentric videos from around the world and applied to challenging downstream tasks such as video-text retrieval, we observe that they often fail to select the matching sentence for a video even when the distractor candidates differ from it by only a single substituted word.

Overview

(1) We develop EgoHOIBench, a novel benchmark specifically designed to evaluate EgoVLMs' ability to understand variations in HOI combinations.

(2) We propose EgoNCE++, an innovative HOI-aware asymmetric contrastive learning objective for egocentric video-language pretraining.

(3) Our experimental results demonstrate the versatility and efficacy of EgoNCE++, notably enhancing performance across three EgoVLMs and improving generalization on seven downstream EgoHOI tasks.

Implementation Difference between EgoNCE++ and InfoNCE

[Figure: EgoNCE++ framework]

EgoNCE++ can be implemented with ease; the lines marked `++` below are the additions relative to a standard InfoNCE loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EgoNCEpp(nn.Module):
    def __init__(self, temperature=0.05):
        super().__init__()
        self.temperature = temperature

    def forward(self, video_embeds, pos_txt, neg_txt=None, n_embeds=None):
        # sim_matrix computes the cosine-similarity matrix between L2-normalised
        # text and video embeddings (helper provided in each EgoVLM codebase)
        pos_sim = sim_matrix(pos_txt, video_embeds)
++      if neg_txt is not None and n_embeds is not None:
++          # noun (object) embedding similarities, used for object-centric positive sampling
++          sim_n = n_embeds @ n_embeds.T
++          neg_sim = sim_matrix(neg_txt, video_embeds)
++          return self.get_loss(pos_sim, neg_sim, sim_n)
++      else:
            return self.get_loss(pos_sim)

    def get_loss(self, pos_sim, neg_sim=None, mask_n=None):
        '''
        inputs:
            pos_sim: N x N similarity matrix, cosine similarity between normalised vectors
            neg_sim: (Neg_Number * N) x N similarity matrix between generated negative texts and videos
        '''
        mask = torch.eye(pos_sim.shape[0], device=pos_sim.device)
        # text-to-video object-centric positive sampling
++      if mask_n is not None:
++          mask = mask_n + mask

        i_sm = F.softmax(pos_sim / self.temperature, dim=1)
        mask_bool = mask > 0
        i_mask = torch.zeros(i_sm.shape).to(mask_bool.device) + 1e-6
        i_mask[:mask_bool.shape[0], :mask_bool.shape[1]] = mask_bool
        idiag = torch.log(torch.sum(i_sm * i_mask, dim=1))
        loss_t2v = idiag.sum() / len(idiag)

        # video-to-text HOI-aware negative generation
++      if neg_sim is not None:
++          # keep, for each video, only the similarities to its own generated negatives
++          neg_num, video_num = neg_sim.shape[0] // neg_sim.shape[1], neg_sim.shape[1]
++          neg_sim = torch.stack([neg_sim[i * neg_num: (i + 1) * neg_num, i] for i in range(video_num)], dim=1)
++          # append the negatives as extra candidate rows for the video-to-text softmax
++          pos_sim = torch.cat([pos_sim, neg_sim], dim=0)
        j_logsm = F.log_softmax(pos_sim.t() / self.temperature, dim=1)
        jdiag = torch.diag(j_logsm)
        loss_v2t = jdiag.sum() / len(jdiag)

        return -loss_t2v - loss_v2t
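For reference, here is a minimal usage sketch. With the `++` markers stripped, the class above runs as-is; the `sim_matrix` helper below is an assumption (a plain cosine-similarity matrix between L2-normalised embeddings, matching the docstring above), and the tensor shapes are purely illustrative rather than the batch layout used in the actual EgoVLM codebases.

import torch

def sim_matrix(a, b, eps=1e-8):
    # assumed helper: cosine-similarity matrix between rows of a and rows of b
    a_n = a / a.norm(dim=1, keepdim=True).clamp(min=eps)
    b_n = b / b.norm(dim=1, keepdim=True).clamp(min=eps)
    return a_n @ b_n.t()

# illustrative shapes: N=4 videos, D=256-dim embeddings, Neg_Number=3 negatives per video
N, D, neg_num = 4, 256, 3
video_embeds = torch.randn(N, D)
pos_txt = torch.randn(N, D)              # one positive caption embedding per video
neg_txt = torch.randn(neg_num * N, D)    # HOI-aware negative captions, grouped per video
n_embeds = torch.randn(N, D)             # noun (object) embeddings of the positive captions

loss_fn = EgoNCEpp(temperature=0.05)
loss = loss_fn(video_embeds, pos_txt, neg_txt=neg_txt, n_embeds=n_embeds)
print(loss)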

Installation

The environment depends on which pretrained EgoVLM you start from; see the corresponding installation docs.

Datasets

The dataset preparation details are provided in DATASET.md.

Our annotations from Ego4D are outlined here.

EgoHOI2.5M

We continue to pretrain the EgoVLMs (e.g., EgoVLP, EgoVLPv2, LaViLa) on EgoHOI2.5M. The different types of negatives can be found in EgoHOI2.5M-anonymous.

EgoHOIBench

EgoHOIBench includes ~29K videos and ~609K text options.

The annotations can be found in EgoHOIBench-anonymous.

[Figure: EgoHOIBench]

Training

We provide the training logs of the EgoVLMs under EgoVLP_train_log, LaViLa_train_log, and EgoVLPv2_train_log.

Models

MODEL++ denotes using EgoNCE++ to continue to pretrain the original MODEL.

Overview of experimental results:

[Figure: overview of performance]

Acknowledgement

We are grateful to the following projects, upon which we build EgoNCE++: EgoVLP, EgoVLPv2, and LaViLa.