
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory. CVPR 2023.

Narrations-as-Queries (NaQ)

This repository contains the official PyTorch implementation for our CVPR 2023 paper:

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Santhosh Kumar Ramakrishnan¹        Ziad Al-Halah²        Kristen Grauman¹,³
¹The University of Texas at Austin        ²University of Utah        ³FAIR, Meta AI
Project website: http://vision.cs.utexas.edu/projects/naq


Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (freeform text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
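As a rough illustration of the idea (not the paper's exact procedure), a timestamped narration can be turned into an NLQ-style training sample by reusing the narration text as the query and a window around the timestamp as the target. The sketch below assumes each narration carries a single timestamp; the fixed `half_width` window, the `QuerySample` fields, and the example narration are all hypothetical. The actual conversion is done by the repository's `utils/create_naq_dataset.py`.

```python
from dataclasses import dataclass


@dataclass
class QuerySample:
    query: str        # narration text reused as the language query
    start_sec: float  # start of the target temporal window
    end_sec: float    # end of the target temporal window


def narration_to_query(text, timestamp_sec, clip_duration_sec, half_width=2.0):
    """Build a (query, window) sample from one timestamped narration.

    half_width is a hypothetical fixed window radius in seconds; the window
    is clamped so it stays inside the clip.
    """
    start = max(0.0, timestamp_sec - half_width)
    end = min(clip_duration_sec, timestamp_sec + half_width)
    return QuerySample(query=text.strip(), start_sec=start, end_sec=end)


# Made-up narration, for illustration only.
sample = narration_to_query("C picks up a knife", timestamp_sec=1.0, clip_duration_sec=480.0)
print(sample)
```

Samples built this way have the same shape as annotated NLQ samples (a text query plus a temporal window), which is what allows them to be mixed into the training set.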



Installation

  • Clone this repository.

    git clone https://github.com/srama2512/NaQ.git
    export NAQ_ROOT=<PATH to cloned NaQ repository>
  • Create a conda environment.

    conda create --name naq python=3.8.5
    conda activate naq
  • Install PyTorch. While our experiments use CUDA 11.3, we expect other supported versions to work as well.

    conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
  • Install other dependencies.

    cd $NAQ_ROOT; pip install -r requirements.txt
  • Create a setup script ~/enable_naq.sh with the content below (and set appropriate paths). This sets up environment variables for NaQ experiments.

    # Add anaconda path
    # Activate conda environment
    source activate naq
    # Add cuda, cudnn paths
    export CUDA_HOME="<PATH TO CUDA-11.3>"
    export CUDNN_PATH="<PATH TO CUDNN compatible with CUDA-11.3>"
    export CUDNN_INCLUDE_DIR="$CUDNN_PATH/include"
    export CUDNN_LIBRARY="$CUDNN_PATH/lib"
    export CUDACXX="$CUDA_HOME/bin/nvcc"
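After sourcing the setup script, a quick check that the variables are actually visible to Python can save a failed run later. This is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and the variable list is simply the set of names exported in the steps above.

```python
import os

# Names exported in the clone step (NAQ_ROOT) and in ~/enable_naq.sh.
REQUIRED_VARS = [
    "NAQ_ROOT",
    "CUDA_HOME",
    "CUDNN_PATH",
    "CUDNN_INCLUDE_DIR",
    "CUDNN_LIBRARY",
    "CUDACXX",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All NaQ environment variables are set.")
```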

Dataset setup

  • Download the v1 version of the Ego4D episodic memory annotations following the official instructions and copy them to $NAQ_ROOT/data/(nlq|vq|moments)_*.json. For experiments on TaCOS, we provide reformatted TaCOS annotations here that are compatible with NLQ training. Download them to $NAQ_ROOT/data/tacos_*.json.

  • Prepare the NaQ datasets following instructions here.

  • Download video features for all clips used in the experiments. These are computed using the official checkpoints released for each method.

    Features     Destination                          Description                                           Dim   Size
    SlowFast     $NAQ_ROOT/data/features/slowfast     8x8 R101 backbone trained on Kinetics 400             2304  160G
    EgoVLP       $NAQ_ROOT/data/features/egovlp       TimeSformer backbone trained on EgoClip               256   6G
    InternVideo  $NAQ_ROOT/data/features/internvideo  D+A prefusion of VideoMAE backbones trained on Ego4D  2304  41G
    CLIP         $NAQ_ROOT/data/features/clip         ViT-B/16 backbone pre-trained using CLIP              512   6G

    We provide a downloader to download, extract, and move features to the corresponding destinations:

    python utils/download_features.py --feature_types <FEAT_TYPE_1> <FEAT_TYPE_2> ...

    where FEAT_TYPE can be slowfast, egovlp, clip, or internvideo. Based on our experiments, we recommend using InternVideo features for Ego4D and SlowFast features for TaCOS to get the best results.
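Before training, it can help to confirm that the feature directories listed above exist. The sketch below is an illustration only: the helper names are hypothetical, and the only facts it encodes are the destinations and dimensions from the feature table.

```python
import os
from pathlib import Path

# Feature type -> feature dimension, copied from the feature table.
FEATURE_DIMS = {
    "slowfast": 2304,
    "egovlp": 256,
    "internvideo": 2304,
    "clip": 512,
}


def feature_dir(naq_root, feat_type):
    """Return <naq_root>/data/features/<feat_type> for a known feature type."""
    if feat_type not in FEATURE_DIMS:
        raise ValueError(f"Unknown feature type: {feat_type}")
    return Path(naq_root) / "data" / "features" / feat_type


def missing_feature_dirs(naq_root, feat_types):
    """Return the feature types whose destination directory does not exist."""
    return [t for t in feat_types if not feature_dir(naq_root, t).is_dir()]


naq_root = os.environ.get("NAQ_ROOT", ".")
print("Missing feature directories:", missing_feature_dirs(naq_root, ["egovlp", "internvideo"]))
```

Any feature type reported as missing can be passed to `utils/download_features.py --feature_types`.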

Benchmarking models on NLQ

We perform NaQ training in two stages: (1) jointly train on the NLQ+NaQ dataset with large-batch training, and (2) fine-tune on the NLQ dataset with standard VSLNet training. The example below benchmarks models on the Ego4D NLQ dataset with EgoVLP features.


VSLNet training

Stage 1: Joint training on NLQ+NaQ dataset

bash VSLNet/scripts/train_naq.sh 0,1,2,3 nlq egovlp experiments/vslnet/egovlp/naq_joint_training 2.5

Stage 2: Fine-tune best checkpoint from stage-1 on NLQ dataset

bash VSLNet/scripts/finetune.sh 0 nlq egovlp experiments/vslnet/egovlp/nlq_finetuning 0.0001 $PRETRAINED_CKPT


Inference

bash VSLNet/scripts/infer.sh 0 nlq test egovlp experiments/vslnet/egovlp/nlq_finetuning

To participate in the Ego4D NLQ challenge, submit the inferred predictions at experiments/vslnet/egovlp/nlq_finetuning/checkpoints/vslnet_nlq_official_v1_egovlp_128_bert/model/<checkpoint_id>_test_result.json.
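Because the prediction file name depends on the checkpoint id, it can be convenient to search the experiment directory instead of typing the full path. This is a small sketch using only the standard library; the helper name `find_prediction_files` is hypothetical, and the `_test_result.json` suffix is taken from the path above.

```python
from pathlib import Path


def find_prediction_files(experiment_dir, suffix="_test_result.json"):
    """Return all files under experiment_dir ending in suffix, sorted by path."""
    root = Path(experiment_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*" + suffix) if p.is_file())


for path in find_prediction_files("experiments/vslnet/egovlp/nlq_finetuning"):
    print(path)
```

Passing a different `suffix` lets the same helper locate prediction files written by other inference scripts.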

ReLER training

Stage 1: Joint training on NLQ+NaQ dataset

bash ReLER/scripts/train_naq.sh 0,1,2,3,4,5,6,7 nlq egovlp experiments/reler/egovlp/naq_joint_training 2.5

Stage 2: Fine-tune best checkpoint from stage-1 on NLQ dataset

bash ReLER/scripts/finetune.sh 0 nlq egovlp experiments/reler/egovlp/nlq_finetuning 0.00001 $PRETRAINED_CKPT


Inference

bash ReLER/scripts/infer.sh 0 test egovlp experiments/reler/egovlp/nlq_finetuning/video_tef-vlen600_egovlp/model_<checkpoint_id>.t7

To participate in the Ego4D NLQ challenge, submit the inferred predictions at experiments/reler/egovlp/nlq_finetuning/video_tef-vlen600_egovlp/preds/<checkpoint_id>_test_preds.json.

To train on SlowFast / InternVideo features, replace egovlp with slowfast or internvideo above. To train on TaCOS, replace nlq with tacos.
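Swapping features or datasets means editing the same positional arguments in several commands, so it may help to generate them from one place. The sketch below only mirrors the argument order of the example commands above; the function names are hypothetical, and because this section does not document the meaning of the fifth positional argument (2.5 for joint training, 0.0001 or 0.00001 for fine-tuning), it is passed through verbatim as `fifth_arg`. The values shown are copied from the EgoVLP examples and may need to change for other settings.

```python
def train_naq_command(model, gpus, dataset, features, exp_dir, fifth_arg):
    """Stage 1 command, in the argument order of the train_naq.sh examples."""
    return ["bash", f"{model}/scripts/train_naq.sh", gpus, dataset, features, exp_dir, fifth_arg]


def finetune_command(model, gpu, dataset, features, exp_dir, fifth_arg, checkpoint):
    """Stage 2 command, in the argument order of the finetune.sh examples."""
    return ["bash", f"{model}/scripts/finetune.sh", gpu, dataset, features, exp_dir, fifth_arg, checkpoint]


features = "internvideo"  # replaces egovlp in the examples
stage1 = train_naq_command(
    "VSLNet", "0,1,2,3", "nlq", features,
    f"experiments/vslnet/{features}/naq_joint_training", "2.5",
)
stage2 = finetune_command(
    "VSLNet", "0", "nlq", features,
    f"experiments/vslnet/{features}/nlq_finetuning", "0.0001", "$PRETRAINED_CKPT",
)

# Print the commands so they can be reviewed and pasted into a shell.
print(" ".join(stage1))
print(" ".join(stage2))
```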

Pretrained models

We provide models pretrained using NaQ for different combinations of architectures and features here. These checkpoints can be used to reproduce results from the paper.

Ego4D NLQ 2023 challenge

  • Download the v2 version of the Ego4D episodic memory annotations following the official instructions and copy them to $NAQ_ROOT/data/(nlq|vq|moments)_<SPLIT>_v2.json.

  • Download the pre-created NaQ dataset for the v2 version following the instructions here. This should have already been downloaded if you followed option 1. Alternatively, follow the steps below.

    • Convert EgoClip narrations to NaQ dataset.
      cd $NAQ_ROOT
      python utils/create_naq_dataset.py --type nlq --em_version v2
    • Prepare datasets for NLQ training.
      cd $NAQ_ROOT
      # Prepare NLQ dataset
      python utils/prepare_ego4d_dataset.py \
          --input_train_split data/nlq_train_v2.json \
          --input_val_split data/nlq_val_v2.json \
          --input_test_split data/nlq_test_unannotated_v2.json \
          --output_save_path data/dataset/nlq_official_v2
      # Prepare NLQ + NaQ dataset
      python utils/prepare_ego4d_dataset.py \
          --input_train_split data/nlq_aug_naq_train_v2.json \
          --input_val_split data/nlq_val_v2.json \
          --input_test_split data/nlq_test_unannotated_v2.json \
          --output_save_path data/dataset/nlq_aug_naq_official_v2
    • The video features for the challenge clips are included in the files downloaded in the earlier section.
  • We found that the combination of ReLER architecture + InternVideo features + NaQ augmentation performed best on the v2 validation split (we call this NaQ++). This combination is set as the official baseline for the Ego4D NLQ 2023 challenge.

    Stage 1: Joint training on NLQ+NaQ dataset

    cd $NAQ_ROOT
    bash ReLER/scripts/train_naq.sh 0,1,2,3,4,5,6,7 nlq_v2 internvideo experiments/challenge_2023/reler_internvideo/naq_joint_training 5.0

    Stage 2: Fine-tune best checkpoint from stage-1 on NLQ dataset

    cd $NAQ_ROOT
    bash ReLER/scripts/finetune.sh 0 nlq_v2 internvideo experiments/challenge_2023/reler_internvideo/nlq_finetuning 0.0001 $PRETRAINED_CKPT


    Inference

    cd $NAQ_ROOT
    bash ReLER/scripts/infer.sh 0 test internvideo experiments/challenge_2023/reler_internvideo/nlq_finetuning/video_tef-vlen600_internvideo/model_<checkpoint_id>.t7

    Submission: Submit the inferred predictions at experiments/challenge_2023/reler_internvideo/nlq_finetuning/video_tef-vlen600_internvideo/preds/<checkpoint_id>_test_preds.json.

    Pretrained models: We provide pre-trained weights for NaQ++ below. Note that we also include a checkpoint for the VSLNet architecture.

    Model            Val MR@1  Val MR@5  Test MR@1  Test MR@5
    NaQ++ (VSLNet)   16.02     28.46     13.96      21.78
    NaQ++ (ReLER)    18.75     24.24     17.67      20.72
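This section does not define MR@k. Assuming it denotes recall@k averaged over temporal IoU thresholds of 0.3 and 0.5 (an assumption, not confirmed here), the metric in the table above can be sketched as follows. The function names and the toy predictions are hypothetical, and the official challenge evaluation code should be treated as the reference.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) windows given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds, gts, k, iou_threshold):
    """Percentage of queries with a top-k prediction whose IoU >= iou_threshold."""
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


def mean_recall_at_k(ranked_preds, gts, k, thresholds=(0.3, 0.5)):
    """Recall@k averaged over the assumed IoU thresholds."""
    return sum(recall_at_k(ranked_preds, gts, k, t) for t in thresholds) / len(thresholds)


# Toy data: two queries, each with two ranked predicted windows.
ranked_preds = [[(0.0, 10.0), (20.0, 30.0)], [(5.0, 6.0), (0.0, 4.0)]]
gts = [(0.0, 8.0), (0.0, 10.0)]
print(mean_recall_at_k(ranked_preds, gts, k=1))
print(mean_recall_at_k(ranked_preds, gts, k=2))
```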


