Cross-Modal-Retrieval-using-CMFH

This is my implementation of cross-modal retrieval using Collective Matrix Factorization Hashing (CMFH), originally described in link. CMFH generates unified embeddings for different modalities of data such that semantically similar items lie close together in the common embedding space (for example, a video and its corresponding text). In this repo, we demonstrate training and testing for video-to-text and text-to-video retrieval on the MSR-VTT-10K dataset.
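
As a quick illustration of the idea (not this repo's actual code), the sketch below assumes projection matrices P1 (video) and P2 (text) have already been learned, projects raw features into the shared space, binarizes them into hash codes, and retrieves neighbours by Hamming distance. The mean-thresholding binarization and the Hamming ranking are illustrative assumptions.

```python
# A minimal sketch of the shared embedding space idea, assuming projection
# matrices P1 (k x d_video) and P2 (k x d_text) have already been learned.
# The mean-thresholding binarization and Hamming ranking are illustrative
# assumptions, not necessarily what this repo's code does.
import numpy as np

def embed(X, P):
    """Project (d, n) features X into the shared k-dim space with a (k, d) matrix P."""
    return P @ X                                   # shape (k, n)

def to_hash(Y):
    """Binarize shared-space embeddings into +/-1 hash codes."""
    return np.sign(Y - Y.mean(axis=1, keepdims=True))

def nearest(query_code, gallery_codes, top=10):
    """Rank gallery columns by Hamming distance to one (k,) query code."""
    dists = (query_code[:, None] != gallery_codes).sum(axis=0)
    return np.argsort(dists)[:top]
```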

To train CMFH from scratch, follow these steps:

  1. Since generating the feature matrices is a time-consuming task, you may want to download the precomputed feature matrices X1 for training videos, X2 for the corresponding annotated texts, X1_test for test videos, and X2_test for the corresponding annotated texts, and put them in the feature_matrices folder. Feature matrices have dimension (d * n), where d is the length of the embedding of a single video or text and n is the number of samples in the corresponding set. (A sketch of loading these matrices and training on them follows the command below.)

If you would rather generate the feature matrices yourself, you can download the training videos from here and put them in the respective folder.

  2. Run the following command:

```
python train.py
```
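
For reference, here is a minimal sketch of what such a training script could look like, assuming it follows the standard CMFH alternating optimization: per-modality basis matrices U1 and U2, a shared latent code matrix V, and linear projections P1 and P2. The hyperparameters (k, lam, mu, gamma, n_iters), the .npy file names, and the exact update rules are assumptions; the actual train.py may differ.

```python
# A minimal sketch of CMFH training via alternating closed-form updates,
# assuming the objective from the CMFH paper. Hyperparameters and file names
# are assumptions, not necessarily what train.py uses.
import numpy as np

def train_cmfh(X1, X2, k=64, lam=1.0, mu=1.0, gamma=0.01, n_iters=50):
    """X1: (d1, n) video features, X2: (d2, n) text features, columns aligned."""
    d1, n = X1.shape
    d2, _ = X2.shape
    rng = np.random.default_rng(0)
    V = rng.standard_normal((k, n))        # shared latent codes
    P1 = rng.standard_normal((k, d1))      # video -> shared space
    P2 = rng.standard_normal((k, d2))      # text  -> shared space
    I_k = np.eye(k)
    for _ in range(n_iters):
        # Update per-modality basis matrices U1, U2 (ridge-regularized least squares).
        U1 = X1 @ V.T @ np.linalg.inv(V @ V.T + gamma * I_k)
        U2 = X2 @ V.T @ np.linalg.inv(V @ V.T + gamma * I_k)
        # Update the shared latent codes V using both modalities and both projections.
        A = U1.T @ U1 + lam * U2.T @ U2 + (2 * mu + gamma) * I_k
        B = U1.T @ X1 + lam * U2.T @ X2 + mu * (P1 @ X1 + P2 @ X2)
        V = np.linalg.solve(A, B)
        # Update the projections that map raw features into the shared space.
        P1 = V @ X1.T @ np.linalg.inv(X1 @ X1.T + gamma * np.eye(d1))
        P2 = V @ X2.T @ np.linalg.inv(X2 @ X2.T + gamma * np.eye(d2))
    return P1, P2, V

if __name__ == "__main__":
    X1 = np.load("feature_matrices/X1.npy")    # (d1, n) training video features
    X2 = np.load("feature_matrices/X2.npy")    # (d2, n) training text features
    P1, P2, _ = train_cmfh(X1, X2)
    np.save("projection_matrices/P1.npy", P1)
    np.save("projection_matrices/P2.npy", P2)
```

The ridge term gamma keeps the matrix inversions well conditioned, while mu trades off reconstructing the shared codes against how well the linear projections can reproduce them.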

If you are only interested in testing the joint embeddings for video-to-text and text-to-video retrieval, follow these steps:

  1. Download the pre-trained projection matrices P1 from here and P2 from here, and save them as P1.npy and P2.npy respectively in the projection_matrices folder.
  2. Run the cells in the test notebook as instructed to launch the webapp. You can enter the YouTube ID of a short video (< 1 min) to get the matching texts from the MSR-VTT-10K training annotations, or enter a sentence to get the top 10 most relevant video YouTube URLs from the MSR-VTT-10K training videos.
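
For a rough picture of the text-to-video direction, the sketch below loads the pre-trained projection matrices, projects the training-video features and a query text feature into the common space, and ranks videos by cosine similarity. The query's text feature extraction and the webapp itself are not shown, and ranking by cosine similarity (rather than hash codes) is an assumption.

```python
# A minimal sketch of text-to-video retrieval with the pre-trained projection
# matrices, assuming a feature vector for the query sentence is already
# available. Cosine similarity in the common space is used as the ranking
# score; the actual notebook may rank with hash codes instead.
import numpy as np

P1 = np.load("projection_matrices/P1.npy")     # (k, d_video)
P2 = np.load("projection_matrices/P2.npy")     # (k, d_text)
X1 = np.load("feature_matrices/X1.npy")        # (d_video, n) training video features

gallery = P1 @ X1                              # video embeddings in the common space
gallery /= np.linalg.norm(gallery, axis=0, keepdims=True)

def top_k_videos(text_feature, k=10):
    """Rank training videos for one text feature vector of shape (d_text,)."""
    q = P2 @ text_feature                      # project the text into the common space
    q /= np.linalg.norm(q)
    scores = q @ gallery                       # cosine similarities against all videos
    return np.argsort(-scores)[:k]             # indices of the k best-matching videos
```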