This is my implementation of cross-modal retrieval using Collective Matrix Factorization Hashing (CMFH), originally described in link. CMFH generates unified embeddings for different modalities of data such that semantically similar items lie closer together in the common embedding space (e.g., a video and its corresponding text). This repo demonstrates training and testing for video-text and text-video retrieval on the MSR-VTT-10K dataset.
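For intuition, here is a minimal sketch (not the training code in this repo) of how CMFH-style unified codes are produced: each modality has its own projection matrix that maps modality-specific features into a shared space, which is then binarized into a common hash code. The names P1 and P2 follow this repo's convention; the dimensions and random matrices below are purely illustrative stand-ins for the learned projections.

```python
import numpy as np

def unified_hash(x, P):
    """Project a single feature vector x (d,) with projection matrix P (bits, d)
    and binarize it into a +1/-1 hash code."""
    return np.sign(P @ x)

# Illustrative dimensions: d1/d2 are modality feature lengths, bits is the code length.
d1, d2, bits = 2048, 300, 64
P1 = np.random.randn(bits, d1)   # stand-in for the learned video projection
P2 = np.random.randn(bits, d2)   # stand-in for the learned text projection

video_feat = np.random.randn(d1)
text_feat = np.random.randn(d2)

code_v = unified_hash(video_feat, P1)
code_t = unified_hash(text_feat, P2)

# Cross-modal similarity is measured in the shared code space,
# e.g. via Hamming distance between the two binary codes.
hamming = np.count_nonzero(code_v != code_t)
print(f"Hamming distance between video and text codes: {hamming}/{bits}")
```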
To train CMFH from scratch, follow these steps:
- Since generating the feature matrices is time-consuming, you may want to download the precomputed feature matrices: X1 for training videos, X2 for the corresponding annotated texts, X1_test for test videos, and X2_test for their corresponding annotated texts, and put them in the feature_matrices folder. Each feature matrix has dimension (d * n), where d is the length of a single video or text embedding and n is the number of samples in the corresponding set. (A quick sanity check on the downloaded matrices is sketched after these steps.)
If you prefer to generate the feature matrices yourself, download the training videos from here and put them in the respective folder.
- Run the following command:
`python train.py`
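Before launching training, it can help to sanity-check the downloaded feature matrices. The snippet below is only a sketch: it assumes the matrices are stored as X1.npy and X2.npy inside the feature_matrices folder, which may differ from the exact file names you downloaded.

```python
import numpy as np

# Quick sanity check on the downloaded feature matrices before training.
# File names below are assumed; adjust them to match what you saved.
X1 = np.load("feature_matrices/X1.npy")   # video features, shape (d1, n)
X2 = np.load("feature_matrices/X2.npy")   # text features,  shape (d2, n)

# Both modalities must cover the same n training samples (columns).
assert X1.shape[1] == X2.shape[1], "X1 and X2 must have the same number of samples"
print(f"Video features: {X1.shape}, text features: {X2.shape}")
```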
If you are only interested in testing the joint embeddings for video-text and text-video retrieval, follow these steps:
- Download the pre-trained projection matrices P1 from here and P2 from here, and save them as P1.npy and P2.npy respectively in the projection_matrices folder.
- Run the cells in the test notebook as instructed to launch the webapp. You can then enter the YouTube ID of a short video (< 1 min) to get the matching texts from the MSR-VTT-10K training annotations, or enter a sentence to get the top 10 relevant video YouTube URLs from the MSR-VTT-10K training videos.
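For reference, here is a rough sketch of what the text-to-video retrieval step looks like outside the notebook, assuming the projection matrices are loaded from projection_matrices/ and the video feature matrix X1 from feature_matrices/. How a query sentence is turned into a feature vector (and how YouTube IDs map to columns of X1) is handled by the notebook/webapp and is not shown here.

```python
import numpy as np

# Load the pre-trained projections and the training-video feature matrix.
P1 = np.load("projection_matrices/P1.npy")   # video projection, shape (bits, d1)
P2 = np.load("projection_matrices/P2.npy")   # text projection,  shape (bits, d2)
X1 = np.load("feature_matrices/X1.npy")      # video features,   shape (d1, n)

# Project all candidate videos into the shared space and binarize.
video_codes = np.sign(P1 @ X1)               # shape (bits, n)

def retrieve_top_k(text_feat, top_k=10):
    """Return indices of the top_k videos closest to a text query feature
    vector (shape (d2,)), ranked by Hamming distance in the code space."""
    query_code = np.sign(P2 @ text_feat)                               # (bits,)
    hamming = np.count_nonzero(video_codes != query_code[:, None], axis=0)
    return np.argsort(hamming)[:top_k]

# Example usage: `q` must come from the same text-embedding pipeline
# that was used to build X2.
# top_indices = retrieve_top_k(q)
```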