Multimodal-Matching-on-images-and-audios

Cross-modality (visual-auditory) Metric Learning Project


1. Target of This Repo

This repo is set up to help you complete the final project of the class "Introduction to Visual-Auditory Information System". It mainly consists of three parts: an auditory feature extractor (afeat_extractor), a visual feature extractor (vfeat_extractor), and a simple project demo (proj_demo) that predicts the similarity between an audio clip and a silent video.

2. Code Description

2.1 Feature Extractors

afeat_extractor and vfeat_extractor are used to extract the auditory and visual features of a video, respectively. Specifically, in our project we extract a 128-d auditory feature and a 1024-d visual feature for every second, over 120 seconds in total. Therefore, every video corresponds to a 120×128 auditory feature matrix and a 120×1024 visual feature matrix, each saved as a NumPy array file (*.npy).

  • The auditory feature is extracted by a VGG-like CNN model (VGGish, implemented in TensorFlow).

  • The visual feature is extracted by the Inception-v3 model (implemented in PyTorch).
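
Once both extractors have run, the saved features can be sanity-checked with NumPy. A minimal sketch, assuming the feature files are named afeat.npy and vfeat.npy as in the dataset layout described in Section 3:

```python
import numpy as np

# Paths are placeholders: point them at the output of your own extraction run.
afeat = np.load("afeat.npy")  # auditory feature: one 128-d vector per second
vfeat = np.load("vfeat.npy")  # visual feature: one 1024-d vector per second

assert afeat.shape == (120, 128), afeat.shape
assert vfeat.shape == (120, 1024), vfeat.shape
print(afeat.shape, vfeat.shape)  # (120, 128) (120, 1024)
```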

2.2 How to use the feature extractors

Before using them to extract features, first download the pretrained VGGish model and the pretrained Inception model, and then put them under the folders "afeat_extractor/" and "vfeat_extractor/pretrained/", respectively.

Moreover, you should install the required dependencies, such as PyTorch and TensorFlow. The detailed requirements can be found in the subfolders "afeat_extractor" and "vfeat_extractor".
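
As a quick pre-flight check, you can verify that the pretrained weights are in place before launching the extractors. The checkpoint file names below are assumptions for illustration; see the READMEs in the two subfolders for the exact names each script expects:

```python
import os

# Assumed checkpoint file names -- replace with the actual names from the
# afeat_extractor / vfeat_extractor documentation.
required = [
    "afeat_extractor/vggish_model.ckpt",            # pretrained VGGish weights (TensorFlow)
    "vfeat_extractor/pretrained/inception_v3.pth",  # pretrained Inception-v3 weights (PyTorch)
]
for path in required:
    print(("found  " if os.path.isfile(path) else "MISSING") + ": " + path)
```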

2.3 Project Demo

proj_demo provides a simple example of learning a similarity metric between the 120×1024 visual feature and the 120×128 auditory feature. Note: the provided demo is implemented in PyTorch.
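
For orientation, here is one minimal sketch of what such a similarity metric can look like; this is not the exact architecture of the provided demo. Each modality is averaged over its 120 time steps, projected into a shared embedding space, and scored with cosine similarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityNet(nn.Module):
    """Embed both modalities into a shared space; score pairs by cosine similarity."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.vproj = nn.Linear(1024, embed_dim)  # visual branch
        self.aproj = nn.Linear(128, embed_dim)   # auditory branch

    def forward(self, vfeat, afeat):
        # vfeat: (batch, 120, 1024), afeat: (batch, 120, 128)
        v = self.vproj(vfeat.mean(dim=1))  # temporal average pooling, then projection
        a = self.aproj(afeat.mean(dim=1))
        return F.cosine_similarity(v, a, dim=1)  # (batch,) similarity in [-1, 1]

model = SimilarityNet()
scores = model(torch.randn(4, 120, 1024), torch.randn(4, 120, 128))
print(scores.shape)  # torch.Size([4])
```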

3. Dataset

The provided training dataset includes 1300 video folders, each of which contains five parts:

  • frames: 125 video frames, sampled at a rate of 1 frame per second
  • *.mp4: a 125-second video file without audio
  • *.wav: the corresponding 125-second audio file
  • afeat.npy: the auditory feature as a NumPy array (120×128)
  • vfeat.npy: the visual feature as a NumPy array (120×1024)

Note: we extract 125 seconds of video and audio just to ensure that we can obtain a full 120 seconds of features.
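
To train a metric-learning model on this layout, the features can be wrapped in a PyTorch Dataset that yields both matching and mismatched audio/video pairs. A minimal sketch, assuming root points at the directory holding the 1300 video folders:

```python
import os
import random
import numpy as np
import torch
from torch.utils.data import Dataset

class VideoAudioPairs(Dataset):
    """Even indices yield a matching (vfeat, afeat) pair with label 1;
    odd indices pair the video with audio from a random other folder (label 0)."""

    def __init__(self, root):
        self.dirs = sorted(
            os.path.join(root, d)
            for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))
        )

    def __len__(self):
        return 2 * len(self.dirs)

    def __getitem__(self, idx):
        i, mismatch = divmod(idx, 2)
        vfeat = np.load(os.path.join(self.dirs[i], "vfeat.npy"))
        # Pick the same folder's audio (positive) or another folder's (negative).
        j = i if not mismatch else random.choice(
            [k for k in range(len(self.dirs)) if k != i])
        afeat = np.load(os.path.join(self.dirs[j], "afeat.npy"))
        return (torch.from_numpy(vfeat).float(),
                torch.from_numpy(afeat).float(),
                torch.tensor(float(j == i)))
```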


The full dataset containing all five parts takes about 60 GB of disk space and can be downloaded through the campus network. If you only need the extracted auditory and visual features, you can download the feature-only dataset (about 150 MB) from Baidu Yun.

4. Acknowledgements

  • The original implementation of the visual feature extractor can be found at this link.

  • The original implementation of the auditory feature extractor can be found at this link.

5. Q&A

If you have any questions, contact us by e-mail or open a new issue under this repo!