/Jointly-Discovering-Visual-Objects-and-Spoken-Words

an implementation for paper Jointly Discovering Visual Objects and Spoken Words

Primary LanguagePython

Jointly-Discovering-Visual-Objects-and-Spoken-Words

Requirement

Python 3.6, Tensorflow 1.8, wavio, python_speech_features

How to run:

1) download flickr8k speech caption files and image files
2) In the data folder, flickr8k.pkl provides paired information. Details of how to use this pickle file can be found in main_SISA or MISA python file.

3) python main_SISA/MISA.py

Experiment

Speech captions retrieve images for Flickr8k dataset:

this result is on test dataset, which is the last 1000 images and captions

R@1: 0.027, R@5: 0.127, R@10:0.245
Note: still working in progress

TODO list

1) image to caption retrieval
2) ...