PyTorch implementation of the DAVEnet (Deep Audio-Visual Embedding network) model, as described in
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass, "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input," ECCV 2018.
Dependencies:
- pytorch
- torchvision
- librosa
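Before training, it can help to confirm that these packages are importable in the active environment. The following is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and it assumes the usual import names (`torch` for pytorch, plus `torchvision` and `librosa`).

```python
import importlib.util

# Import names for the listed dependencies.
# Assumption: the "pytorch" package is imported as "torch".
REQUIRED_MODULES = ["torch", "torchvision", "librosa"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks the modules up and does not import them, so it runs quickly and does not verify versions.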
You will need the PlacesAudio400k spoken caption corpus in addition to the Places205 image dataset:
http://groups.csail.mit.edu/sls/downloads/placesaudio/
Please follow the instructions provided in the PlacesAudio400k download package on how to configure and specify the dataset .json files.
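Once the .json files are configured, a quick look at their top-level structure can catch a wrong or truncated file before training starts. The sketch below is hypothetical and deliberately assumes nothing about the schema, which is defined by the download package; `summarize_json` is an illustrative helper, not a function from this repository.

```python
import json


def summarize_json(path):
    """Load a JSON file and describe its top-level structure.

    No schema is assumed: the function only reports whether the top
    level is a dict (with its keys), a list (with its length), or a
    scalar value.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {"type": "dict", "keys": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "list", "length": len(data)}
    return {"type": type(data).__name__}
```

For example, `summarize_json("train.json")` returns a small dictionary that can be compared against what the dataset instructions describe.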
To train a model, run:
python run.py train.json --data-val val.json
Here, train.json and val.json are the dataset files included in the PlacesAudio400k dataset.
See the run.py script for more training options.
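If training is launched from another Python script, the same command can be assembled programmatically. This is a rough sketch under stated assumptions: `build_train_command` and `launch_training` are hypothetical helpers, and the only arguments used are the ones shown in the command above. Any other option should be taken from run.py itself.

```python
import os
import subprocess
import sys


def build_train_command(train_json, val_json, script="run.py"):
    """Build the argument list for the training command shown above."""
    # Uses the current interpreter so the same environment is reused.
    return [sys.executable, script, train_json, "--data-val", val_json]


def launch_training(train_json, val_json, script="run.py"):
    """Check that the inputs exist, then run the training script."""
    for path in (script, train_json, val_json):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(build_train_command(train_json, val_json, script), check=True)
```

Checking the paths first gives a clear error for a missing file before any training process is started.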