This repo contains code written for a project completed as part of my master's thesis. The aim of the project was to classify a limited number of room types and 'sonify' an estimate of their reverberation. This was achieved by playing a test sound (either a click or a short voice recording) mixed with the reverberation preset that best matched that room type according to an experienced audio engineer.
The model used is a GoogLeNet model trained on the Places365 dataset. This model and other models trained on Places365 are available to download here: https://github.com/CSAILVision/places365
Dependency | Install Guide/Notes |
---|---|
Python 3.5+ | |
CUDA | https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html |
cuDNN | https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html |
Caffe 1.0.0 | https://github.com/adeelz92/Install-Caffe-on-Ubuntu-16.04-Python-3 Follow the steps carefully, since they depend on your CUDA, cuDNN and Python versions. |
OpenCV | Install using pip3, not as part of the OpenVINO install. |
OpenVINO 2019 R1.1 | https://software.intel.com/en-us/articles/get-started-with-neural-compute-stick Untick the OpenCV option during installation. |
The audio files are assumed to be in a directory labelled wav_files/ in the top directory of this repository. The wav_files/ directory should contain two directories, clicks/ and voice/, containing the sample sounds mixed with reverberation presets. File names are assumed to be the room type label with a .wav extension.
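The layout described above implies a simple lookup from a room type label to a preset file. Below is a minimal sketch of that lookup, assuming labels map directly onto file names; `preset_path` is a hypothetical helper, not a function from this repo, and "kitchen" is only an illustrative label.

```python
from pathlib import Path


def preset_path(label, preset_dir="wav_files", voice=False):
    """Return the expected path of the preset recording for a room type label.

    Hypothetical helper: assumes <preset_dir>/<clicks|voice>/<label>.wav,
    as described in the text above.
    """
    subdir = "voice" if voice else "clicks"
    return Path(preset_dir) / subdir / (label + ".wav")


if __name__ == "__main__":
    # "kitchen" is only an illustrative label.
    print(preset_path("kitchen"))
    print(preset_path("kitchen", voice=True))
```

Keeping the lookup in one place means that a missing file can be detected (for example with `preset_path(label).exists()`) before playback is attempted.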
scene_classification
├──scene-classification
| ├── ncs_classify.py
| ├── gpu_classify.py
| ├── test_places.py
| └── demo_ncs.sh
├── models
| ├── deploy_googlenet_places365.prototxt
| ├── googlenet_places365.caffemodel
| ├── googlenet_places365.xml
| ├── googlenet_places365.bin
| └── googlenet_places365.mapping
├── wav_files
| ├──voice
| └──clicks
├── categories_places365.txt
├── groups.csv
├── requirements.txt
└── README.md
ncs_classify.py continuously classifies the scene type from a video or camera feed. An Intel Neural Compute Stick is used to perform inference. Arguments:
Argument | Description |
---|---|
-i, --input | 'cam' or path to a video file |
-d, --delay | Number of seconds delay between inference passes and sound playback |
--csv | CSV file containing class groupings. Default is groups.txt |
-p, --preset_file_dir | Path to directory containing preset audio files. Default is wav_files directory |
-v, --voice | Option to play reverb presets with voice. Default is to use clicks. |
-m, --model | Path to an .xml file with a trained model. Default is models/googlenet_places365.xml |
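For orientation, the options listed above could be declared with `argparse` roughly as follows. This is a hypothetical reconstruction from the descriptions in this README, not the parser actually used in ncs_classify.py; in particular, the `float` type for the delay and the `store_true` behaviour of the voice flag are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the documented options; the real
    # parser in ncs_classify.py may differ in types and required flags.
    parser = argparse.ArgumentParser(
        description="Scene classification on a Neural Compute Stick")
    parser.add_argument("-i", "--input",
                        help="'cam' or path to a video file")
    parser.add_argument("-d", "--delay", type=float,  # type is an assumption
                        help="seconds between inference passes and playback")
    parser.add_argument("--csv", default="groups.txt",
                        help="CSV file containing class groupings")
    parser.add_argument("-p", "--preset_file_dir", default="wav_files",
                        help="directory containing preset audio files")
    parser.add_argument("-v", "--voice", action="store_true",
                        help="play voice presets instead of clicks")
    parser.add_argument("-m", "--model",
                        default="models/googlenet_places365.xml",
                        help="path to an .xml file with a trained model")
    return parser


if __name__ == "__main__":
    # Parse an example command line instead of sys.argv.
    args = build_parser().parse_args(["-i", "cam", "-d", "2", "-v"])
    print(args.input, args.delay, args.voice, args.model)
```

Under these assumptions, an invocation using camera input, a two second delay and voice presets would look like `python3 ncs_classify.py -i cam -d 2 -v`.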
gpu_classify.py continuously classifies the scene type from a video or camera feed. A GPU is used to perform inference. The input is either a path to a video file or 'cam' to specify camera input.
Example execution:
python3 gpu_classify.py cam
test_places.py runs inference on an input image and prints the top 5 results.
Example execution:
python3 test_places.py sampleimage.jpg
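Selecting the top 5 results from a classifier's output amounts to ranking the class probabilities and keeping the first five. The sketch below shows that step in plain Python; `top_k` is a hypothetical helper, and the labels and scores are made up for illustration rather than taken from test_places.py or from Places365.

```python
def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    labels = ["kitchen", "corridor", "church",
              "office", "bathroom", "stairwell"]
    probabilities = [0.05, 0.40, 0.02, 0.30, 0.20, 0.03]
    for label, prob in top_k(probabilities, labels):
        print("{:.2f} {}".format(prob, label))
```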
demo_ncs.sh is a shell script that runs ncs_classify.py.