How to train the "shift" and "cam" models for sound source localization?
yxixi opened this issue · 4 comments
First of all, thank you for your earlier reply! Now I've got two more questions about your great work.
I've noticed that there are three models: "shift", "cam" and "sep". To my knowledge, the "sep" model is for source separation, and the "cam" model is for localization. And there are pretrained model files for these models, such as:
model_file = '../results/nets/shift/net.tf-650000'
model_file = '../results/nets/cam/net.tf-675000'
Now I wonder how to train the "shift" model and "cam" model for sound source localization. Could you give the detailed method to call the training function in shift_net.py? Which dataset should I use?
Looking forward to your reply :)
Sorry if that was confusing. We trained the "shift" model on videos from AudioSet: https://research.google.com/audioset/ for 650k iterations. Then, to train the CAM model we removed a spatial stride from the model and fine-tuned it for 25k more iterations (that gives it higher spatial resolution). And yes, "sep" is for the source separation. Pretrained models for both can be downloaded using the ./download_models.sh script. Hope that helps.
@andrewowens
Thank you for your reply! What confuses me most is the question below:
I train the "sep" model like this
python -c "import sep_params, sourcesep; sourcesep.train(sep_params.full(num_gpus=3), [0, 1, 2], restore = False)"
But how to run the "shift" model? Looking forward to your reply :)
For training the shift model, the code is very similar:
python -c "import shift_params, shift_net; shift_net.train(shift_params.shift_v1(num_gpus=3), [0, 1, 2], restore = False)"
As in the source separation case, you'll have to rewrite the I/O code (my code uses TFRecordReader, but this is very space inefficient, and there are probably better ways to do it).
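To make the I/O rewrite concrete, here is a minimal sketch of length-prefixed record storage in plain Python. This is a stand-in for TFRecord (which uses the same length-prefixed idea), not the repo's actual pipeline; all file names and record fields below are illustrative:

```python
import json
import os
import struct
import tempfile

def write_records(path, records):
    """Write each record (a dict) as length-prefixed JSON, TFRecord-style."""
    with open(path, "wb") as f:
        for rec in records:
            payload = json.dumps(rec).encode("utf-8")
            f.write(struct.pack("<I", len(payload)))  # 4-byte length header
            f.write(payload)

def read_records(path):
    """Yield the records back in order by reading length headers."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return
            (n,) = struct.unpack("<I", header)
            yield json.loads(f.read(n).decode("utf-8"))

# Hypothetical training examples: per-clip metadata for the data loader.
examples = [
    {"video": "clip_000.mp4", "label": 1},
    {"video": "clip_001.mp4", "label": 0},
]
path = os.path.join(tempfile.mkdtemp(), "train.records")
write_records(path, examples)
assert list(read_records(path)) == examples
```

As the comment above notes, TFRecord stores whole serialized examples this way, which is why it is space-inefficient for video: raw frames and audio get duplicated into the record file rather than referenced on disk.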
As for running a trained network, please see shift_example.py for an example of generating the CAM (you'd have to modify it slightly, though, to get a shifted/not-shifted prediction, if that's what you want).
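The shifted/not-shifted setup can be sketched as follows: the positive example is the audio as-is, and the negative example is the same audio offset in time relative to the video. This is an illustrative toy function, not code from shift_example.py:

```python
def shift_audio(samples, shift_secs, sample_rate):
    """Offset the audio track by shift_secs, zero-padding to keep length."""
    n = int(round(shift_secs * sample_rate))
    if n >= 0:
        return [0.0] * n + samples[: len(samples) - n]
    return samples[-n:] + [0.0] * (-n)

sr = 4  # toy sample rate for illustration
audio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

aligned = audio                         # in-sync pair -> "not shifted"
shifted = shift_audio(audio, 0.5, sr)   # 0.5 s = 2 samples of misalignment
assert shifted == [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

A trained shift network would then be run on both (video, aligned) and (video, shifted) pairs, and its output compared to decide whether the audio is in sync.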
Got it. Thanks for your reply:)