Multi-Modal Object Detection and Depth Estimation with Audio Queries

Overall, this study demonstrates the value of auditory perception in robotic systems: by integrating audio information into the spatial domain, it harnesses auditory cues for robotic object localization and identification.

👂 + 👀 using 🤗

Added an SRGAN script to upscale images 4x.

Download the darknet53 weights using weights/download_weights.sh. Before running the script, comment out the yolov3 and yolov3-tiny weight downloads if they are not already commented out.

Download the Wav2Vec 2.0 fine-tuned checkpoints for this task. link

Added custom torchsummary to include custom 🤗 modules.

Run a single pass with:

python3 train.py --data <location of mapping file> --model <location of the cfg file> --image <location of darknet53 C binary> --audio <location of fine-tuned Wav2Vec 2.0 checkpoints>
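The four flags in the command above could be parsed and sanity-checked roughly as follows. This is only a minimal sketch and not the actual contents of train.py: the flag names come from the command line shown here, while build_parser, missing_paths, the help strings, and the decision to make every flag required are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser mirroring the four flags of the train.py command."""
    parser = argparse.ArgumentParser(
        description="Single training pass (illustrative sketch only)."
    )
    # Flag names are taken from the README; making them required is an assumption.
    parser.add_argument("--data", type=Path, required=True,
                        help="location of the mapping file")
    parser.add_argument("--model", type=Path, required=True,
                        help="location of the cfg file")
    parser.add_argument("--image", type=Path, required=True,
                        help="location of the darknet53 C binary")
    parser.add_argument("--audio", type=Path, required=True,
                        help="location of the fine-tuned Wav2Vec 2.0 checkpoints")
    return parser


def missing_paths(args):
    """Return the names of the flags whose path does not exist on disk."""
    return [name for name, path in vars(args).items() if not path.exists()]
```

Calling build_parser().parse_args() at the top of a script and aborting when missing_paths(args) is non-empty would catch a mistyped path before any weights or checkpoints are loaded.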