Reference:
P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, “End-to-end binaural sound localisation from the raw waveform,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 451–455.
Only WaveLoc-GTF is implemented
BRIR
Surrey binaural room impulse response (BRIR) database, including an anechoic room and 4 reverberant rooms.
| Room | A | B | C | D |
|---|---|---|---|---|
| RT60 (s) | 0.32 | 0.47 | 0.68 | 0.89 |
| DRR (dB) | 6.09 | 5.31 | 8.82 | 6.12 |
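As a rough illustration of how binaural data can be generated from these BRIRs, the sketch below convolves a mono speech signal with a two-channel BRIR to produce a binaural mixture. The file paths and BRIR file layout are placeholders, not the repository's actual data organisation.

```python
# Minimal sketch: spatialise a mono speech signal with a 2-channel BRIR.
# Paths below are placeholders; the actual TIMIT/Surrey file layout differs.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read('speech.wav')               # mono speech signal
brir, fs_brir = sf.read('brir_roomA_30deg.wav')  # shape (n_taps, 2): left/right ear
assert fs == fs_brir, 'speech and BRIR must share the same sample rate'

left = fftconvolve(speech, brir[:, 0])
right = fftconvolve(speech, brir[:, 1])
binaural = np.stack([left, right], axis=1)       # (n_samples, 2)

# Normalise to avoid clipping before saving the binaural mixture.
binaural /= np.max(np.abs(binaural)) + 1e-8
sf.write('binaural_roomA_30deg.wav', binaural, fs)
```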
Sound source
TIMIT database
Sentences per azimuth
| Train | Validate | Evaluate |
|---|---|---|
| 24 | 6 | 15 |
For each reverberant room used for evaluation, the remaining 3 reverberant rooms and the anechoic room are used for training (see the sketch below).
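A small sketch of this leave-one-room-out split, with room names as used above; the helper is illustrative, not part of the repository:

```python
# Illustrative room split: for each reverberant test room, train on the
# anechoic room plus the remaining three reverberant rooms.
REVERB_ROOMS = ['A', 'B', 'C', 'D']

def make_room_split(test_room):
    """Return (train_rooms, test_room) for one evaluation condition."""
    train_rooms = ['anechoic'] + [r for r in REVERB_ROOMS if r != test_room]
    return train_rooms, test_room

for room in REVERB_ROOMS:
    print(make_room_split(room))
# e.g. (['anechoic', 'B', 'C', 'D'], 'A') when testing on room A
```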
Training curves
Root mean square error (RMSE) is used as the performance metric. For each reverberant room, evaluation was performed 3 times to obtain more stable results, and the test dataset was regenerated each time.
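For reference, a minimal sketch of the RMSE computation over per-chunk azimuth estimates, averaged over the 3 repeated evaluations (array names and values are illustrative):

```python
import numpy as np

def azimuth_rmse(est_azi, ref_azi):
    """RMSE between estimated and reference azimuths, in degrees."""
    est_azi = np.asarray(est_azi, dtype=float)
    ref_azi = np.asarray(ref_azi, dtype=float)
    return np.sqrt(np.mean((est_azi - ref_azi) ** 2))

# Each run is an (estimates, references) pair; the room score is the mean RMSE.
runs = [([30., 25., 30.], [30., 30., 30.]),
        ([35., 30., 30.], [30., 30., 30.]),
        ([30., 30., 25.], [30., 30., 30.])]
room_rmse = np.mean([azimuth_rmse(est, ref) for est, ref in runs])
```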
Since the binaural signal is fed directly to the model without extra preprocessing, and speech may contain short pauses, the localization result is reported per chunk rather than per frame. Each chunk consists of 25 consecutive frames.
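A sketch of the chunk-level decision, assuming the network outputs one softmax posterior over azimuth classes per frame and assuming a -90° to +90° grid in 5° steps (the actual grid used by the model may differ):

```python
import numpy as np

AZIMUTHS = np.arange(-90, 91, 5)   # assumed azimuth grid in degrees
CHUNK_LEN = 25                     # frames per chunk

def chunk_decisions(frame_posteriors, chunk_len=CHUNK_LEN):
    """frame_posteriors: (n_frames, n_azimuths) softmax outputs per frame."""
    n_chunks = frame_posteriors.shape[0] // chunk_len
    decisions = []
    for i in range(n_chunks):
        chunk = frame_posteriors[i * chunk_len:(i + 1) * chunk_len]
        # Average the frame posteriors within the chunk, then take the argmax.
        decisions.append(AZIMUTHS[np.argmax(chunk.mean(axis=0))])
    return np.array(decisions)
```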
| Reverberant room | A | B | C | D |
|---|---|---|---|---|
| My result (RMSE, °) | 1.5 | 2.0 | 1.4 | 2.7 |
| Result in paper (RMSE, °) | 1.5 | 3.0 | 1.7 | 3.5 |
- tensorflow-1.14
- pysofa (can be installed via pip)
- BasicTools (in my other repository)