INTERSPEECH2023: Target Active Speaker Detection with Audio-visual Cues

TS-TalkNet

Target Active Speaker Detection with Audio-visual Cues
Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li
NUS; CUHK
INTERSPEECH 2023

Overview of the TS-TalkNet framework. It consists of a feature representation frontend and a speaker detection backend classifier. The frontend includes audio and visual temporal encoders and a speaker encoder. The backend comprises a cross-attention module and a fusion module that combine the audio, visual and speaker embeddings, and a self-attention module that predicts the ASD scores. The lock icon indicates that the speaker encoder is frozen in our framework.
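To make the backend data flow described above more concrete, here is a toy sketch in plain Python. It is not the code from this repository: there are no learned weights, the fusion step is assumed to be a per-frame concatenation with the speaker embedding, and the final scoring head is a stand-in for a trained classifier. The names `attention` and `backend_sketch` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned weights)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def backend_sketch(audio, visual, speaker):
    """Toy data flow: cross-attention -> fusion -> self-attention -> frame scores."""
    # Cross-attention: each stream queries the other one.
    audio_ctx = attention(audio, visual, visual)
    visual_ctx = attention(visual, audio, audio)
    # Fusion (assumption): concatenate both streams with the fixed speaker embedding.
    fused = [a + v + speaker for a, v in zip(audio_ctx, visual_ctx)]
    # Self-attention over the fused sequence.
    refined = attention(fused, fused, fused)
    # Stand-in scoring head: sigmoid of the mean feature of each frame.
    return [1.0 / (1.0 + math.exp(-sum(f) / len(f))) for f in refined]


# Three frames of 2-dimensional audio and visual features, one speaker embedding.
audio = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
visual = [[0.2, 0.1], [0.4, 0.4], [0.1, 0.0]]
speaker = [0.5, -0.5]
scores = backend_sketch(audio, visual, speaker)
```

The sketch returns one score per frame, which mirrors the frame-level ASD prediction described in the overview. The real model uses trained encoders and attention layers.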

TS-TalkNet on the AVA-ActiveSpeaker dataset

Data preparation

We follow the same preprocessing for the AVA dataset as TalkNet. The details can be found here.

The following script can be used to download and prepare the AVA dataset for training.

python train.py --dataPathAVA AVADataPath --download 

AVADataPath is the folder where you want to save the AVA dataset and its preprocessing outputs.

Face-speaker library

Run data_prep/face_enroll_speech.py to construct the face-speaker library and save the enrollment audio for the target speakers to 'enrolPath'.

The face recognition model we used is a ResNet50-Glint model trained on the Glint360K dataset. You can find the pretrained V-Glint.model here.

Training

Then you can train TS-TalkNet on AVA end-to-end with:

python train.py --dataPathAVA AVADataPath --enroll_speech_folder enrolPath
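If you prefer to drive these steps from a Python script, the following minimal sketch builds the command lines shown in this README. It only uses the flags that appear above (`--dataPathAVA`, `--download`, `--enroll_speech_folder`). The helper name `train_command` is hypothetical and not part of the repository.

```python
import sys


def train_command(data_path, enroll_folder=None, download=False):
    """Build the argument list for train.py using only the flags shown in this README."""
    cmd = [sys.executable, "train.py", "--dataPathAVA", data_path]
    if download:
        cmd.append("--download")
    if enroll_folder is not None:
        cmd += ["--enroll_speech_folder", enroll_folder]
    return cmd


# Step 1: download and prepare the AVA dataset.
prepare_cmd = train_command("AVADataPath", download=True)
# Step 2: train with the enrollment audio folder.
train_cmd = train_command("AVADataPath", enroll_folder="enrolPath")

# To actually launch a step from the repository root, pass the list to subprocess:
#   import subprocess
#   subprocess.run(train_cmd, check=True)
```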

exps/exp1/score.txt: output score file; exps/exp1/model/model_00xx.model: trained model.
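To pick up the most recent trained model from the model folder, a small helper like the one below may be handy. It is a sketch that assumes the `xx` in `model_00xx.model` is a zero-padded counter (for example the epoch number). The function name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(model_dir):
    """Return the model_NNNN.model file with the highest number, or None if none exist."""
    pattern = re.compile(r"model_(\d+)\.model$")
    best = None  # (number, path) of the highest-numbered checkpoint seen so far
    for path in Path(model_dir).glob("model_*.model"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming pattern
        number = int(match.group(1))
        if best is None or number > best[0]:
            best = (number, path)
    return None if best is None else best[1]
```

Call it with the path to your model folder. It returns `None` when the folder is missing or contains no matching files.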

Citation

Please cite the following if our paper or code is helpful to your research.

@inproceedings{jiang2023target,
  title={Target Active Speaker Detection with Audio-visual Cues},
  author={Jiang, Yidi and Tao, Ruijie and Pan, Zexu and Li, Haizhou},
  booktitle={Proc. Interspeech},
  year={2023}
}

@inproceedings{tao2021someone,
  title={Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
  author={Tao, Ruijie and Pan, Zexu and Das, Rohan Kumar and Qian, Xinyuan and Shou, Mike Zheng and Li, Haizhou},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  pages={3927--3935},
  year={2021}
}