This repository contains code adapted from TalkNet, an active speaker detection model that determines whether the face on screen is speaking. For more details, please refer to [Paper] [Video_English] [Video_Chinese].
Start from building the environment
sudo apt-get install ffmpeg
conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt
Start from the existing environment
pip install -r requirement.txt
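After either setup path, it can be useful to confirm that the ffmpeg binary is visible from the active environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of this repository.

```python
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command-line tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

An empty list means every listed tool was found on PATH.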
Download the data manifest (manifest.csv) and annotations (av_{train/val/test_unannotated}.json) for the audio-visual diarization benchmark following the Ego4D download instructions.
Note: the default folder for videos and annotations is ./data. If you store them in another directory, please create symbolic links in ./data. The structure should look like this:
data/
- csv/
  - manifest.csv
- json/
  - av_train.json
  - av_val.json
  - av_test_unannotated.json
- split/
  - test.list
  - train.list
  - val.list
  - full.list
- videos/
  - 00407bd8-37b4-421b-9c41-58bb8f141716.mp4
  - 007beb60-cbab-4c9e-ace4-f5f1ba73fccf.mp4
  - ...
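The layout above can be checked, and a directory stored elsewhere can be linked into ./data, with a short script. This is only a sketch: the helper names `missing_entries` and `link_into_data` are hypothetical, and the list of expected entries is copied from the layout above rather than from the repository code.

```python
from pathlib import Path

# Entries copied from the layout above; video file names differ per download,
# so only the videos/ directory itself is checked.
EXPECTED = [
    "csv/manifest.csv",
    "json/av_train.json",
    "json/av_val.json",
    "json/av_test_unannotated.json",
    "split/test.list",
    "split/train.list",
    "split/val.list",
    "split/full.list",
    "videos",
]


def missing_entries(data_root="./data"):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


def link_into_data(source, data_root="./data", name=None):
    """Create a symbolic link to `source` inside data_root (hypothetical helper)."""
    source = Path(source).resolve()
    target = Path(data_root) / (name or source.name)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave any existing entry (including a dangling link) untouched.
    if not (target.exists() or target.is_symlink()):
        target.symlink_to(source, target_is_directory=source.is_dir())
    return target
```

For example, `link_into_data("/mnt/storage/videos")` would create ./data/videos pointing at that directory (the source path here is purely illustrative), and `missing_entries()` then lists whatever is still absent.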
Run the following script to download videos and generate clips:
python utils/download_clips.py
Run the following scripts to preprocess the videos and annotations:
bash scripts/extract_frame.sh
bash scripts/extract_wave.sh
python utils/preprocessing.py
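The three preprocessing commands above can also be chained from Python so that a failure in one step stops the rest. This is a sketch under the assumption that it is run from the repository root; `run_steps` and `PREPROCESS_STEPS` are hypothetical names, not part of the repository.

```python
import subprocess

# The commands listed above, in the order given.
PREPROCESS_STEPS = [
    ["bash", "scripts/extract_frame.sh"],
    ["bash", "scripts/extract_wave.sh"],
    ["python", "utils/preprocessing.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in steps:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)


if __name__ == "__main__":
    run_steps(PREPROCESS_STEPS)
```

The `runner` parameter exists only so the ordering can be exercised without launching the real scripts.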
Then you can train TalkNet on Ego4D using:
python trainTalkNet.py
The results will be saved in exps/exp:
exps/exp/score.txt: output score file
exps/exp/model/model_00xx.model: trained model
exps/exp/val_res.csv: predictions for the val set.
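When choosing a file to pass as the checkpoint later, a small helper can pick one from the model directory. The sketch below assumes that the `xx` in model_00xx.model is an increasing counter (for example an epoch index) and that the highest number is the most recent checkpoint; neither assumption is confirmed here, and the most recent checkpoint is not necessarily the best one. The name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path

# Matches names such as model_0012.model and captures the numeric part.
CHECKPOINT_PATTERN = re.compile(r"^model_(\d+)\.model$")


def latest_checkpoint(model_dir="exps/exp/model"):
    """Return the checkpoint path with the highest numeric suffix, or None."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return None
    best = None
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None:
            continue  # Skip files that do not follow the naming scheme.
        index = int(match.group(1))
        if best is None or index > best[0]:
            best = (index, path)
    return None if best is None else best[1]
```

Checking exps/exp/score.txt remains the more reliable way to decide which checkpoint to use.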
The model pretrained on AVA will be downloaded automatically to data/pretrain_AVA.model.
Our model trained on Ego4D achieves an accuracy (ACC) of 79.27% on the test set.
We can predict active speakers for each person given the face tracks. Please put the tracking results in ./data/track_results. The structure should look like this:
data/
- track_results/
- results/
- 0.txt
- 1.txt
- ...
- v.txt
- results/
Run the following script to make the tracking results compatible with the dataloader (specify the subset from ['full', 'val', 'test']):
python utils/process_tracking_result.py --evalDataType ${SUBSET}
Run the following script to run inference, specifying the checkpoint and subset:
python inferTalkNet.py --checkpoint ${MODEL_PATH} --evalDataType ${SUBSET}
Finally, run the postprocessing script to make the predictions compatible with other components in this diarization benchmark:
python utils/postprocess.py --evalDataType ${SUBSET}
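The three inference commands above share the same subset value, so they can be assembled in one place. The following is a sketch that assumes it is run from the repository root and that all three scripts accept the same subset names listed for the tracking step; `inference_commands` and `run_inference` are hypothetical helpers, not part of the repository.

```python
import subprocess

# Subset names listed above for the tracking-result step; assumed to apply
# to the inference and postprocessing steps as well.
SUBSETS = ("full", "val", "test")


def inference_commands(checkpoint, subset):
    """Build the three commands above for one checkpoint and subset."""
    if subset not in SUBSETS:
        raise ValueError(f"subset must be one of {SUBSETS}, got {subset!r}")
    return [
        ["python", "utils/process_tracking_result.py", "--evalDataType", subset],
        ["python", "inferTalkNet.py", "--checkpoint", str(checkpoint),
         "--evalDataType", subset],
        ["python", "utils/postprocess.py", "--evalDataType", subset],
    ]


def run_inference(checkpoint, subset, runner=subprocess.run):
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in inference_commands(checkpoint, subset):
        runner(cmd, check=True)
```

Validating the subset up front avoids starting the tracking conversion with a value the later steps would not accept.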
Please cite the following paper if our code is helpful to your research.
@article{grauman2021ego4d,
title={Ego4d: Around the world in 3,000 hours of egocentric video},
author={Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and others},
journal={arXiv preprint arXiv:2110.07058},
year={2021}
}