joonson/syncnet_python

Want to find when each speaker's speech starts and ends


Hi, is it possible to extract the time (or position) at which each speaker's speech starts and ends?
I want to extract the speech of each speaker, so I need to know when the speech matched to a speaker begins and when it ends.

Hi, you can use the frame-wise confidence ('fconfm' inside SyncNetInstance.py) and set a threshold. The confidence is indexed by frame number, so you divide the frame index by 25 to get the time in seconds. To make datasets such as LRS and VoxCeleb, we used thresholds of 3 to 4.
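
A minimal sketch of that thresholding step, in case it helps. It assumes you have already pulled the frame-wise confidences out as a plain 1-D sequence with one value per video frame for a single face track; the function name `speech_segments` and the default threshold of 3.5 are illustrative choices, not part of the repository.

```python
def speech_segments(confidences, threshold=3.5, fps=25.0):
    """Return (start_sec, end_sec) spans where confidence >= threshold.

    `confidences` is assumed to hold one value per video frame
    (e.g. values taken from 'fconfm'); `fps` is the frame rate used
    to convert frame indices to seconds.
    """
    segments = []
    start = None  # frame index where the current span began, if any
    for i, c in enumerate(confidences):
        if c >= threshold and start is None:
            start = i  # span opens at this frame
        elif c < threshold and start is not None:
            # span closes just before frame i
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        # span runs to the end of the track
        segments.append((start / fps, len(confidences) / fps))
    return segments


# Hypothetical confidences: 1 s below threshold, 2 s above, 1 s below.
example = [1.0] * 25 + [5.0] * 50 + [0.5] * 25
print(speech_segments(example))  # [(1.0, 3.0)]
```

Running this once per face track gives start and end times for each speaker. A threshold closer to 4 keeps only the most confident frames, while one closer to 3 is more permissive.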