Question about voice-to-face matching evaluation
audio-visual opened this issue · 1 comments
audio-visual commented
May I ask how the evaluation described in the paper as "Our results are given by replacing the probe voice embeddings by the embeddings of the generated face." is actually carried out?
Do you first compute three embeddings with the fixed face embedding network, (1) the generated face embedding, (2) the true speaker's face embedding, and (3) the imposter's face embedding, and then compare the cosine similarity between 1 and 2 with the cosine similarity between 1 and 3?
ydwen commented
Exactly!
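
For readers who want to reproduce this, the comparison confirmed above could be sketched roughly as follows. This is only an illustration, not the repository's evaluation code: the function names `cosine_similarity`, `matches_true_speaker`, and `matching_accuracy` are hypothetical, the embeddings are assumed to be already extracted by the fixed face embedding network, and counting a trial as correct when the true-speaker similarity is strictly higher is an assumption about the decision rule.

```python
import math
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def matches_true_speaker(generated: Vector, true_face: Vector, imposter: Vector) -> bool:
    """Compare sim(1, 2) with sim(1, 3), where 1 is the generated face embedding.

    Assumption: the trial counts as correct when the generated face is
    strictly closer to the true speaker's face than to the imposter's.
    """
    return cosine_similarity(generated, true_face) > cosine_similarity(generated, imposter)


def matching_accuracy(trials: Iterable[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (generated, true, imposter) trials matched correctly."""
    results = [matches_true_speaker(g, t, i) for g, t, i in trials]
    if not results:
        raise ValueError("no trials given")
    return sum(results) / len(results)
```

With toy two-dimensional embeddings, `matches_true_speaker([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])` picks the true speaker, because the generated embedding points almost in the same direction as the true face embedding and is orthogonal to the imposter's.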