DigitalPhonetics/speaker-anonymization

Is affective/emotional information preserved?


Hello!

I am interested in using your tool to anonymize audio from a multimodal dataset, but I wanted to know whether your method preserves the audio properties that could be used for speech emotion recognition. I see a brief mention suggesting so in the "Prosody Is Not Identity" paper (end of Section 2.2), but no formal comparison showing that this actually holds.

Have you tried, or are you planning, to run emotion recognition pipelines on the original and anonymized data to check whether the performance degrades?

Thanks a lot for the great work and this interesting tool!

Hi,

We have not yet checked how well the anonymization preserves the properties needed for emotion recognition. We are planning to do this at some point, but I cannot give you an estimate of when that will be at the moment.

If you want to use the anonymization in an emotion recognition application, you should definitely use the latest model, which uses prosody cloning. In theory, if prosody is the main property that carries emotional information, emotion recognition should still be possible after anonymization. You might get better results if you train or fine-tune your recognizer on anonymized or otherwise synthesized data. If you do test how much the recognition performance is affected by anonymization, it would be great if you could share your results.
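For concreteness, such a comparison could look like the minimal sketch below. It assumes you have paired original/anonymized WAV files with gold emotion labels; the checkpoint (`superb/wav2vec2-base-superb-er`, a public speech emotion recognition model trained on IEMOCAP) and the file paths are placeholder assumptions, not part of this repo.

```python
from transformers import pipeline

# Public SER checkpoint (IEMOCAP labels: "neu", "hap", "ang", "sad").
# An assumed off-the-shelf model, not something tied to this repository.
ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

# Placeholder (original_path, anonymized_path, gold_label) triples.
pairs = [
    ("data/original/utt1.wav", "data/anonymized/utt1.wav", "neu"),
    ("data/original/utt2.wav", "data/anonymized/utt2.wav", "ang"),
]

def accuracy(samples):
    """Top-1 accuracy of the recognizer over (path, gold_label) pairs."""
    hits = sum(ser(path)[0]["label"] == gold for path, gold in samples)
    return hits / len(samples)

acc_orig = accuracy([(orig, y) for orig, _, y in pairs])
acc_anon = accuracy([(anon, y) for _, anon, y in pairs])
print(f"original:   {acc_orig:.3f}")
print(f"anonymized: {acc_anon:.3f} (delta: {acc_anon - acc_orig:+.3f})")
```

The gap between the two accuracies gives a first estimate of how much emotional information the anonymization removes; fine-tuning the recognizer on anonymized data, as suggested above, would then show how much of that gap is recoverable.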

Thank you for your reply! I will keep the tool in mind and let you know if I make progress with the analysis :). Cheers.