Global Rhythm Style Transfer Without Text Transcriptions

Problem statement

Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information.
Two major components of prosody are pitch and rhythm.
AutoPST disentangle global prosody style from speech without relying on any text transcriptions.

Our Implementation

We have implemented a Prosody style transfer system for speech using autoencoders with a novel resampling module. We trained models with MFCCs as input and demonstrated performance in different fast, slow and abnormal settings. Taking inspiration from the paper on Rhythm Style Transfer (https://arxiv.org/pdf/2106.08519.pdf), we have experimented for the following cases:-

Fast speaker to slow speaker
Slow speaker to fast speaker
Abnormal speech pattern

Aakriti28/GlobalStyleTransferASR

Global Rhythm Style Transfer Without Text Transcriptions

Problem statement

Our Implementation