yl4579/StyleTTS

Inference exact time for each word

enla51 opened this issue · 5 comments

enla51 commented

Hi,

Is it possible to output the exact time at which each token is pronounced in the sound file? For example, if the input sentence is "How are you?", does the model output contain information such as: the second token, 'are', starts being pronounced at second 3?

Thank you very much for this amazing project!

I think you can get this information from the alignment matrix pred_aln_trg in the inference notebook: https://github.com/yl4579/StyleTTS/blob/main/Demo/Inference_LibriTTS.ipynb
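As a rough illustration of reading timing out of an alignment matrix, here is a minimal sketch. It assumes the matrix has one row per token and one column per frame, with a 1 marking each frame assigned to that token. The toy matrix and the helper name token_spans are made up for this example; with a real torch tensor, pred_aln_trg.tolist() would give the same nested-list form.

```python
# Toy stand-in for pred_aln_trg: rows are tokens, columns are frames,
# and a 1 marks the frames assigned to that token (assumed layout).
aln = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]


def token_spans(alignment):
    """Return (start_frame, num_frames) for each token row.

    start_frame is None for a row with no active frames.
    """
    spans = []
    for row in alignment:
        active = [i for i, value in enumerate(row) if value]
        start = active[0] if active else None
        spans.append((start, len(active)))
    return spans


spans = token_spans(aln)
print(spans)  # [(0, 2), (2, 3), (5, 1)]
```

These are frame indices, not seconds. Turning them into seconds requires knowing how much audio one alignment frame covers.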

enla51 commented

Thank you very much!

Hi @yl4579,

I hope you're doing well. I'm currently working on implementing word duration and start time calculations for a speech synthesis task using your model. However, I'm encountering some difficulties in ensuring the accuracy of these calculations.

Here's a brief overview of what I'm doing:

  1. Duration Prediction: I'm using the model to predict token durations, and then constructing an alignment matrix (pred_aln_trg) based on these predicted durations.
  2. Word Durations and Start Times: I'm trying to sum the durations of tokens corresponding to each word and convert these frame counts to time durations using the hop_size and samplerate.
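For reference, the two steps above can be sketched as follows. This is only a sketch under stated assumptions, not the notebook's code: the hop size, sample rate, and space token id are the values quoted in this thread, the function name word_times and the example token ids and frame counts are hypothetical, and it assumes one alignment frame lasts hop_size / samplerate seconds. It tracks a running frame position, so space tokens advance the clock without being counted as part of a word.

```python
HOP_SIZE = 300       # samples per frame (value quoted in this thread)
SAMPLE_RATE = 24000  # samples per second (value quoted in this thread)
SPACE_ID = 16        # space token id (value quoted in this thread)


def word_times(token_ids, frame_counts, sec_per_frame):
    """Return a list of (start_sec, end_sec), one pair per word.

    Space tokens advance the clock but are not counted as part of a word.
    """
    words = []
    frame = 0          # running position in frames
    word_start = None  # start frame of the word in progress, if any
    for tok, n_frames in zip(token_ids, frame_counts):
        if tok == SPACE_ID:
            if word_start is not None:
                # The word ends where the space begins.
                words.append((word_start * sec_per_frame, frame * sec_per_frame))
                word_start = None
        elif word_start is None:
            word_start = frame
        frame += n_frames
    if word_start is not None:
        words.append((word_start * sec_per_frame, frame * sec_per_frame))
    return words


# Hypothetical input: two words separated by one space token.
token_ids = [50, 51, 16, 52]
frame_counts = [10, 30, 8, 32]
times = word_times(token_ids, frame_counts, HOP_SIZE / SAMPLE_RATE)
for start, end in times:
    print(f"{start:.3f}s -> {end:.3f}s")
```

With tensors from the model, calling .tolist() on the token ids and on the per-token frame counts first keeps the loop in plain Python numbers.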

Despite my efforts, the calculated durations and start times seem too short compared to the expected values.

Here is a snippet of my current approach:

hop_size = 300      # samples per frame
samplerate = 24000  # samples per second
# Calculate word durations
word_durations = []
current_word_duration = 0
frame_counts = pred_aln_trg.sum(dim=1).cpu().numpy()
for token_index, token in enumerate(tokens[0]):
    current_word_duration += frame_counts[token_index]
    if token == 16:  # Token 16 represents a space
        word_durations.append(current_word_duration * hop_size / samplerate)
        current_word_duration = 0
if current_word_duration > 0:
    word_durations.append(current_word_duration * hop_size / samplerate)

# Calculate start times
start_times = [0]
for i in range(1, len(word_durations)):
    start_times.append(start_times[-1] + word_durations[i - 1])

print(f"Word durations (seconds): {word_durations}")
print(f"Start times (seconds): {start_times}")

Despite these steps, the computed durations and start times are consistently too short. Could you provide any insights or suggestions on where I might be going wrong or how to improve the accuracy of these calculations?

I appreciate your help and look forward to your response.

Best regards,
Alessandro
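The snippet above converts frames to seconds with hop_size / samplerate. One way to check that factor without assuming it is to measure the time per alignment frame from the synthesized audio itself: divide the waveform duration by the total number of alignment frames. The sketch below does only that arithmetic. The function name seconds_per_frame is hypothetical, and the numbers are made up for illustration, not measured from StyleTTS.

```python
def seconds_per_frame(num_samples, sample_rate, total_frames):
    """Time covered by one alignment frame, measured from the output audio."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return num_samples / sample_rate / total_frames


# Hypothetical numbers for illustration only.
num_samples = 48000   # length of the generated waveform, in samples
total_frames = 80     # number of columns in the alignment matrix

measured = seconds_per_frame(num_samples, 24000, total_frames)
nominal = 300 / 24000  # hop_size / samplerate, as used in the snippet

print(f"measured: {measured:.4f} s/frame")
print(f"nominal:  {nominal:.4f} s/frame")
print(f"ratio:    {measured / nominal:.2f}")
```

If the ratio is not close to 1, hop_size / samplerate is not the time covered by one alignment frame, which could happen, for example, if the model upsamples somewhere between the alignment and the waveform. In that case, multiplying frame counts by the measured value gives times that match the audio length by construction.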

@alessandropettenuzzo96 Any luck on this implementation? Would highly appreciate it if you could share some insights.

@sinhprous Is there any implementation to fetch word level timestamps of generated audio? It would be really helpful.