lumaku/ctc-segmentation

The problem about last phoneme alignment

taylorlu opened this issue · 9 comments

Hi, thanks for this great work.
I have tried to integrate it on top of my ASR module. Most of the phonemes were aligned perfectly except the last one, as you can see below.

[figure: ctc1]

[figure: ctc2]

The top figure is the original waveform, and the bottom is the alignment result.
I found that the waveform near the end was cut off, and the index_duration was correct, because all phonemes except the last were aligned accurately.

So how can I solve this problem? Thanks in advance.

CTC segmentation can only see the CTC activations.
Therefore, alignments are only as good as the CTC model predictions.

From my experience on LibriVox and TED-LIUM, the alignments are accurate in most cases.
In only a few cases they are shifted (sometimes with and sometimes without an impact on the confidence score).
Sometimes with a tendency to start too early: due to non-speech audio noise that carries over into the CTC activations, which was observable on TED-LIUM.
Sometimes with a tendency to end too soon: I think this happens when the CTC network triggers early and the activation is distributed over multiple time steps. This primarily concerns the last utterance, from which the backtracking starts.
I observed labels shifted by up to 10 frames (~300 ms).
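To relate frame shifts to time, a hedged sketch of the conversion (the 30 ms per-frame value, i.e. index_duration, is only illustrative; it depends on the STFT hop size and encoder subsampling of your model):

```python
def frames_to_seconds(n_frames, index_duration=0.03):
    """Convert a shift measured in CTC output frames to seconds.

    index_duration is the time covered by one CTC output frame;
    0.03 s is an assumed example value, not a universal constant.
    """
    return n_frames * index_duration

# A 10-frame shift at 30 ms/frame corresponds to roughly 300 ms.
shift_s = frames_to_seconds(10)
```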

How to solve this?
From experience, I think it should be safe in most cases to simply add about 100-300 ms to the end of the last utterance.
Bakhturina et al. added a threshold on the mean absolute value of the signal. This works well if you have clean data with low noise.
Also, any method that requires the audio has to be applied outside of CTC segmentation, because the algorithm cannot see the audio.
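A minimal sketch of the first suggestion (hypothetical helper; assumes a list of (start_s, end_s, score) tuples in seconds, and clamps the padded end to the audio length so it never runs past the file):

```python
def pad_last_segment(segments, audio_len_s, pad_s=0.2):
    """Extend the end time of the last segment by pad_s seconds
    (here 200 ms, an assumed value in the suggested 100-300 ms range),
    clamped to the total audio length."""
    if not segments:
        return segments
    start, end, score = segments[-1]
    segments[-1] = (start, min(end + pad_s, audio_len_s), score)
    return segments

segments = [(0.0, 1.2, -0.1), (1.3, 2.4, -0.2)]
padded = pad_last_segment(segments, audio_len_s=2.5)
```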

Also, in your case you infer phonemes; for a better time resolution, do not infer phonemes, but characters directly.

By the way, what is the ground truth for your sentence? 也是为...?

The phoneme sequence is y ie3 sh iii4 w uei4 y iao4 y ian3 d e5 n ian2 q ing1 x van2 sh ou3,
and the sentence is 也是位耀眼的年轻选手, which is taken from the AISHELL-3 dataset.
The reason I do not use characters directly is that the audio dataset is too small to cover most natural speech cases, and the same acoustic features may map to many different homophones, so I think the best choice is to train a separate language model on top of this acoustic model.

The timings are usually determined by the earliest possible ending time:

def compute_time(index, align_type):
    """Compute start and end time of utterance.

    :param index: frame index value
    :param align_type: one of ["begin", "end"]
    :return: start/end time of utterance in seconds
    """
    middle = (timings[index] + timings[index - 1]) / 2
    if align_type == "begin":
        return max(timings[index + 1] - 0.5, middle)
    elif align_type == "end":
        return min(timings[index - 1] + 0.5, middle)

I assume that the CTC network triggered the activation for the last phoneme of "手" early, between \sh\ and \ou\. What is the average duration of a character in Chinese? As a practical solution, I would simply add half a phoneme duration to the ending times.
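That padding could be sketched as follows (hypothetical helper; assumes per-phoneme (start_s, end_s) tuples and estimates the average phoneme duration from the alignment itself rather than from a known language statistic):

```python
def pad_end_times(segments):
    """Add half the average phoneme duration to every end time.

    segments: list of (start_s, end_s) tuples, one per phoneme.
    The average duration is estimated from the segments themselves.
    """
    durations = [end - start for start, end in segments]
    half_avg = sum(durations) / len(durations) / 2
    return [(start, end + half_avg) for start, end in segments]
```

Note that padding every end time can make a segment overlap the start of the next one; clamp against the following start if your downstream duration extraction requires disjoint boundaries.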

Also, I use this ctc-segmentation tool to do forced alignment, a necessary step to infer phoneme durations before training my non-autoregressive TTS model.

Cool! Let me know how it worked!

Also, CTC segmentation does not generate forced alignments; instead, it gives you the most probable alignment together with a probability of how well it was aligned. Sometimes the alignment does not succeed because text and audio do not match. You should remove bad utterances with a low confidence score.
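A minimal sketch of that filtering (hypothetical helper; assumes (start_s, end_s, score) tuples where score is the confidence log-probability returned by the segmentation, and a threshold you would tune on your own data):

```python
def drop_bad_utterances(segments, min_score=-2.0):
    """Keep only utterances whose alignment confidence score
    (a log-probability, higher is better) reaches min_score.
    The -2.0 default is an assumed starting point, not a recommendation."""
    return [seg for seg in segments if seg[2] >= min_score]

good = drop_bad_utterances([(0.0, 1.1, -0.4), (1.2, 2.0, -7.3)])
```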

Great, the idea seems right. I have tested a batch of audio files in Chinese and found that every sample has the same issue, but when I switched to an English dataset, the issue disappeared.

My TTS model has the same architecture as FastSpeech, except that it injects a speaker embedding into the input at every timestep.

I also think this tool is more suitable for me, because the Montreal Forced Aligner is a bit intricate, being based on Kaldi, and I hope to train all models in TensorFlow only, without any other framework.
Now it has solved my problem easily.

Thanks again for your timely response!

Now I have a project that needs phoneme alignment, also on initials and finals (声韵母). Could you share some pointers on this CTC alignment approach? Thanks @taylorlu