burchim/EfficientConformer

mean loss inf - batch loss: inf

debasish-mihup opened this issue · 9 comments

During training I have around 9000+ batches, each containing 32 audio segments. After running the training for 1 epoch, the mean loss and batch loss haven't changed from inf. Can you give me some idea of where the problem might be?

Hi,

CTC loss computation requires T >= U,
where T is the length of the model output
and U is the length of the target sequence.
The loss will return inf if T < U.

Your tokenized sequences might be too long.
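As a quick standalone illustration (a toy sketch, not the repository's code), this is exactly how torch.nn.CTCLoss behaves: any item whose target is longer than the model output gets an infinite loss, which then drives the batch mean to inf.

import torch
import torch.nn as nn

# Toy sketch: two samples, both with T = 5 output frames.
# The first target has U = 3 <= T; the second has U = 8 > T, which is infeasible for CTC.
torch.manual_seed(0)
T, N, vocab_size = 5, 2, 10                     # output length, batch size, vocab (blank = 0)
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1)

targets = torch.randint(1, vocab_size, (N, 8))  # padded target ids, blank excluded
input_lengths = torch.tensor([5, 5])            # T per sample
target_lengths = torch.tensor([3, 8])           # U per sample

ctc = nn.CTCLoss(blank=0, reduction="none")
print(ctc(log_probs, targets, input_lengths, target_lengths))
# -> tensor([<finite>, inf]); averaging over the batch then gives inf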

@burchim I am using EfficientConformerCTCSmall.json

And have changed the config below slightly to allow larger audio segments:

   "train_audio_max_length": 320000, # Allows up to 20 sec audio files assuming 16KHz sample rate 
    "train_label_max_length": 256000, # Kept at default.

With the same dataset, I am able to run the ConformerCTCSmall.json training without any problem.

beam_size has been changed to 8, and the other parameters are all at their defaults.

Most of the audio segments are short (around 7-8 seconds) and the transcription for each file is also short. After tokenization, each sentence is around 60-70 tokens long.

So, any input on what the maximum number of tokens should be so that the loss does not shoot to inf?

Hi,

Yes, the problem should be solved by using the ConformerCTCSmall.json config. The original Conformer downsamples the mel spectrogram input by a factor of 4 (versus 8 for the Efficient Conformer).
Note that the train_audio_max_length config change will also result in longer target sequences.
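To put rough numbers on the two downsampling factors (a back-of-the-envelope sketch assuming a 10 ms mel-spectrogram hop at 16 kHz; the exact front-end parameters in the repo may differ slightly), you can estimate the largest T a clip can produce, which is also the ceiling on the tokenized target length:

# Rough upper bound on the model output length T for a clip of num_samples samples,
# assuming a 10 ms mel hop (160 samples at 16 kHz) before the encoder's subsampling.
def max_output_frames(num_samples, sample_rate=16000, hop_ms=10, downsampling=8):
    mel_frames = num_samples // (sample_rate * hop_ms // 1000)
    return mel_frames // downsampling

# train_audio_max_length = 320000 samples (20 s at 16 kHz):
print(max_output_frames(320000, downsampling=4))  # ~500 frames for the original Conformer
print(max_output_frames(320000, downsampling=8))  # ~250 frames for the Efficient Conformer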

You can try to debug by adding the following line here:
print("T:", f_len.tolist(), "U:", y_len.tolist())
This will print the lists of model output lengths (T) and target lengths (U) before computing the CTC loss.
T < U for any one of the samples will result in an inf batch loss.
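If scanning the two printed lists by eye is tedious, the same debug print can be extended a little (same f_len / y_len tensors as above) to flag the offending items directly:

# List the batch items that will produce an inf CTC loss (T < U).
bad = [(i, t, u) for i, (t, u) in enumerate(zip(f_len.tolist(), y_len.tolist())) if t < u]
if bad:
    print("Samples with T < U (index, T, U):", bad)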

I think you meant adding the print statement inside the LossCTC class's forward method rather than the LossRNNT class's forward method. I did that, and the printed lists for T and U are the same length. The batch loss is inf for all batches. I have copied the actual print output below.

Also, by T < U, did you mean T[i] < U[i] where i = 0 to (batch_size - 1), and that if this condition holds for even one value of i, then the entire batch loss will be inf? Is my understanding correct?

T = [205, 194, 185, 180, 175, 174, 166, 162, 156, 152, 129, 105, 103, 92, 74, 74, 74, 73, 71, 58, 57, 49, 47, 39, 35, 33, 32, 31, 22, 8, 7, 7]
U = [169, 166, 188, 157, 177, 128, 183, 123, 144, 175, 134, 95, 78, 104, 77, 80, 101, 89, 69, 13, 52, 53, 4, 38, 36, 48, 30, 35, 22, 5, 6, 8]


T = [191, 189, 186, 183, 182, 177, 177, 145, 131, 124, 124, 107, 85, 76, 76, 71, 68, 62, 62, 62, 57, 55, 48, 44, 34, 29, 24, 22, 20, 13, 11, 8]
U = [125, 222, 162, 138, 118, 162, 148, 129, 127, 108, 55, 141, 60, 88, 97, 78, 64, 48, 65, 71, 53, 70, 23, 57, 42, 36, 25, 23, 22, 14, 13, 5]
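Running a quick element-wise comparison on the first batch (values as printed above):

T = [205, 194, 185, 180, 175, 174, 166, 162, 156, 152, 129, 105, 103, 92, 74, 74, 74, 73, 71, 58, 57, 49, 47, 39, 35, 33, 32, 31, 22, 8, 7, 7]
U = [169, 166, 188, 157, 177, 128, 183, 123, 144, 175, 134, 95, 78, 104, 77, 80, 101, 89, 69, 13, 52, 53, 4, 38, 36, 48, 30, 35, 22, 5, 6, 8]
print([(i, t, u) for i, (t, u) in enumerate(zip(T, U)) if t < u])
# e.g. index 2 has T=185 < U=188 and index 4 has T=175 < U=177, so this batch's loss is inf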

Yes, this is what I meant.
You can try setting zero_infinity to True:
self.loss = nn.CTCLoss(blank=0, reduction="none", zero_infinity=True)
This should solve your problem by ignoring the items in the batch that cause an inf loss.
Using a larger vocab size for the tokenizer should also help to reduce the target lengths.
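Continuing the earlier toy sketch, zero_infinity=True simply replaces the infeasible item's loss (and its gradient) with zero, so it no longer pushes the batch mean to inf:

import torch
import torch.nn as nn

# Same toy setup as before: the second sample has U = 8 > T = 5.
torch.manual_seed(0)
log_probs = torch.randn(5, 2, 10).log_softmax(dim=-1)
targets = torch.randint(1, 10, (2, 8))
input_lengths = torch.tensor([5, 5])
target_lengths = torch.tensor([3, 8])

args = (log_probs, targets, input_lengths, target_lengths)
print(nn.CTCLoss(blank=0, reduction="none")(*args))                      # tensor([<finite>, inf])
print(nn.CTCLoss(blank=0, reduction="none", zero_infinity=True)(*args))  # tensor([<finite>, 0.])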

I did increase the vocab size to 1000 and also set zero_infinity=True. The zero_infinity change by itself removed the inf loss problem. But what I can see after training for around 12-15 epochs is that the loss is not decreasing; the loss values stay the same. My training set has around 295k audio segments, so after 12-15 epochs more than 50k training steps have already been completed. My belief is that the model is not converging to a good solution due to some issue, or can I continue my training? Attaching below the screenshots of my TensorBoard.

[TensorBoard screenshots 1-3]

Looks like the downsampling and the nature of the data did not allow the CTC model to train properly. I trained the Efficient Conformer Transducer model, which worked OK. If the author has any suggestions, I can try them. Closing the issue.

Hi,

The format of your text data does not seem appropriate. The transcriptions of your audio samples should be formatted as plain text. Otherwise, the byte-pair-encoding tokenizer may not be trained correctly. This will also make it easier to understand the output of the model.
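For the tokenizer side, this is roughly what training on plain-text transcripts with a larger vocabulary looks like (a sketch assuming a SentencePiece BPE tokenizer; the file names and options are placeholders, not the repo's actual tokenizer script):

import sentencepiece as spm

# Train a BPE tokenizer on plain-text transcriptions, one utterance per line,
# with no IDs, markup or timestamps in the file (placeholder paths).
spm.SentencePieceTrainer.train(
    input="transcripts_plain.txt",
    model_prefix="tokenizer_bpe_1000",
    vocab_size=1000,                 # larger vocab -> shorter tokenized targets
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="tokenizer_bpe_1000.model")
print(sp.encode("example transcription", out_type=int))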

Sorry, I did not reply to this comment earlier. The reason you are seeing the transcript data in that form is that it is phoneme data for the transcripts (the phonemes have been encoded as single-character symbols). The reason for this is to handle segments containing input in multiple languages. I did run the tokenizer on this encoded character script but was still unable to get the CTC model to converge. I was able to train the Efficient Conformer Transducer model.