mravanelli/pytorch-kaldi

No Decoding Output

kevinmchu opened this issue · 20 comments

I'm running the TIMIT LSTM on custom features, and I obtained the following error in my log.log file:

[screenshot of log.log showing "Done 0 lattices"]

I checked my best path file, but did not see any error messages or warnings.

[screenshot of the best path file]

I've also double checked my cfg file, and all of the directories exist. I'm running Ubuntu 16.04, CUDA 10.2, PyTorch 1.7.1. What am I doing wrong?

Hi, as we can see from the log, "Done 0 lattices", so something went wrong during the forward phase. I would recommend removing all the directories related to the decoding and removing the forward files generated by PyTorch-Kaldi (the ones created when forwarding the test set). Then start again and check that the forward process goes smoothly.
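Something along these lines should do it (the folder and file names below are placeholders, adjust them to the output_folder and decoding directory of your run):

    # placeholders: adjust to your cfg's output_folder and decoding directory names
    rm -rf exp/TIMIT_LSTM_custom/decode_TIMIT_test_out_dnn2
    rm -f exp/TIMIT_LSTM_custom/exp_files/forward_TIMIT_test_*_out_dnn2_to_decode.ark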

Thanks for the quick reply. I removed the decoding directories and forward files and reran the model on the test set, but I obtained the same error as before.

Does the forward phase run smoothly? Can you see it?

This is the output I obtain when I run the model on the test data:

  • Reading config file......OK!
  • Chunk creation......OK!

Testing TIMIT_test chunk = 1 / 1
[========================================] 100% Forwarding | (Batch 192/192))
Decoding TIMIT_test output out_dnn2

Does this indicate that the forward phase ran smoothly?

Yep. Does the final.mdl model exist? Can you check its size? Also, you could try to manually run the Kaldi command line that fails.

Yes, final.mdl exists and has a size of 5.2MB.

As for manually re-running latgen-faster-mapped, where can I find the values of $thread_string, $min_active, $max_active, etc.?

Also, I was able to run the decoder for an LSTM trained on MFCCs, which makes me think there is something wrong with my features.

Weird ..

@mravanelli Do you have any insight about this issue?

It is most likely that the forwarded data are empty. How fast was the forward phase? If it was super quick, it might indicate that your input features are indeed not good. You should definitely try to call the command manually and inspect the actual output, e.g. checking whether the lattices are empty.
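The call should look roughly like this (a sketch only: the exact option values come from the decoding settings of your config, and $graph_dir, $ali_dir, and $forward_ark are placeholders for your graph directory, the folder holding final.mdl, and the *_to_decode.ark produced by the forward phase):

    # sketch of the decoding call; option values and paths are placeholders
    latgen-faster-mapped --min-active=200 --max-active=7000 --max-mem=50000000 \
      --beam=13.0 --lattice-beam=8.0 --acoustic-scale=0.2 --allow-partial=true \
      --word-symbol-table=$graph_dir/words.txt \
      $ali_dir/final.mdl $graph_dir/HCLG.fst \
      "ark:$forward_ark" "ark:|gzip -c > lat.1.gz"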

The forward phase lasted ~10 minutes. I ran latgen-faster-mapped without any errors, but the lattices were empty.

So in TIMIT_test output out_dnn2, all the lat.*.gz files are empty?

If so, please check that your $finalfeats (I don't know where you saved them) are OK (not empty).
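Two quick checks, assuming the placeholder paths below:

    # an empty gzip archive is only ~20 bytes
    ls -l decode_TIMIT_test_out_dnn2/lat.*.gz
    # frames per utterance and feature dimension of the forwarded posteriors
    feat-to-len "ark:$forward_ark" ark,t:- | head
    feat-to-dim "ark:$forward_ark" -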

I just realized I forgot to change the fea_name in the configuration file. However, when I changed fea_name to the correct name, I obtained this error:

ERROR: the input "mfcc" is not defined before (possible inputs are ['xxxx'])
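(For reference: the name set in fea_name has to match the input name used by compute(...) in the [model] section of the cfg. Assuming a cfg path like the one below, a quick way to spot such a mismatch is:)

    # hypothetical cfg path; both greps should refer to the same feature name
    grep -n "fea_name" cfg/TIMIT_LSTM_custom.cfg
    grep -n "compute(" cfg/TIMIT_LSTM_custom.cfg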

I removed all directories related to the trained model and re-trained for 1 epoch. However, I am not getting a final.mdl file when training finishes. The log.log file does not show any errors or warnings. I did receive this warning on the terminal:

/home/lab/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py:1717: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=0, right=0
self.set_xlim([v[0], v[1]], emit=emit, auto=False)

Does this explain the missing final.mdl file?

No, the final.mdl only appears if you reach the number of epochs given in the config file.

To clarify, in the cfg file I set n_epochs_tr to 1 but still did not get a final.mdl. Is there something else I am supposed to change if I only want to train for 1 epoch?

I solved the problem with the missing final.mdl. I split run_exp.py into training and testing scripts, and it turns out I needed to run the testing script for final.mdl to appear.

However, I am experiencing the same problem as before during decoding: the forward phase runs smoothly, but I do not obtain any output. My lat.1.gz file is only 20 bytes. forward_TIMIT_test_ep0_ck0_out_dnn2_to_decode.ark is 2.1 GB, which seems reasonable. Any other ideas?

@TParcollet @mravanelli I just wanted to follow up and ask if you have any more insight about this issue.

I figured out the problem. The issue was a mismatch between my lab_graph and lab_folder, which results in a segmentation fault and hence no decoding output.
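For anyone hitting the same thing: a check along these lines might catch the mismatch earlier, since the number of pdfs in the final.mdl under lab_folder should match the output dimension of the forwarded ark (paths are placeholders):

    # placeholders: $lab_folder from the cfg, forward ark from the forward phase
    hmm-info $lab_folder/final.mdl | grep pdfs
    feat-to-dim "ark:forward_TIMIT_test_ep0_ck0_out_dnn2_to_decode.ark" -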