Question about cache/wsj_8k_zeromean/11-13.1/wsj0/doc/indices/train/tr_s_wv1.ndx
Dear fgnt,
I'm trying to create the SMS-WSJ dataset, but I have run into some problems.
- First problem
While running the following command from the Makefile:
python -m sms_wsj.database.wsj.create_json with json_path=$(JSON_DIR)/wsj_8k_zeromean.json database_dir=$(WSJ_8K_ZEROMEAN_DIR) as_wav=True
I got the following error:
File "sms_wsj/sms_wsj/database/wsj/create_json.py", line 146, in process_example_paths
'kaldi_transcription': transcript['kaldi'][example_id]
KeyError: '401c0202'
I found that the cause is cache/wsj_8k_zeromean/11-13.1/wsj0/doc/indices/train/tr_s_wv1.ndx.
In my case, 401c0202.wv1 exists only in WSJ0_root/11-3.1/wsj0/si_tr_s/401/, but tr_s_wv1.ndx has two lines for 401c0202:
- 11_2_1:wsj0/si_tr_s/401/401c0202.wv1
- 11_3_1:wsj0/si_tr_s/401/401c0202.wv1
Thus, when the program tried to access 11-2.1/wsj0/si_tr_s/401/401c0202.wv1, it stopped.
What is the cause of this problem: the code, or my copy of WSJ0?
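To see which index entries do not resolve to a file on a given WSJ0 copy, a check along the following lines may help. It is only a sketch: the mapping from the disc id `11_2_1` to the directory `11-2.1` is inferred from the two paths quoted above, the `;` comment marker is an assumption about the index format, and the helper names (`disc_to_dir`, `resolve_ndx_line`, `missing_entries`) are hypothetical and not part of sms_wsj.

```python
from pathlib import Path


def disc_to_dir(disc_id: str) -> str:
    """Map a disc id such as '11_2_1' to the directory name '11-2.1'."""
    major, minor, rev = disc_id.strip().split("_")
    return f"{major}-{minor}.{rev}"


def resolve_ndx_line(line: str, wsj_root: Path) -> Path:
    """Turn '11_2_1:wsj0/si_tr_s/401/401c0202.wv1' into a path below wsj_root."""
    disc_id, rel_path = line.strip().split(":", 1)
    return wsj_root / disc_to_dir(disc_id) / rel_path.strip()


def missing_entries(ndx_lines, wsj_root: Path):
    """Return the index lines whose audio file is not present on disk."""
    missing = []
    for line in ndx_lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with ';' carry no entry.
        if not line or line.startswith(";"):
            continue
        if not resolve_ndx_line(line, wsj_root).exists():
            missing.append(line)
    return missing
```

Calling `missing_entries(open(ndx_path), Path(wsj_root))` would then list entries such as the `11_2_1` line, which has no matching file on my copy.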
- Second problem
I always get the following warning.
WARNING:Create wsj json:No observers have been added to this run
How can I solve this problem?
- Third problem
The number of files differs from what you expected.
expected -> 'pl': 3, 'ndx': 106, 'ptx': 3547, 'dot': 3585, 'txt': 256
I found -> 'pl': 3, 'ndx': 106, 'ptx': 3073, 'dot': 3095, 'txt': 208
Does this cause a big problem?
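For reference, the gap between the two lines above can be computed with a short sketch like the one below. The expected counts are copied from the message quoted above; `count_by_extension` and `shortfall` are hypothetical helper names, and lower-casing the file extension is an assumption about how the files should be matched, not how sms_wsj counts them.

```python
from collections import Counter
from pathlib import Path

# Expected counts as quoted above.
EXPECTED = {"pl": 3, "ndx": 106, "ptx": 3547, "dot": 3585, "txt": 256}


def count_by_extension(root: Path, extensions=EXPECTED) -> Counter:
    """Count files below root for each extension of interest."""
    counts = Counter()
    for path in root.rglob("*"):
        ext = path.suffix.lstrip(".").lower()  # assumption: case-insensitive match
        if path.is_file() and ext in extensions:
            counts[ext] += 1
    return counts


def shortfall(found, expected=EXPECTED) -> dict:
    """Return expected minus found for every extension whose count differs."""
    return {
        ext: n - found.get(ext, 0)
        for ext, n in expected.items()
        if found.get(ext, 0) != n
    }


found = {"pl": 3, "ndx": 106, "ptx": 3073, "dot": 3095, "txt": 208}
print(shortfall(found))  # {'ptx': 474, 'dot': 490, 'txt': 48}
```

So on my copy 474 ptx, 490 dot, and 48 txt files are missing, while the pl and ndx counts match.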
I am sorry to cause you inconvenience, but I am looking forward to your reply.
Thank you for your interest in SMS-WSJ.
Let me answer your questions step by step:
First Problem:
The ndx file does not seem to be the problem. We have the same two lines in our file.
Did you run the setup for the Kaldi WSJ example? The problem might occur if you did not specify a working Kaldi WSJ data directory.
You should have the directory $KALDI_ROOT/egs/wsj/s5/data/local/data
If you have questions about how to set up Kaldi, please refer to the Kaldi repository.
If you have questions specifically about the required steps in the Kaldi WSJ run script, please open a new issue.
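A quick way to confirm that the directory mentioned above is in place is a check like the following. It is a minimal sketch that assumes the `KALDI_ROOT` environment variable is set; the function names are hypothetical and not part of sms_wsj.

```python
import os
from pathlib import Path


def kaldi_wsj_data_dir(env=os.environ) -> Path:
    """Build $KALDI_ROOT/egs/wsj/s5/data/local/data from the environment."""
    kaldi_root = env.get("KALDI_ROOT")
    if not kaldi_root:
        raise RuntimeError("KALDI_ROOT is not set")
    return Path(kaldi_root) / "egs" / "wsj" / "s5" / "data" / "local" / "data"


def check_kaldi_wsj_data_dir(env=os.environ) -> Path:
    """Return the data directory, or raise if it does not exist."""
    data_dir = kaldi_wsj_data_dir(env)
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"{data_dir} not found; the Kaldi WSJ data preparation may not have been run"
        )
    return data_dir
```

Running `check_kaldi_wsj_data_dir()` before creating the JSON would fail early with a readable message instead of a KeyError later on.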
Second Problem:
This is just a warning and can be ignored; we deliberately chose not to use an observer here. However, we will discuss whether we can avoid the warning in a future update.
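If the message is distracting in the meantime, one possible workaround is a standard-library logging filter. This is only a sketch and not something SMS-WSJ does: it assumes the warning is emitted through a Python logger named `Create wsj json` (read off the `WARNING:Create wsj json:` prefix), and `silence_observer_warning` is a hypothetical helper name.

```python
import logging


class DropMessage(logging.Filter):
    """Reject log records whose message contains the given text."""

    def __init__(self, text: str):
        super().__init__()
        self.text = text

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; everything else passes through.
        return self.text not in record.getMessage()


def silence_observer_warning(logger_name: str = "Create wsj json") -> None:
    """Attach the filter to the logger assumed to emit the warning."""
    logging.getLogger(logger_name).addFilter(
        DropMessage("No observers have been added to this run")
    )


silence_observer_warning()
```

The filter only hides this one message; other warnings from the same logger are still shown.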
Third Problem:
If you are missing some essential data, the script should raise an error. Therefore, I would assume the missing data are not a problem going forward.
Thank you for your advice.
I managed to create the dataset.