manojpamk/pytorch_xvectors

Could not find common file: exp/xvector_nnet_1a/egs//egs.1.ark

Closed this issue · 13 comments

Hi, I was trying to train the model and it crashed at stage 6

Namespace(baseLR=0.001, batchSize=32, featDim=30, featDir='exp/xvector_nnet_1a/egs/', local_rank=0, logStepSize=200, maxLR=0.002, modelType='xvecTDNN', noiseEps=1e-05, numArchives=84, numEgsPerArk=366150, numEpochs=2, numSpkrs=7323, optimMomentum=0.5, pDropMax=0.2, preFetchRatio=30, preTrainedModelDir=None, protoEpisodesPerArk=25, protoMaxClasses=35, protoMinClasses=5, resumeModelDir=None, stepFrac=0.5, supportFrac=0.7, totalEpisodes=100, trainingMode='init')
Initializing Model..
Reading from archive 1
Traceback (most recent call last):
  File "train_xent.py", line 69, in <module>
    for _,(X, Y) in par_data_loader:
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 284, in __iter__
    with ext_open(self.ark_or_pipe, "rb") as fd:
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 106, in __enter__
    self.fd = _fopen(self.fname, self.mode)
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 79, in _fopen
    "Could not find common file: {}".format(fname))
FileNotFoundError: Could not find common file: exp/xvector_nnet_1a/egs//egs.1.ark

I don't have the directory exp/xvector_nnet_1a. Do you know what might cause this problem?

I have the same question. Could you tell me how to solve it?

Hello,

The path exp/xvector_nnet_1a/egs/egs.1.ark should be replaced with the nnet3-egs files prepared by the get-egs command. The nnet3-egs files contain data suitable for DNN training.

Unfortunately, I cannot share this data directly. You can download it from the authors (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and place it on your computer. Make sure to provide the links here.

Manoj

Hi, thank you for the advice. I already have the VoxCeleb1 and VoxCeleb2 datasets, MUSAN, and RIR on my disk, and I have updated the paths in pytorch_run.sh. But the problem still exists. When I look at the project directory, there isn't any xvector_nnet_1a folder under exp/. It seems the egs files are either not generated or not located there. What might cause this?

Hello,

If I understand correctly, the script fails at the get_egs.sh command. As far as this command is concerned, exp/xvector_nnet_1a/egs/ is an output directory. You can replace this with wherever you'd like to create the egs.*.ark files - ideally someplace with >400G space.

Just make sure to use the same path in the next step (train_xent.py)

Manoj
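
To illustrate the point about using the same path in both steps, here is a minimal sketch of a pre-flight check that could be run before train_xent.py. It is not part of the repository: the helper names `find_egs_archives` and `check_egs_dir` are hypothetical, and the `egs.*.ark` naming is taken only from the error message and the reply above.

```python
import glob
import os


def find_egs_archives(feat_dir):
    """Return the sorted list of egs.*.ark files found directly in feat_dir."""
    pattern = os.path.join(feat_dir, "egs.*.ark")
    return sorted(glob.glob(pattern))


def check_egs_dir(feat_dir):
    """Raise a descriptive error if feat_dir is missing or holds no egs archives."""
    if not os.path.isdir(feat_dir):
        raise FileNotFoundError(
            "egs directory does not exist: {}".format(feat_dir))
    archives = find_egs_archives(feat_dir)
    if not archives:
        raise FileNotFoundError(
            "no egs.*.ark files in {}; did get_egs.sh succeed?".format(feat_dir))
    return archives
```

Calling `check_egs_dir("exp/xvector_nnet_1a/egs/")` with the directory given to get_egs.sh would fail early with a clearer message than the one raised from inside the data loader.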

Hi, thanks for your reply. But I can't find get_egs.sh and train_xent.py in your project, so my problem still exists.

Hi,

get_egs.sh is part of Kaldi; it becomes available once you create the softlink for the sid directory at the beginning of pytorch_run.sh.
train_xent.py is available in this repo.
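
A quick way to tell whether the softlink is the problem is to check the link and the script separately. The sketch below is only an illustration under assumptions: `script_status` is a hypothetical helper, and the script path sid/nnet3/xvector/get_egs.sh is the one that appears in the stage 6 log of this thread.

```python
import os


def script_status(script_path, link_dir):
    """Classify why script_path may be unavailable.

    Returns "ok", "broken-link", "no-link", or "missing-script".
    """
    if os.path.isfile(script_path):
        return "ok"
    if os.path.islink(link_dir) and not os.path.exists(link_dir):
        # The symlink itself exists, but its target does not.
        return "broken-link"
    if not os.path.exists(link_dir):
        # Neither a directory nor a symlink with this name exists.
        return "no-link"
    # The directory resolves, but the script is not inside it.
    return "missing-script"
```

For example, `script_status("sid/nnet3/xvector/get_egs.sh", "sid")` run from the project directory would report "broken-link" if the softlink points at a Kaldi path that does not exist on the machine.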

Hi, thank you so much for your time.
I think the script failed at line 205 (train_xent.py exp/xvector_nnet_1a/egs/), not at the get_egs.sh command. Here's my full log for stage 6:

sid/nnet3/xvector/get_egs.sh --cmd run.pl --nj 8 --stage 0 --frames-per-iter 1000000000 --frames-per-iter-diagnostic 100000 --min-frames-per-chunk 200 --max-frames-per-chunk 400 --num-diagnostic-archives 3 --num-repeats 50 data/train_combined_no_sil exp/xvector_nnet_1a/egs/
sid/nnet3/xvector/get_egs.sh: expected file data/train_combined_no_sil/feats.scp
Namespace(baseLR=0.001, batchSize=32, featDim=30, featDir='exp/xvector_nnet_1a/egs/', local_rank=0, logStepSize=200, maxLR=0.002, modelType='xvecTDNN', noiseEps=1e-05, numArchives=84, numEgsPerArk=366150, numEpochs=2, numSpkrs=7323, optimMomentum=0.5, pDropMax=0.2, preFetchRatio=30, preTrainedModelDir=None, protoEpisodesPerArk=25, protoMaxClasses=35, protoMinClasses=5, resumeModelDir=None, stepFrac=0.5, supportFrac=0.7, totalEpisodes=100, trainingMode='init')
Initializing Model..
Reading from archive 1
Traceback (most recent call last):
  File "train_xent.py", line 69, in <module>
    for _,(X, Y) in par_data_loader:
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 284, in __iter__
    with ext_open(self.ark_or_pipe, "rb") as fd:
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 106, in __enter__
    self.fd = _fopen(self.fname, self.mode)
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/kaldi_python_io/inst.py", line 79, in _fopen
    "Could not find common file: {}".format(fname))
FileNotFoundError: Could not find common file: exp/xvector_nnet_1a/egs//egs.1.ark
Traceback (most recent call last):
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/tjw/anaconda3/envs/xvec/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/tjw/anaconda3/envs/xvec/bin/python', '-u', 'train_xent.py', '--local_rank=0', 'exp/xvector_nnet_1a/egs/']' returned non-zero exit status 1.

As the traceback shows, the error occurred in the Python script: File "train_xent.py", line 69, in <module>

Hello, have you managed to run this project successfully?

The second line of the output indicates that feats.scp is missing, hence get_egs.sh did not actually succeed.
The error logged by train_xent.py is a consequence of that failure.
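
To catch this kind of cascade earlier, the data directory could be checked before get_egs.sh is launched. The following is a minimal sketch, not the repository's code: `missing_data_files` is a hypothetical helper, and feats.scp is the only required file confirmed by the log above, so the default list contains nothing else.

```python
import os


def missing_data_files(data_dir, required=("feats.scp",)):
    """Return the names in `required` that are absent or empty in data_dir."""
    missing = []
    for name in required:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    data_dir = "data/train_combined_no_sil"  # path taken from the log above
    missing = missing_data_files(data_dir)
    if missing:
        print("{} is missing: {}; an earlier stage probably failed".format(
            data_dir, ", ".join(missing)))
    else:
        print("{} looks ready for get_egs.sh".format(data_dir))
```

If the check reports feats.scp as missing, the place to look is the earlier stage that should have written it, not train_xent.py.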

Not yet. I switched to the voxceleb v2 demo provided by Kaldi, which is also an x-vector implementation. I hope this helps.

Hi, thanks. I checked data/train_combined_no_sil/ and there is no file named feats.scp. But I still don't understand why it is missing.
The only things I changed in your repository before running were voxceleb1_root and voxceleb2_root in pytorch_run.sh. What else do I need to do to run this project?

Yes, I have the same problem as you. But today I found out that my VoxCeleb dataset was not in the right directory structure, which may have caused the data to be read incorrectly. So I'm adjusting the file structure of the dataset.

May I know what structure you have now? And does it work?