MagicHub-io/MagicData-RAMC

SD: something wrong while running step3----Do Speaker Embedding Extractor

axuan731 opened this issue · 13 comments

When I started running step 3, the job appeared to loop indefinitely and I had to stop it. The log shows: "Removing initializer 'metric.weight'. It is not used by any node and should be removed from the model." Thanks in advance for helping me solve this problem!

python VBx/predict.py --in-file-list ./data/magicdata160h_dev_test/dia_part/exp/CTS-CN-F2F-2019-11-15-160_wav_list.txt --in-lab-dir ./data/magicdata160h_dev_test/dia_part/vad --in-wav-dir /home/duyuxuan/dataset/WAV/ --out-ark-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.ark --out-seg-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.seg --weights VBx/models/ResNet101_16kHz/nnet/final.onnx --backend onnx
Started at Thu Aug 19 13:02:50 CST 2021

2021-08-19 13:04:08.147775185 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'metric.weight'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.636777410 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.639711282 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641722807 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641747597 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.1.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641755698 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.1.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641762960 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.0.shortcut.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641774343 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.0.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641783351 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.9.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641791103 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.9.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641799203 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.9.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641807164 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.8.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641817709 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.12.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641825949 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.12.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641837262 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer1.0.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641845363 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.21.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641854999 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.3.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641865474 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer1.1.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641886005 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.15.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641894595 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.19.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641901578 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.17.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641914357 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.1.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641922527 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.6.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641929581 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.11.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641938449 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.0.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641946550 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.10.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641955209 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.11.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641963310 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.0.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641971829 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer2.3.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641978463 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer2.3.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641985726 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer2.3.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641993826 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer1.1.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.642003743 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.20.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.642013799 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer1.0.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.642020083 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer3.20.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.

..........

CYFFF commented

Don't worry, these are only warnings. If you want to suppress them, see microsoft/onnxruntime#1899. Just wait for the progress bars in extract_embedding_xxx.log to reach 100%.
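To check that a log like the one above really contains only warnings, one option is to count the severity letter that onnxruntime prints inside the bracket (`W` for warning, `E` for error). This is a minimal sketch: the helper name `count_ort_severities` is illustrative and not part of the repository, and the sample lines are copied from the log in this issue.

```python
import re
from collections import Counter

# onnxruntime prefixes each message with "[<letter>:onnxruntime:", where the
# letter is one of V (verbose), I (info), W (warning), E (error), F (fatal).
_SEVERITY = re.compile(r"\[([VIWEF]):onnxruntime:")


def count_ort_severities(log_text):
    """Count onnxruntime log lines by severity letter."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _SEVERITY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = (
    "2021-08-19 13:04:08.147775185 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] "
    "Removing initializer 'metric.weight'. It is not used by any node and should be removed from the model.\n"
    "2021-08-19 13:04:10.636777410 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] "
    "Removing initializer 'layer4.2.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.\n"
)
counts = count_ort_severities(sample)
print(dict(counts))
```

If the count for `E` and `F` is zero, the onnxruntime messages in the log are not what stopped the job, and the cause should be looked for elsewhere (for example, memory).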

Hello, I still cannot resolve this error. When I run the extract_embeddings step without a GPU, I get the message "Removing initializer 'metric.weight'. It is not used by any node and should be removed from the model." This warning seems to make my job fail and my computer freeze. If I run extract_embeddings_gpu.sh on the GPU instead, I also get errors: "RuntimeError: CUDA error: shared object initialization failed" and "RuntimeError: CUDA error: out of memory". I tried changing cmd.sh to export train_cmd="run.pl -q all.q --mem 2G", but it still fails and hangs. I hope you can help. Thank you very much!

CYFFF commented

Does it run if you leave only one recording in the folder?

CYFFF commented

Could you send me a log? I suspect the machine may be running out of memory. This warning should not cause the job to fail.
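One way to test the out-of-memory hypothesis on a Linux machine is to read `MemAvailable` from `/proc/meminfo` before and during the run. This is a minimal sketch assuming a Linux host; `parse_meminfo` and `available_gib` are illustrative helpers, and the sample text is made up for the demonstration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(text):
    """Return MemAvailable in GiB, or None if the field is absent."""
    kib = parse_meminfo(text).get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)


# Made-up sample; on a real machine use: open("/proc/meminfo").read()
sample = (
    "MemTotal:       16384000 kB\n"
    "MemAvailable:    2097152 kB\n"
    "HugePages_Total:       0\n"
)
print(available_gib(sample))
```

If the available memory drops towards zero while several extraction jobs run in parallel, reducing the number of parallel jobs is the first thing to try.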

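@CYFFF Hello, my computer freezes when I run all recordings, on both GPU and CPU. I never had this problem when running neural networks before. Could you help me find out what is wrong? Thank you.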

1. A single recording fails to run, with the following error:

python VBx/predict.py --gpus 0 --in-file-list ./data/magicdata160h_dev_test/dia_part/exp/CTS-CN-F2F-2019-11-15-160_wav_list.txt --in-lab-dir ./data/magicdata160h_dev_test/dia_part/vad --in-wav-dir /home/duyuxuan/dataset/WAV/ --out-ark-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.ark --out-seg-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.seg --weights VBx/models/ResNet101_16kHz/nnet/raw_81.pth --backend pytorch
Started at Thu Aug 19 21:52:40 CST 2021
INFO:__main__:Using GPU: 0
INFO:__main__:Start: Processing file CTS-CN-F2F-2019-11-15-160:
0
filenames: ['CTS-CN-F2F-2019-11-15-160']
Finished the feature extracting (29803600,)
0%| | 0/387 [00:00<?, ?it/s]
0%| | 0/387 [00:02<?, ?it/s]
INFO:__main__:End: Processing file CTS-CN-F2F-2019-11-15-160: Elapsed: 3.0260753631591797 seconds
Traceback (most recent call last):
File "VBx/predict.py", line 198, in <module>
xvector = get_embedding(
File "VBx/predict.py", line 72, in get_embedding
spk_embeds = model(data)
TypeError: 'str' object is not callable
Accounting: time=3 threads=1
Ended (code 1) at Thu Aug 19 21:52:43 CST 2021, elapsed time 3 seconds

2. When running all recordings on the GPU, I get the following two errors:

0%| | 0/354 [00:00<?, ?it/s]
0%| | 0/354 [00:17<?, ?it/s]
INFO:__main__:End: Processing file CTS-CN-F2F-2019-11-15-514: Elapsed: 460.5842273235321 seconds
Traceback (most recent call last):
File "VBx/predict.py", line 184, in <module>
xvector = get_embedding(
File "VBx/predict.py", line 70, in get_embedding
data = torch.from_numpy(fea).to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Accounting: time=556 threads=1
Ended (code 1) at Thu Aug 19 17:52:16 CST 2021, elapsed time 556 seconds

0%| | 0/407 [00:00<?, ?it/s]
0%| | 0/407 [00:17<?, ?it/s]
INFO:__main__:End: Processing file CTS-CN-F2F-2019-11-15-1245: Elapsed: 374.02130150794983 seconds
Traceback (most recent call last):
File "VBx/predict.py", line 184, in <module>
xvector = get_embedding(
File "VBx/predict.py", line 70, in get_embedding
data = torch.from_numpy(fea).to(device)
RuntimeError: CUDA error: shared object initialization failed
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Accounting: time=555 threads=1
Ended (code 1) at Thu Aug 19 17:52:15 CST 2021, elapsed time 555 seconds

3. When running all recordings on the CPU, the output is as follows:

python VBx/predict.py --in-file-list ./data/magicdata160h_dev_test/dia_part/exp/CTS-CN-F2F-2019-11-15-160_wav_list.txt --in-lab-dir ./data/magicdata160h_dev_test/dia_part/vad --in-wav-dir /home/duyuxuan/dataset/WAV/ --out-ark-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.ark --out-seg-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-160.seg --weights VBx/models/ResNet101_16kHz/nnet/final.onnx --backend onnx
Started at Thu Aug 19 13:02:50 CST 2021

2021-08-19 13:04:08.147775185 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'metric.weight'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.636777410 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.639711282 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641722807 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.2.bn1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641747597 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.1.bn3.num_batches_tracked'. It is not used by any node and should be removed from the model.
2021-08-19 13:04:10.641755698 [W:onnxruntime:, graph.cc:3211 CleanUnusedInitializers] Removing initializer 'layer4.1.bn2.num_batches_tracked'. It is not used by any node and should be removed from the model.
....

CYFFF commented

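I tested again. The pytorch backend does have the problem you described: line 17 of extract_embeddings_gpu.sh was missing the argument --model ResNet101, which is now fixed. The onnx backend works fine; it prints the log you described but still produces results.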
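For readers wondering how a missing `--model` argument can surface as `TypeError: 'str' object is not callable`, here is one plausible mechanism. This is a guess for illustration only, not the actual code of `VBx/predict.py`: it assumes the script starts with a placeholder string for `model` and replaces it only when a model class is named. `build_model` and `require_callable` are hypothetical helpers.

```python
import argparse


def build_model(argv):
    """Hypothetical stand-in for the model setup in a predict script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=None)
    args = parser.parse_args(argv)

    model = ""  # assumed placeholder, replaced only when --model is given
    if args.model is not None:
        # Stand-in for constructing the real network (e.g. ResNet101).
        model = lambda data: [0.0 for _ in data]
    return model


def require_callable(model):
    """Fail early with a clear message instead of a TypeError mid-run."""
    if not callable(model):
        raise SystemExit("model was not constructed; was --model passed?")
    return model


broken = build_model([])  # --model omitted, so model is still a str
try:
    broken([1, 2, 3])
except TypeError as err:
    print(err)

working = require_callable(build_model(["--model", "ResNet101"]))
print(working([1, 2, 3]))
```

A guard like `require_callable` right after argument parsing would turn this kind of misconfiguration into an immediate, readable error.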

CYFFF commented

The repository has been updated. Thanks for reporting this. Could you check whether there are any other problems?

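Hello, extracting a single recording now works without problems, but extracting all recordings produces the output below and the job fails, on both CPU and GPU. I have already changed cmd to export train_cmd="run.pl --mem 1G" and set nj for feature extraction to 1. How can I adjust the number of jobs, or what else should I change? Thank you for your answer!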
bash: line 1: 19102 Killed ( python VBx/predict.py --in-file-list ./data/magicdata160h_dev_test/dia_part/exp/CTS-CN-F2F-2019-11-15-1144_wav_list.txt --in-lab-dir ./data/magicdata160h_dev_test/dia_part/vad --in-wav-dir /home/duyuxuan/dataset/WAV/ --out-ark-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-1144.ark --out-seg-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-1144.seg --weights VBx/models/ResNet101_16kHz/nnet/final.onnx --backend onnx ) 2>> ./data/magicdata160h_dev_test/dia_part/exp/extract_embedding_CTS-CN-F2F-2019-11-15-1144.log >> ./data/magicdata160h_dev_test/dia_part/exp/extract_embedding_CTS-CN-F2F-2019-11-15-1144.log
bash: line 1: 19827 Killed ( python VBx/predict.py --in-file-list ./data/magicdata160h_dev_test/dia_part/exp/CTS-CN-F2F-2019-11-15-842_wav_list.txt --in-lab-dir ./data/magicdata160h_dev_test/dia_part/vad --in-wav-dir /home/duyuxuan/dataset/WAV/ --out-ark-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-842.ark --out-seg-fn ./data/magicdata160h_dev_test/dia_part/embedding/CTS-CN-F2F-2019-11-15-842.seg --weights VBx/models/ResNet101_16kHz/nnet/final.onnx --backend onnx ) 2>> ./data/magicdata160h_dev_test/dia_part/exp/extract_embedding_CTS-CN-F2F-2019-11-15-842.log >> ./data/magicdata160h_dev_test/dia_part/exp/extract_embedding_CTS-CN-F2F-2019-11-15-842.log
..........
@CYFFF

CYFFF commented

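It is probably running out of RAM or GPU memory. I'll put together a script later that extracts the recordings one at a time.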
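A one-at-a-time loop could look roughly like the sketch below. It reuses the `VBx/predict.py` flags that appear in the commands in this thread. The helper names, the keyword arguments for the directories, and the stop-on-first-failure behaviour are assumptions for illustration, not the script that was later committed to the repository.

```python
import subprocess
import sys


def build_predict_cmd(rec_id, exp_dir, vad_dir, wav_dir, emb_dir, weights, backend="onnx"):
    """Build the predict.py command line for a single recording."""
    return [
        sys.executable, "VBx/predict.py",
        "--in-file-list", f"{exp_dir}/{rec_id}_wav_list.txt",
        "--in-lab-dir", vad_dir,
        "--in-wav-dir", wav_dir,
        "--out-ark-fn", f"{emb_dir}/{rec_id}.ark",
        "--out-seg-fn", f"{emb_dir}/{rec_id}.seg",
        "--weights", weights,
        "--backend", backend,
    ]


def extract_sequentially(rec_ids, run=subprocess.run, **paths):
    """Run one recording at a time, so only one process holds memory at once.

    Returns the first recording id that failed, or None if all succeeded.
    """
    for rec_id in rec_ids:
        result = run(build_predict_cmd(rec_id, **paths))
        if result.returncode != 0:
            return rec_id
    return None


# Show the command for one recording without executing it.
base = "./data/magicdata160h_dev_test/dia_part"
cmd = build_predict_cmd(
    "CTS-CN-F2F-2019-11-15-160",
    exp_dir=f"{base}/exp",
    vad_dir=f"{base}/vad",
    wav_dir="/home/duyuxuan/dataset/WAV/",
    emb_dir=f"{base}/embedding",
    weights="VBx/models/ResNet101_16kHz/nnet/final.onnx",
)
print(" ".join(cmd))
```

Because each recording runs in its own process, the memory used for one file is released before the next one starts, at the cost of reloading the model every time.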

CYFFF commented

It has been updated. I was too busy over the past few days and forgot about this.

CYFFF commented

Recordings are now processed one at a time. You can also adjust the number of threads used for the onnx forward pass via sess_options.intra_op_num_threads in VBx/predict.py.
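As a rough illustration of that setting, the sketch below creates an onnxruntime session with a fixed intra-op thread count. It assumes `sess_options` in `VBx/predict.py` is an `onnxruntime.SessionOptions` object; `clamp_threads` and `make_session` are illustrative helpers and not functions from the repository.

```python
import os


def clamp_threads(requested, cpu_count=None):
    """Clamp a requested thread count to the range [1, number of CPUs]."""
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, min(int(requested), cpus))


def make_session(model_path, num_threads=1):
    """Create a CPU InferenceSession with a fixed intra-op thread count."""
    import onnxruntime as ort  # imported here so clamp_threads works without it

    opts = ort.SessionOptions()
    # Number of threads used to parallelise work inside a single operator.
    opts.intra_op_num_threads = clamp_threads(num_threads)
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )


print(clamp_threads(64, cpu_count=8))
```

A call such as `make_session("VBx/models/ResNet101_16kHz/nnet/final.onnx", num_threads=2)` would then limit the forward pass to two threads. Fewer threads mainly lowers CPU load; running recordings one at a time is what limits peak memory.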