k2-fsa/icefall

Getting segmentation fault

Closed this issue · 9 comments

While running zipformer training, I get a segmentation fault partway through. The error does not always occur, only occasionally.

Sometimes it occurs during prepare stage, sometimes it occurs during training stage (after some epochs are completed). Attaching the error log.

../Objects/typeobject.c:3690: type_traverse: Assertion failed: type_traverse() called on non-heap type 'HMAC'
Enable tracemalloc to get the memory block allocation traceback
object address : 0x5fee7a0
object refcount : 7
object type : 0x8da7c0
object type name: type
object repr : <class 'HMAC'>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized
Thread 0x00007fa9227f8640 (most recent call first):
File "/usr/lib/python3.9/threading.py", line 312 in wait
File "/usr/lib/python3.9/multiprocessing/queues.py", line 231 in _feed
File "/usr/lib/python3.9/threading.py", line 917 in run
File "/usr/lib/python3.9/threading.py", line 980 in _bootstrap_inner
File "/usr/lib/python3.9/threading.py", line 937 in _bootstrap
Thread 0x00007fa922ff9640 (most recent call first):
File "/usr/lib/python3.9/threading.py", line 312 in wait
File "/usr/lib/python3.9/multiprocessing/queues.py", line 231 in _feed
File "/usr/lib/python3.9/threading.py", line 917 in run
File "/usr/lib/python3.9/threading.py", line 980 in _bootstrap_inner
File "/usr/lib/python3.9/threading.py", line 937 in _bootstrap
Thread 0x00007faab1a4d640 (most recent call first):

Thread 0x00007fab4ddc1640 (most recent call first):
File "/usr/lib/python3.9/threading.py", line 316 in wait
File "/usr/lib/python3.9/queue.py", line 180 in get
File "/usr/local/lib/python3.9/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
File "/usr/local/lib/python3.9/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
File "/usr/lib/python3.9/threading.py", line 980 in _bootstrap_inner
File "/usr/lib/python3.9/threading.py", line 937 in _bootstrap
Current thread 0x00007fac0829f000 (most recent call first):
File "/workspace/lhotse/lhotse/audio.py", line 290 in from_dict
File "/workspace/lhotse/lhotse/audio.py", line 924 in
File "/workspace/lhotse/lhotse/audio.py", line 924 in from_dict
File "/workspace/lhotse/lhotse/cut/mono.py", line 308 in from_dict
File "/workspace/lhotse/lhotse/serialization.py", line 552 in deserialize_item
File "/workspace/lhotse/lhotse/lazy.py", line 216 in iter
File "/workspace/lhotse/lhotse/lazy.py", line 165 in values
File "/workspace/lhotse/lhotse/lazy.py", line 165 in values
File "/workspace/lhotse/lhotse/utils.py", line 918 in streaming_shuffle
File "/workspace/lhotse/lhotse/dataset/sampling/dynamic.py", line 365 in iter
File "/workspace/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 415 in _collect_cuts_in_buckets
File "/workspace/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 405 in iter
File "/workspace/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 248 in _next_batch
File "/workspace/lhotse/lhotse/dataset/sampling/base.py", line 265 in next
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 618 in _next_index
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1339 in _try_put_index
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1357 in _process_data
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1313 in _next_data
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 628 in next
File "/builds/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 965 in train_one_epoch
File "/builds/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1401 in run
File "/builds/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1523 in main
File "/builds/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1530 in
/usr/bin/bash: line 211: 37 Aborted (core dumped) ./zipformer/train.py --num-epochs $EPOCH --start-epoch $START_EPOCH --av $AVG --use-fp16 1 --enable-musan False --exp-dir zipformer/$EXP_DIR/${AUTO} --causal $CAUSAL --num-encoder-layers $NUM_ENCODER_LAYERS --feedforward-dim $FEEDFORWARD_DIM --encoder-dim $ENCODER_DIM --encoder-unmasked-dim $ENCODER_UNMASKED_DIM --base-lr 0.04 --full-libri 1 --bpe-model data/${AUTO}/lang_bpe_500/bpe.model --manifest-dir data/${AUTO}/fbank

JinZr commented

I am using lhotse version v1.16.

JinZr commented

Thanks @JinZr. I will check and get back to you.

Typically, a segmentation fault in the Lhotse dataloader means CPU OOM. You can also try decreasing the buffer size and/or the shuffle buffer size.
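
For reference, a minimal sketch of where those two sizes can be set, assuming the training script builds lhotse's DynamicBucketingSampler (the traceback above goes through dynamic_bucketing.py) and that the sampler accepts buffer_size and shuffle_buffer_size keyword arguments. The helper names sampler_kwargs and build_sampler are hypothetical, and the numbers are illustrative only, not recommended values.

```python
from typing import Any, Dict


def sampler_kwargs(
    max_duration: float,
    num_buckets: int,
    buffer_size: int,
    shuffle_buffer_size: int,
) -> Dict[str, Any]:
    """Collect keyword arguments for the sampler (hypothetical helper)."""
    if buffer_size <= 0 or shuffle_buffer_size <= 0:
        raise ValueError("buffer sizes must be positive")
    return {
        "max_duration": max_duration,
        "num_buckets": num_buckets,
        "shuffle": True,
        "drop_last": True,
        # Smaller values keep fewer cuts in memory at once,
        # at the cost of less thorough bucketing and shuffling.
        "buffer_size": buffer_size,
        "shuffle_buffer_size": shuffle_buffer_size,
    }


def build_sampler(cuts, **kwargs):
    """Build the sampler; lhotse is imported lazily (hypothetical helper)."""
    from lhotse.dataset import DynamicBucketingSampler

    return DynamicBucketingSampler(cuts, **kwargs)


# Illustrative values only; tune them to the RAM available on the machine.
small = sampler_kwargs(
    max_duration=600.0,
    num_buckets=30,
    buffer_size=5000,
    shuffle_buffer_size=10000,
)
```

With a CutSet loaded as cuts, build_sampler(cuts, **small) would then produce the sampler passed to the DataLoader. If the recipe's data module hard-codes these values, they would need to be edited there.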

dmesg is a good friend to check if there was an OOM.
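
A small sketch of that check, assuming a Linux host where the dmesg command is available and readable by the current user. The search patterns are an assumption about how OOM-killer messages are commonly worded and may need adjusting for a given kernel; find_oom_lines and read_dmesg are hypothetical helper names.

```python
import re
import subprocess
from typing import List

# Assumed wording of typical OOM-killer messages; adjust if your kernel differs.
OOM_PATTERN = re.compile(
    r"out of memory|oom-kill|invoked oom-killer|oom_reaper", re.IGNORECASE
)


def find_oom_lines(log_text: str) -> List[str]:
    """Return the lines of a kernel log that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def read_dmesg() -> str:
    """Return dmesg output, or an empty string if dmesg cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    hits = find_oom_lines(read_dmesg())
    print(f"{len(hits)} possible OOM line(s) found")
    for line in hits:
        print(line)
```

An empty result is not conclusive on its own: inside a container, or without permission to read the kernel log, dmesg may show nothing even if a process was killed.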

I installed the latest version of lhotse, v1.24.1. I am still getting the segmentation fault.

I checked CPU memory usage with htop; memory is still available while the training is running.
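
Watching htop only shows memory at the moment you look, so a short spike can be missed. One way to complement it is to log the peak resident memory of the training process from inside the script. This is a minimal sketch using Python's standard resource module, assuming a Unix system; peak_rss_mib is a hypothetical helper, and it covers only the calling process, not the dataloader worker processes.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Calling it every few hundred batches and writing the value to the training log shows whether memory keeps climbing before the crash.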

@pzelasko how can I decrease the buffer size and/or the shuffle buffer size?

JinZr commented