yoshitomo-matsubara/torchdistill

Segmentation fault encountered when entering the second epoch with num_workers>0

RulinShao opened this issue · 5 comments

Hi, thanks for your code. I encountered this issue when running the KD training script (i.e., resnet34 -> resnet18). Something seems to be wrong with the data loader workers. The log is as follows:

2021/04/28 08:41:23 INFO main Updating ckpt (Best top1 accuracy: 0.0000 -> 20.4760)
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "kd_main.py", line 183, in
main(argparser.parse_args())
File "kd_main.py", line 165, in main
train(teacher_model, student_model, dataset_dict, ckpt_file_path, device, device_ids, distributed, config, args)
File "kd_main.py", line 124, in train
train_one_epoch(training_box, device, epoch, log_freq)
File "kd_main.py", line 61, in train_one_epoch
metric_logger.log_every(training_box.train_data_loader, log_freq, header):
File "/home/ec2-user/.local/lib/python3.7/site-packages/torchdistill/misc/log.py", line 153, in log_every
for obj in iterable:
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 355, in iter
return self._get_iterator()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 940, in init
self._reset(loader, first_iter=True)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 971, in _reset
self._try_put_index()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1205, in _try_put_index
index = self._next_index()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 508, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in iter
for idx in self.sampler:
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 125, in iter
yield from torch.randperm(n, generator=self.generator).tolist()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 48668) is killed by signal: Segmentation fault.

Hopefully someone can help me address this issue. Thanks!

Hi @RulinShao

I have never seen this error before and would need more details.
Could you clarify:

  1. whether or not kd_main.py is identical to the examples/image_classification.py I provided,
  2. the exact command you used to run kd_main.py and the config file, and
  3. your environment info (OS, versions of Python, torch, torchvision, torchdistill, etc.)?

Also, the following link may help you, as this could be an environmental issue:
facebookresearch/detectron2#954

Thank you

Hi @yoshitomo-matsubara

Thanks for your reply! kd_main.py is identical to examples/image_classification.py, and I ran it with

python3 kd_main.py --config configs/ilsvrc2012/single_stage/kd/resnet18_from_resnet34.yaml --log log/ilsvrc2012/kd/resnet_from_vit.txt

where the config file is identical to torchdistill/configs/sample/ilsvrc2012/single_stage/kd/resnet18_from_resnet152.yaml, except that I changed the teacher model from resnet152 to resnet34 for faster debugging.

Some details of the environment:
Python==3.7.9,
torch==1.8.1+cu102,
torchvision==0.9.1+cu102,
and torchdistill was git cloned just a few days ago.

And thanks for the link, which says that the environment issue there was solved by installing some packages with sudo apt-get install. However, since I use Amazon Linux, I have to use yum install instead of apt-get install, and those packages are not found by yum. I suspect it is indeed caused by the dependencies of some low-level libraries that I know little about, so I would appreciate your help in finding out what the problem is. Currently, with num_workers > 0 I can only train one epoch at a time and rerun the script while loading the checkpoint, or I have to set num_workers=0, which is quite time-consuming.
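In case it matters, the only place I change num_workers is the data loader section of the config, roughly like this (structure copied from the sample config as far as I understand it; dataset ids and batch sizes elided):

```yaml
train:
  train_data_loader:
    dataset_id: ...        # unchanged from the sample config
    random_sample: True
    batch_size: ...
    num_workers: 0         # 0 avoids the crash but is slow; any value > 0 segfaults at the second epoch
    cache_output:
  val_data_loader:
    dataset_id: ...
    random_sample: False
    batch_size: ...
    num_workers: 0
```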

Thanks again for your help!

Hi @RulinShao

Thank you for the info.
I found this discussion, which may be useful for you: one of the users in the thread provides a solution for CentOS, and it should also apply to Amazon Linux since both are based on RHEL.

Also, ICYMI, most of the config files under torchdistill/configs/sample/ are not tuned but are meant for debugging, as described in torchdistill/configs/. If you want to see improvements over standard training after debugging, you should either tune the hyperparameters in the config file or use some of the configs under torchdistill/configs/official/.

Hope this helps

@RulinShao

Recently, I faced a similar error with a different project in a new environment and resolved it by stopping the thread control in evaluate.
Please fetch and try the updated image_classification.py.
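For context, by "thread control" I mean the torchvision-style pattern of temporarily forcing PyTorch down to a single CPU thread while evaluating. A minimal sketch of the idea (the control_threads flag and the accuracy loop below are only illustrative, not the actual code in image_classification.py):

```python
import torch


@torch.no_grad()
def evaluate(model, data_loader, device, control_threads=False):
    # Illustrative sketch: 'control_threads' is a hypothetical flag, not torchdistill's API.
    if control_threads:
        num_threads = torch.get_num_threads()
        torch.set_num_threads(1)  # the call that appears to clash with multi-worker data loaders

    model.eval()
    correct, total = 0, 0
    for images, targets in data_loader:
        images, targets = images.to(device), targets.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.size(0)

    if control_threads:
        torch.set_num_threads(num_threads)  # restore the original thread count
    return correct / total
```

The updated example simply leaves the thread count alone, i.e., it no longer calls torch.set_num_threads around evaluation.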

Thanks a lot for your help!!! Stopping the thread control works for me. I'm closing this issue now.