Segmentation fault encountered when entering the second epoch with num_workers>0
RulinShao opened this issue · 5 comments
Hi, thanks for your code. I encountered this issue when running the KD training script (i.e., resnet34 -> resnet18). Something seems to be wrong with the data loader workers. The log is as follows:
2021/04/28 08:41:23 INFO main Updating ckpt (Best top1 accuracy: 0.0000 -> 20.4760)
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "kd_main.py", line 183, in
main(argparser.parse_args())
File "kd_main.py", line 165, in main
train(teacher_model, student_model, dataset_dict, ckpt_file_path, device, device_ids, distributed, config, args)
File "kd_main.py", line 124, in train
train_one_epoch(training_box, device, epoch, log_freq)
File "kd_main.py", line 61, in train_one_epoch
metric_logger.log_every(training_box.train_data_loader, log_freq, header):
File "/home/ec2-user/.local/lib/python3.7/site-packages/torchdistill/misc/log.py", line 153, in log_every
for obj in iterable:
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 355, in iter
return self._get_iterator()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 940, in init
self._reset(loader, first_iter=True)
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 971, in _reset
self._try_put_index()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1205, in _try_put_index
index = self._next_index()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 508, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in iter
for idx in self.sampler:
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 125, in iter
yield from torch.randperm(n, generator=self.generator).tolist()
File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 48668) is killed by signal: Segmentation fault.
Hopefully someone can help me address this issue. Thanks!
Hi @RulinShao
I have never seen the error before and would need more detail.
Could you clarify
- whether or not kd_main.py is identical to the examples/image_classification.py I provided,
- the exact command you used to run kd_main.py and the config file, and
- environment info (OS, versions of Python, torch, torchvision, torchdistill, etc.)?
Also, the following link could help you; it may be an environment issue:
facebookresearch/detectron2#954
Thank you
Thanks for your reply! The kd_main.py is identical to examples/image_classification.py, and I ran it with
python3 kd_main.py --config configs/ilsvrc2012/single_stage/kd/resnet18_from_resnet34.yaml --log log/ilsvrc2012/kd/resnet_from_vit.txt
where the config file is identical to torchdistill/configs/sample/ilsvrc2012/single_stage/kd/resnet18_from_resnet152.yaml, except that I changed the teacher model from resnet152 to resnet34 for faster debugging.
Some details of the environment:
Python==3.7.9,
torch==1.8.1+cu102,
torchvision==0.9.1+cu102,
and torchdistill was git-cloned a few days ago.
Thanks also for the link, where the environment issue was solved by installing packages with sudo apt-get install. However, since I use Amazon Linux, I have to use yum install instead of apt-get install, and those packages are not found by yum. I suspect the problem lies in the dependencies of some low-level libraries that I know little about, so I would appreciate your help in tracking it down. Currently, with num_workers>0 I can only train one epoch at a time and then rerun the script while loading the checkpoint, or set num_workers=0, which is quite time-consuming.
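(For reference, a minimal standalone sketch of the single-process fallback mentioned above; the dataset path and loader parameters are placeholders for illustration, since torchdistill actually builds its data loaders from the YAML config.)

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hypothetical standalone loader for debugging; torchdistill normally builds
# its data loaders from the YAML config instead.
train_dataset = datasets.ImageFolder(
    '/path/to/ilsvrc2012/train',  # placeholder path
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# num_workers=0 loads batches in the main process: slower, but a crash that
# would otherwise kill a worker subprocess now surfaces with a usable trace.
debug_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=0, pin_memory=True)

for images, targets in debug_loader:
    pass  # iterate once to check whether the failure reproduces without workers
```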
Thanks again for your help!
Hi @RulinShao
Thank you for the info.
This discussion may be useful: one of the users in the thread provides a solution for CentOS, which should also apply to Amazon Linux as both are based on RHEL.
Also, ICYMI, most of the config files under torchdistill/configs/sample are not tuned but are intended for debugging, as described in torchdistill/configs/. If you want to see improvements over standard training after debugging, you should either tune the hyperparameters in the config file or use some of the configs under torchdistill/configs/official/.
Hope this helps
Recently, I faced a similar error with a different project in a new environment and resolved it by stopping the thread control in evaluate. Fetch and try the updated image_classification.py.
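For context, a minimal sketch of what stopping the thread control could look like, assuming it refers to the torch.set_num_threads(1) save/restore pattern used in torchvision-style evaluation loops; the actual change in the updated image_classification.py may differ, and the metric bookkeeping here is simplified:

```python
import torch

@torch.no_grad()
def evaluate(model, data_loader, device):
    # The earlier pattern pinned intra-op parallelism during evaluation:
    #     num_threads = torch.get_num_threads()
    #     torch.set_num_threads(1)
    #     ... evaluation loop ...
    #     torch.set_num_threads(num_threads)
    # Per the comment above, dropping that thread control resolved the
    # DataLoader worker segfault seen when the next epoch spawns new workers.
    model.to(device)
    model.eval()
    correct, total = 0, 0
    for images, targets in data_loader:
        images, targets = images.to(device), targets.to(device)
        outputs = model(images)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
        total += targets.size(0)
    return correct / total
```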
Thanks a lot for your help!!! Stopping the thread control works for me. I'm closing this issue now.