Segmentation fault when trying to reproduce on Stanford 3D Dataset
FengZicai opened this issue · 0 comments
FengZicai commented
I encountered this problem when trying to run
./scripts/train_stanford.sh 4 "default" "--stanford3d_path ./Stanford3D"
When I set --num_workers to 0, it reports the following:
./scripts/train_stanford.sh: line 34: 30654 Segmentation fault python3 -m main --dataset StanfordArea5Dataset --batch_size $BATCH_SIZE --scheduler PolyLR --model Res16UNet34 --conv1_kernel_size 5 --log_dir $LOG_DIR --lr 1e-1 --max_iter 60000 --data_aug_color_trans_ratio 0.05 --data_aug_color_jitter_std 0.005 $3 2>&1
30655 Done | tee -a "$LOG"
When I set --num_workers to 1, it reports the following:
yq01-qianmo-com-127-2-22 12/29 14:21:28 ===> Start testing
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
fd = df.detach()
File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 620, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/spatiotemporalsegmentation/main.py", line 162, in <module>
main()
File "/spatiotemporalsegmentation/main.py", line 157, in main
test(model, test_data_loader, config)
File "/spatiotemporalsegmentation/lib/test.py", line 98, in test
coords, input, target = data_iter.next()
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 5202) exited unexpectedly
I tried it on two computers, with several different version combinations:
- cuda 11.1
- MinkowskiEngine 0.5.4
- pytorch 1.9.0
or
- cuda 10.2
- MinkowskiEngine 0.4.3
- pytorch 1.5.0 or 1.7.1 or 1.9.0 or 1.10.2
Could you please tell me which version of MinkowskiEngine I should use?
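To make sure the versions listed above are the ones actually loaded in the failing environment, a small script like the following can print them. This is only a sketch: `collect_versions` is a hypothetical helper, not part of this repository, and it assumes each package exposes a `__version__` attribute (it falls back to "unknown" otherwise).

```python
import importlib
import platform


def collect_versions(module_names):
    """Return {name: version string} for the interpreter and each module."""
    versions = {"python": platform.python_version()}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        except Exception as exc:  # e.g. a shared library that fails to load
            versions[name] = "import failed: " + type(exc).__name__
            continue
        # Assumption: the package exposes __version__; otherwise report "unknown".
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(["torch", "MinkowskiEngine"]).items():
        print(f"{name}: {version}")
```

Running it inside the py3-mink environment shows whether the interpreter picks up the intended builds, or whether a package fails to import at all.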
I also stepped through the code and found that the problem occurs at line 96 of lib/test.py:
coords, input, target = data_iter.next()
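One way to narrow this down further is to bypass the DataLoader and index the dataset directly in the main process, with `faulthandler` enabled. A segmentation fault in native code cannot be caught with `try`/`except`, but `faulthandler` dumps the Python stack at the moment of the crash, and the last index printed identifies the sample being loaded. This is a sketch under assumptions: `probe_dataset` is a hypothetical helper, and `dataset` stands for any object supporting `len()` and indexing, such as `test_data_loader.dataset`.

```python
import faulthandler
import sys


def probe_dataset(dataset, limit=None):
    """Load samples one at a time and return [(index, error)] for Python-level failures."""
    try:
        # On SIGSEGV, print the Python traceback to stderr before the process dies.
        faulthandler.enable()
    except (AttributeError, ValueError, OSError):
        pass  # stderr has no usable file descriptor (e.g. it is being captured)

    failures = []
    total = len(dataset) if limit is None else min(limit, len(dataset))
    for index in range(total):
        # Print before loading so the last line shown points at the crashing sample.
        print(f"loading sample {index}", file=sys.stderr, flush=True)
        try:
            dataset[index]
        except Exception as exc:  # ordinary Python errors are collected, not fatal
            failures.append((index, repr(exc)))
    return failures


if __name__ == "__main__":
    # Stand-in data; in the real script this would be test_data_loader.dataset.
    print(probe_dataset([10, 20, 30]))
```

If every sample loads cleanly this way, the crash is more likely in collation or in the first sparse-tensor operation than in reading the files.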
I have been troubled by this problem for several days. Could you please provide me with some ideas to solve this problem?