chrischoy/SpatioTemporalSegmentation

Segmentation fault when trying to reproduce on Stanford 3D Dataset

FengZicai opened this issue · 0 comments

I get a segmentation fault when I run:

./scripts/train_stanford.sh 4 "default" "--stanford3d_path ./Stanford3D"

When I set --num_workers to 0, it reports the following:

./scripts/train_stanford.sh: line 34: 30654 Segmentation fault python3 -m main --dataset StanfordArea5Dataset --batch_size $BATCH_SIZE --scheduler PolyLR --model Res16UNet34 --conv1_kernel_size 5 --log_dir $LOG_DIR --lr 1e-1 --max_iter 60000 --data_aug_color_trans_ratio 0.05 --data_aug_color_jitter_std 0.005 $3 2>&1
     30655 Done | tee -a "$LOG"

When I set --num_workers to 1, it reports the following:

yq01-qianmo-com-127-2-22 12/29 14:21:28 ===> Start testing
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 620, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/spatiotemporalsegmentation/main.py", line 162, in <module>
    main()
  File "/spatiotemporalsegmentation/main.py", line 157, in main
    test(model, test_data_loader, config)
  File "/spatiotemporalsegmentation/lib/test.py", line 98, in test
    coords, input, target = data_iter.next()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 5202) exited unexpectedly

I reproduced this on two computers and with several combinations of versions:

  • CUDA 11.1
  • MinkowskiEngine 0.5.4
  • PyTorch 1.9.0

or

  • CUDA 10.2
  • MinkowskiEngine 0.4.3
  • PyTorch 1.5.0, 1.7.1, 1.9.0, or 1.10.2

Could you please tell me which version of MinkowskiEngine I should use?

I also stepped through the code and found that the crash occurs on line 96 of lib/test.py:

        coords, input, target = data_iter.next()
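
One way to narrow this down further is to take the DataLoader out of the picture. The following is a minimal sketch, assuming the crash happens in native code while a single sample is being loaded; that assumption is not confirmed by the logs above. `probe_dataset` and `enable_crash_traceback` are hypothetical helper names, not part of this repository, and `dataset` stands for whatever dataset object is passed to the test DataLoader.

```python
import faulthandler
import sys


def enable_crash_traceback():
    """Ask Python to dump the traceback of all threads on SIGSEGV and similar signals."""
    faulthandler.enable(all_threads=True)


def probe_dataset(dataset, log=sys.stderr):
    """Index every sample directly, bypassing DataLoader workers and collation.

    A segmentation fault cannot be caught with try/except, so the index is
    written and flushed *before* each access: if the process dies, the last
    logged index is the sample that was being loaded. Ordinary Python
    exceptions are collected and returned instead of stopping the scan.
    """
    failures = []
    for idx in range(len(dataset)):
        print(f"loading sample {idx}", file=log, flush=True)
        try:
            dataset[idx]
        except Exception as exc:  # Python-level errors only, not segfaults
            failures.append((idx, repr(exc)))
    return failures
```

Calling `enable_crash_traceback()` first and then `probe_dataset(dataset)` should show whether one particular sample triggers the fault or whether every sample does, which would point at the installed libraries rather than the data. The same handler can be turned on without code changes through the interpreter's `-X faulthandler` option.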

I have been stuck on this problem for several days. Could you please suggest how I might solve it?