The error when training

Question

daixiaolei623 opened this issue 3 years ago · 4 comments

Thank you for your great work.
However, when I train maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml using the command:
./train_net.py --num-gpus 2 --config-file configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml

I got the following errors:
`MaskFormer Training Script.

This script is a simplified version of the training script in detectron2/tools.
: No such file or directory
import-im6.q16: not authorized copy' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized itertools' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized logging' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized os' @ error/constitute.c/WriteImage/1037.
from: can't read /var/mail/collections
from: can't read /var/mail/typing
import-im6.q16: not authorized torch' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized comm' @ error/constitute.c/WriteImage/1037.
from: can't read /var/mail/detectron2.checkpoint
from: can't read /var/mail/detectron2.config
from: can't read /var/mail/detectron2.data
from: can't read /var/mail/detectron2.engine
./train_net.py: line 21: syntax error near unexpected token ('
./train_net.py: line 21: from detectron2.evaluation import ('`

Could you please tell me what the problem is and how to solve it?
Thank you very much!

Answer 1 · 2021-10-09T19:50:12.000Z

Add python
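
Some context on why the first command failed: the script was launched directly as ./train_net.py, and the log suggests the shell ended up interpreting the Python source itself. That is what typically happens when an executable script has no `#!` interpreter line. In that situation `import` resolves to ImageMagick's `import` tool and `from` to a mail utility, which matches the `import-im6.q16` and `/var/mail/...` messages above. Prefixing the command with `python` avoids this. Below is a minimal sketch for checking whether a script can be run directly; the helper name `has_python_shebang` is hypothetical and not part of MaskFormer or detectron2.

```python
from pathlib import Path


def has_python_shebang(path):
    """Return True if the file's first line is a '#!' line that mentions python."""
    with Path(path).open("rb") as f:
        first_line = f.readline()
    # A script without such a line is handed to the shell when run as ./script.py.
    return first_line.startswith(b"#!") and b"python" in first_line
```

If `has_python_shebang("train_net.py")` returns False, the script should be started as `python ./train_net.py ...` rather than `./train_net.py ...`.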

Answer 2 · 2021-10-09T23:02:21.000Z

@bowenc0221
Thank you.
However, I have added python and installed cuda-11.1. I ran python ./train_net.py --num-gpus 2 --config-file /home/dai/code/semantic_segmentation/27/MaskFormer-master/configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml and got the following error:

`Command Line Args: Namespace(config_file='/home/dai/code/semantic_segmentation/27/MaskFormer-master/configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "./train_net.py", line 270, in <module>
args=(args,),
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 79, in launch
daemon=False,
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 95, in _distributed_worker
assert torch.cuda.is_available(), "cuda is not available. Please check your installation."
AssertionError: cuda is not available. Please check your installation.`
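
The assertion at the bottom of this traceback fires because `torch.cuda.is_available()` returned False inside the spawned worker, and the warning at the top shows that CUDA initialization itself failed. The thread does not record how this was eventually fixed, so the following is only a sketch of the checks one might run in the same conda environment. The function name `cuda_report` is hypothetical; the `torch` attributes it reads are standard PyTorch ones.

```python
import os


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in the current environment."""
    report = {"cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against; None for CPU-only builds.
    report["torch_built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["visible_gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Things worth comparing in the output, as possibilities rather than a confirmed diagnosis: whether `visible_gpu_count` is at least the value passed to --num-gpus, whether `CUDA_VISIBLE_DEVICES` points at GPUs that actually exist, and whether the CUDA version PyTorch was built with is supported by the installed driver.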

Answer 3 · 2021-10-10T14:41:32.000Z

@bowenc0221
Thank you, I have solved the above error, but my GPU is a 1080Ti, which runs out of memory. I want to train on the CPU instead; I have 64G of CPU memory.
Could you please tell me how to train it on CPU?
Thank you.

Answer 4 · 2021-10-12T18:30:11.000Z

You can try adding MODEL.DEVICE 'cpu' at the end of your command, but I have never tested it with CPU.
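
To illustrate how such a trailing override reaches the config: the `Namespace(...)` printed in the log above contains an `opts=[]` field, and in detectron2-style scripts the tokens after the named flags are collected there as KEY VALUE pairs and merged into the config. The sketch below reproduces only that collection step with the standard library. `build_parser` and `overrides_from_opts` are hypothetical stand-ins, not the actual detectron2 or MaskFormer code, and `config.yaml` is a placeholder.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the training script's argument parser,
    # reduced to the flags that appear in this thread.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    # Everything after the named flags is collected as raw KEY VALUE tokens.
    parser.add_argument("opts", nargs=argparse.REMAINDER, default=None)
    return parser


def overrides_from_opts(opts):
    """Pair up a flat [KEY, VALUE, KEY, VALUE, ...] list into a dict."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    return dict(zip(opts[0::2], opts[1::2]))


args = build_parser().parse_args(
    ["--num-gpus", "1", "--config-file", "config.yaml", "MODEL.DEVICE", "cpu"]
)
overrides = overrides_from_opts(args.opts)
print(overrides)
```

On the real script this would correspond to something like python ./train_net.py --num-gpus 1 --config-file <your config> MODEL.DEVICE cpu. Using --num-gpus 1 here is an assumption: the traceback earlier in the thread shows that the multi-GPU launch path asserts that CUDA is available. As noted in the answer, CPU training is untested, and it is likely to be very slow for a Swin-Large backbone.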