ewrfcas/MVSFormer

Hello author, training fails on a single 4090 GPU

chenhui2016 opened this issue · 0 comments

I could only get training to start after setting the config file parameters as shown in the image:
image

However, training then fails with an error at the following point:
image

The error message is as follows:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/ch/anaconda3/envs/pt/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/tensorboardX/event_file_writer.py", line 202, in run
data = self._queue.get(True, queue_wait_duration)
File "/home/ch/anaconda3/envs/pt/lib/python3.10/multiprocessing/queues.py", line 117, in get
res = self._recv_bytes()
File "/home/ch/anaconda3/envs/pt/lib/python3.10/multiprocessing/connection.py", line 221, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/ch/anaconda3/envs/pt/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/home/ch/anaconda3/envs/pt/lib/python3.10/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
Traceback (most recent call last):
File "/home/ch/sn_d/code/MVS/MVSFormer-main/train.py", line 191, in <module>
mp.spawn(main, nprocs=args.world_size, args=(args, config))
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/ch/sn_d/code/MVS/MVSFormer-main/train.py", line 146, in main
trainer.train()
File "/home/ch/sn_d/code/MVS/MVSFormer-main/base/base_trainer.py", line 78, in train
result = self._train_epoch(epoch)
File "/home/ch/sn_d/code/MVS/MVSFormer-main/trainer/mvsformer_trainer.py", line 164, in _train_epoch
self.scaler.step(self.optimizer)
File "/home/ch/anaconda3/envs/pt/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 412, in step
assert (
AssertionError: No inf checks were recorded for this optimizer.
Have you run into this problem when training on a GPU with 24 GB of memory? How can I fix it?
Finally, thank you very much for your work!