Issue during training VAE
Closed this issue · 4 comments
@ZENGXH Thanks again for the amazing work and for the quick replies maintaining this repo.
While training the VAE, I ran into this issue during the evaluation step:
2023-03-16 01:28:50.786 | ERROR | utils.utils:init_processes:1156 - An error has been caught in function 'init_processes', process 'Process-1' (38392), thread 'MainThread' (140360210575360):
Traceback (most recent call last):
File "/home/alberto/Documents/LION/train_dist.py", line 239, in <module>
p.start()
│ └ <function BaseProcess.start at 0x7fa8273c2170>
└ <Process name='Process-1' parent=38345 started>
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
│ │ │ │ └ <Process name='Process-1' parent=38345 started>
│ │ │ └ <staticmethod(<function Process._Popen at 0x7fa8271f12d0>)>
│ │ └ <Process name='Process-1' parent=38345 started>
│ └ None
└ <Process name='Process-1' parent=38345 started>
File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
│ │ └ <Process name='Process-1' parent=38345 started>
│ └ <function DefaultContext.get_context at 0x7fa8271f1480>
└ <multiprocessing.context.DefaultContext object at 0x7fa8273e68c0>
File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
return Popen(process_obj)
│ └ <Process name='Process-1' parent=38345 started>
└ <class 'multiprocessing.popen_fork.Popen'>
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
│ │ └ <Process name='Process-1' parent=38345 started>
│ └ <function Popen._launch at 0x7fa67ea62950>
└ <multiprocessing.popen_fork.Popen object at 0x7fa67eb5a110>
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
code = process_obj._bootstrap(parent_sentinel=child_r)
│ │ └ 7
│ └ <function BaseProcess._bootstrap at 0x7fa8273c2a70>
└ <Process name='Process-1' parent=38345 started>
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7fa8273c20e0>
└ <Process name='Process-1' parent=38345 started>
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ <Process name='Process-1' parent=38345 started>
│ │ │ └ (0, 2, <function main at 0x7fa67ea62560>, Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar1...
│ │ └ <Process name='Process-1' parent=38345 started>
│ └ <function init_processes at 0x7fa67ea617e0>
└ <Process name='Process-1' parent=38345 started>
> File "/home/alberto/Documents/LION/utils/utils.py", line 1156, in init_processes
fn(args, config)
│ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ └ Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar10', data='/tmp/nvae-diff/data', autocast_t...
└ <function main at 0x7fa67ea62560>
File "/home/alberto/Documents/LION/train_dist.py", line 84, in main
trainer.train_epochs()
│ └ <function BaseTrainer.train_epochs at 0x7fa7d4598040>
└ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>
File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 285, in train_epochs
eval_score = self.eval_nll(step=step, save_file=False)
│ │ └ 7599
│ └ <function BaseTrainer.eval_nll at 0x7fa7d4598700>
└ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>
File "/home/alberto/Documents/LION/my_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
│ │ └ {'step': 7599, 'save_file': False}
│ └ (<trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>,)
└ <function BaseTrainer.eval_nll at 0x7fa7d4598670>
File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 805, in eval_nll
results = compute_NLL_metric(
└ <function compute_NLL_metric at 0x7fa7e5c50af0>
File "/home/alberto/Documents/LION/utils/eval_helper.py", line 59, in compute_NLL_metric
pair_vis(gen_pcs[worse_ten], ref_pcs[worse_ten],
│ │ │ │ └ tensor([266, 263, 265, 51, 122, 91, 323, 298, 101, 319], device='cuda:0')
│ │ │ └ tensor([[[-3.3173e-02, 4.3725e-02, -7.8650e-02],
│ │ │ [-3.2106e-02, -7.7591e-02, -2.6546e-02],
│ │ │ [-7.3885e-03, -6...
│ │ └ tensor([266, 263, 265, 51, 122, 91, 323, 298, 101, 319], device='cuda:0')
│ └ tensor([[[-0.0330, 0.0436, -0.0783],
│ [-0.0322, -0.0777, -0.0267],
│ [-0.0076, -0.0620, 0.0369],
│ .....
└ <function pair_vis at 0x7fa7e5c50940>
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
2023-03-16 01:28:50.928 | INFO | __main__:<module>:243 - join 1
But for some reason it didn't stop the training; the terminal was hanging there.
I changed line 59 in eval_helper.py
to
pair_vis(gen_pcs[worse_ten].to(device), ref_pcs[worse_ten].to(device),
         titles, subtitles, writer, step=step)
Let's see if this works. Also, how can I change the eval set to be smaller than 256, or bigger?
It seems worse_ten
is a CUDA tensor and gen_pcs
is a CPU tensor; my torch version (1.10.2) seems to be okay with such indexing. What torch version are you using?
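The mismatch can be reproduced in isolation. Here is a minimal sketch (the tensor shapes are made up for illustration; the real shapes come from the eval data):

```python
import torch

# gen_pcs / ref_pcs are built on CPU, while worse_ten comes from a
# CUDA computation. On recent PyTorch, indexing a CPU tensor with
# CUDA indices raises:
#   RuntimeError: indices should be either on cpu or on the same
#   device as the indexed tensor (cpu)
gen_pcs = torch.randn(400, 2048, 3)        # point clouds, on CPU
worse_ten = torch.tensor([266, 263, 265])  # would be .cuda() in the trainer

# Fix in the hot fix: move the indices to the data's device first.
picked = gen_pcs[worse_ten.cpu()]

# Alternative workaround from above: index after moving the data,
# e.g. gen_pcs.to(device)[worse_ten], or gen_pcs[worse_ten.cpu()].to(device).
assert picked.shape == (3, 2048, 3)
```

Older versions (e.g. 1.10.2) silently allowed the mixed-device indexing, which is why the bug only surfaced on newer installs.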
- I just pushed a hot fix that converts
worse_ten
to CPU; could you try again? - What do you mean by
eval less than 256
? Are you referring to the batch size or the size of the eval data?
-
Yes, it is working now. I am using PyTorch 2.0 (cu117). With 2 A6000 Ada GPUs it took me around 24 hours, equivalent to 12.5 km driven by an average ICE car, for a total of 3.11 kg of CO₂ (https://mlco2.github.io/impact/). I am happy to share the weights of the VAE (DM me) with other researchers if it can be helpful and reduce CO₂ emissions. Thank you so much.
-
np, I also fixed that part. Thanks for the clarification.
Hi @albertotono, is there any way you could share those weights you mentioned? Thank you so much in advance. If you could send them by email (michallatkos@gmail.com), I would be extremely grateful.