nv-tlabs/LION

Issue during training VAE

Closed this issue · 4 comments

@ZENGXH Thanks again for the amazing work and for the quick replies while maintaining this repo.

While training the VAE I ran into the following issue during the evaluation step:

2023-03-16 01:28:50.786 | ERROR    | utils.utils:init_processes:1156 - An error has been caught in function 'init_processes', process 'Process-1' (38392), thread 'MainThread' (140360210575360):
Traceback (most recent call last):

  File "/home/alberto/Documents/LION/train_dist.py", line 239, in <module>
    p.start()
    │ └ <function BaseProcess.start at 0x7fa8273c2170>
    └ <Process name='Process-1' parent=38345 started>

  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
    │    │        │    │      └ <Process name='Process-1' parent=38345 started>
    │    │        │    └ <staticmethod(<function Process._Popen at 0x7fa8271f12d0>)>
    │    │        └ <Process name='Process-1' parent=38345 started>
    │    └ None
    └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           │                │                            └ <Process name='Process-1' parent=38345 started>
           │                └ <function DefaultContext.get_context at 0x7fa8271f1480>
           └ <multiprocessing.context.DefaultContext object at 0x7fa8273e68c0>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
           │     └ <Process name='Process-1' parent=38345 started>
           └ <class 'multiprocessing.popen_fork.Popen'>
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
    │    │       └ <Process name='Process-1' parent=38345 started>
    │    └ <function Popen._launch at 0x7fa67ea62950>
    └ <multiprocessing.popen_fork.Popen object at 0x7fa67eb5a110>
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
           │           │                          └ 7
           │           └ <function BaseProcess._bootstrap at 0x7fa8273c2a70>
           └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7fa8273c20e0>
    └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <Process name='Process-1' parent=38345 started>
    │    │        │    └ (0, 2, <function main at 0x7fa67ea62560>, Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar1...
    │    │        └ <Process name='Process-1' parent=38345 started>
    │    └ <function init_processes at 0x7fa67ea617e0>
    └ <Process name='Process-1' parent=38345 started>

> File "/home/alberto/Documents/LION/utils/utils.py", line 1156, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar10', data='/tmp/nvae-diff/data', autocast_t...
    └ <function main at 0x7fa67ea62560>

  File "/home/alberto/Documents/LION/train_dist.py", line 84, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7fa7d4598040>
    └ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>

  File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 285, in train_epochs
    eval_score = self.eval_nll(step=step, save_file=False)
                 │    │             └ 7599
                 │    └ <function BaseTrainer.eval_nll at 0x7fa7d4598700>
                 └ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>

  File "/home/alberto/Documents/LION/my_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {'step': 7599, 'save_file': False}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>,)
           └ <function BaseTrainer.eval_nll at 0x7fa7d4598670>

  File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 805, in eval_nll
    results = compute_NLL_metric(
              └ <function compute_NLL_metric at 0x7fa7e5c50af0>

  File "/home/alberto/Documents/LION/utils/eval_helper.py", line 59, in compute_NLL_metric
    pair_vis(gen_pcs[worse_ten], ref_pcs[worse_ten],
    │        │       │           │       └ tensor([266, 263, 265,  51, 122,  91, 323, 298, 101, 319], device='cuda:0')
    │        │       │           └ tensor([[[-3.3173e-02,  4.3725e-02, -7.8650e-02],
    │        │       │                      [-3.2106e-02, -7.7591e-02, -2.6546e-02],
    │        │       │                      [-7.3885e-03, -6...
    │        │       └ tensor([266, 263, 265,  51, 122,  91, 323, 298, 101, 319], device='cuda:0')
    │        └ tensor([[[-0.0330,  0.0436, -0.0783],
    │                   [-0.0322, -0.0777, -0.0267],
    │                   [-0.0076, -0.0620,  0.0369],
    │                   .....
    └ <function pair_vis at 0x7fa7e5c50940>

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
2023-03-16 01:28:50.928 | INFO     | __main__:<module>:243 - join 1

But for some reason it didn't stop the training; the terminal just hung there.

I changed line 59 in eval_helper.py to

pair_vis(gen_pcs[worse_ten].to(device), ref_pcs[worse_ten].to(device),
                 titles, subtitles, writer, step=step)

Let's see if this works. Also, how can I change the eval to be smaller than 256, or bigger?
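For context, the error is easy to reproduce in isolation: in recent PyTorch versions, indexing a CPU tensor with a CUDA index tensor raises exactly this RuntimeError. A minimal sketch (the names and shapes below are illustrative, not the actual LION code):

import torch

gen_pcs = torch.randn(384, 2048, 3)                        # point clouds kept on CPU
worse_ten = torch.tensor([266, 263, 265], device="cuda")   # indices computed on GPU

# gen_pcs[worse_ten]  # RuntimeError: indices should be either on cpu or on the
#                     # same device as the indexed tensor (cpu)

subset = gen_pcs[worse_ten.cpu()]                  # fix 1: move the indices to CPU
subset = gen_pcs.to(worse_ten.device)[worse_ten]   # fix 2: move the data to the GPU first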

ZENGXH commented

It seems worse_ten is a CUDA tensor and gen_pcs is a CPU tensor; my torch version (1.10.2) seems to be OK with such indexing. What torch version are you using?

  • I just pushed a hotfix that converts worse_ten to CPU (see the sketch after this list); could you try again?
  • What do you mean by eval less than 256? Are you referring to the batch size or the size of the eval data?
  • Yes, it is working now. I am using PyTorch 2.0 cu117. (With 2 A6000 Ada GPUs it took around 24 hours, roughly the equivalent of 12.5 km driven by an average ICE car, for a total of 3.11 kg of CO2 according to https://mlco2.github.io/impact/.) I am happy to share the VAE weights with other researchers (DM me) if it can be helpful and reduce CO2 emissions. Thank you so much.

  • np, I also fixed that part. Thanks for the clarification.
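For reference, the hotfix amounts to moving the index tensor to CPU before the visualization call; roughly (paraphrasing the code around line 59 of utils/eval_helper.py, so the surrounding details may differ):

worse_ten = worse_ten.cpu()   # indices now live on the same device as gen_pcs / ref_pcs
pair_vis(gen_pcs[worse_ten], ref_pcs[worse_ten],
         titles, subtitles, writer, step=step)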

Latkos commented

Hi @albertotono, is there any way you could share those weights you mentioned? If you could send them by email (michallatkos@gmail.com), I would be extremely grateful. Thank you so much in advance.