Training crashes at the same spot for both Shepard Metzler datasets
Some context:
- I downloaded and converted the datasets via `data.sh` and set the batch size to 12. Note that I am using TensorFlow 1.14 for reading the tfrecord files and converting them.
- I use `gpu.sh` to run the training script. I set the batch size to one of [1, 12, 36, 72] and `DataParallel` to `True` to use 4 GPUs.
But after a short time I get the following errors with any batch size higher than 1. The crash happens at iterations 40, 13, and 6 with batch sizes 12, 36, and 72, respectively, and it happens for both Shepard Metzler datasets.
Why am I getting these errors?
Does batch size 1 in the training code mean reading one of the `.pt.gz` files? If so, setting the batch size to 1 in the training script would actually mean an effective batch of 12 scenes. Is that correct?
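To make that question concrete, this is the kind of inspection snippet I use to check what a single converted file contains (the path is only a placeholder, and I am assuming the same gzip + `torch.load` convention that `shepardmetzler.py` uses):

```python
import gzip
import torch

# Placeholder path to one converted shard; adjust to your data directory.
path = "train/1.pt.gz"

# Assumption: each .pt.gz holds the scenes of one conversion batch,
# serialized with torch.save onto a gzip file object.
with gzip.open(path, "rb") as f:
    data = torch.load(f)

# Show how many scenes ended up in this shard (e.g. 12, or fewer in the
# last shard) and what a single scene looks like.
if isinstance(data, (list, tuple)):
    print("scenes in this file:", len(data))
    print("first scene:", type(data[0]))
else:
    print("loaded object:", type(data))
```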
Here's what I get for the 5-parts dataset when I set the batch size to 36, for instance:
Epoch [1/200]: [13/1856] 1%|▊ , elbo=-2.1e+4, kl=827, mu=5e-6, sigma=2 [00:21<52:34]Current run is terminating due to exception: Caught RuntimeError in DataLoader worker process 13.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 12 and 8 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
.
Engine run is terminating due to exception: Caught RuntimeError in DataLoader worker process 13.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 12 and 8 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
.
Traceback (most recent call last):
File "../run-gqn.py", line 183, in <module>
trainer.run(train_loader, args.n_epochs)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 850, in run
return self._internal_run()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 952, in _internal_run
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 714, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "../run-gqn.py", line 181, in handle_exception
else: raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 937, in _internal_run
hours, mins, secs = self._run_once_on_dataset()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 705, in _run_once_on_dataset
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 714, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "../run-gqn.py", line 181, in handle_exception
else: raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 655, in _run_once_on_dataset
batch = next(self._dataloader_iter)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 801, in __next__
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 385, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 13.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 12 and 8 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
@wohlert I'm afraid this problem is partly related to the versions of the packages I'm using, because I cannot run the code anymore after installing different versions of some packages (and of CUDA). Could you please tell me which versions of the following packages you used when you wrote this code?
- CUDA
- pytorch
- pytorch-ignite
- torchvision
- tensorflow
- tensorboardX
Thank you
EDIT: I ran the code again with CUDA 10.1, PyTorch 1.2.0, torchvision 0.4.0, pytorch-ignite 0.3.0, and tensorboardX 1.9 and got the same errors as with CUDA 10.0, PyTorch 1.3.0 (also 1.1.0), torchvision 0.4.1 (also 0.3.0), pytorch-ignite 0.2.0, and tensorboardX 1.9. As you can see from the errors, this is not related to the batch size problem I had before. I also tried the code with much older versions of the packages, released sometime around December 2018 (still with CUDA 10.0), but got some other errors, shown below. Do you have any guess as to what might be causing them? (A quick sanity check of the converted files is sketched after the traceback.)
Current run is terminating due to exception: Caught UnpicklingError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/om2/vast//arsalans/unsupervised-localization/shepardmetzler.py", line 50, in __getitem__
data = torch.load(f)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 386, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 563, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
.
Engine run is terminating due to exception: Caught UnpicklingError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/om2/vast/arsalans/unsupervised-localization/shepardmetzler.py", line 50, in __getitem__
data = torch.load(f)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 386, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 563, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
.
Traceback (most recent call last):
File "../run-gqn.py", line 183, in <module>
trainer.run(train_loader, args.n_epochs)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 850, in run
return self._internal_run()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 952, in _internal_run
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 714, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "../run-gqn.py", line 181, in handle_exception
else: raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 937, in _internal_run
hours, mins, secs = self._run_once_on_dataset()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 705, in _run_once_on_dataset
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 714, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 607, in _fire_event
func(self, *(event_args + args), **kwargs)
File "../run-gqn.py", line 181, in handle_exception
else: raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 655, in _run_once_on_dataset
batch = next(self._dataloader_iter)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 819, in __next__
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
_pickle.UnpicklingError: Caught UnpicklingError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/om2/vast/arsalans/unsupervised-localization/shepardmetzler.py", line 50, in __getitem__
data = torch.load(f)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 386, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 563, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
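To rule out corrupted or partially converted shards independently of the DataLoader, a loop like the following can try to load every converted file directly. This is just a sanity-check sketch, again assuming the same gzip + `torch.load` layout as `shepardmetzler.py`; the glob pattern is a placeholder:

```python
import glob
import gzip
import torch

# Placeholder pattern for the converted training shards; adjust as needed.
for path in sorted(glob.glob("train/*.pt.gz")):
    try:
        with gzip.open(path, "rb") as f:
            torch.load(f)
    except Exception as e:
        # A corrupted or partially converted shard fails here, e.g. with the
        # UnpicklingError shown above.
        print(f"{path}: {type(e).__name__}: {e}")
```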
Can you see whether the dependencies specified in the environment file will work?
https://github.com/wohlert/generative-query-network-pytorch/blob/master/environment.yml
@wohlert Thanks for the pointer. I can try those dependencies as well, but they're a bit too old. Do you know if it would be possible to use newer versions of CUDA and PyTorch? I basically want to use newer NVIDIA GPUs, and the minimum CUDA version they require is 10.0. It would be great if you know of a more recent set of the core dependencies that people have used successfully to run your code.
Also, is there a specific reason for using Python 3.5? Do you know whether people have run your code with Python 3.6 without any issues?
It turns out the last set of error messages was caused by a mistake I made during data conversion, so the data had not been converted properly. The main issue I posted originally was resolved by setting the batch size to 64 during data conversion and then to 1 in the Python training code.
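My understanding of why this works (not verified against the conversion script): `default_collate` can only `torch.stack` samples whose per-file scene counts match, so a final shard holding fewer scenes (8 instead of 12 in the traceback above) breaks any DataLoader batch size greater than 1. With 64 scenes per converted file and a DataLoader batch size of 1, each training step still sees 64 scenes and no stacking across files is needed. A minimal sketch of that setup, assuming the dataset class in `shepardmetzler.py` is importable as `ShepardMetzler` and using placeholder arguments:

```python
from torch.utils.data import DataLoader
from shepardmetzler import ShepardMetzler

# Assumption: each converted .pt.gz now holds 64 scenes (conversion batch
# size 64); the constructor argument below is a placeholder for the data root.
dataset = ShepardMetzler("train")
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=4)

for batch in loader:
    # batch_size=1 means one shard per step, so the leading dimension is 1
    # and the effective batch is the 64 scenes stored inside that shard.
    pass
```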