hmorimitsu/ptlflow

training is not working for craft, flowformer, gmflownet, gmflow

nihalgupta84 opened this issue · 16 comments

self._result = self.closure(*args, **kwargs)

File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
step_output = self._step_fn()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
return self.model.training_step(*args, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/base_model/base_model.py", line 229, in training_step
loss = self.loss_fn(preds, batch)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/gmflow/gmflow.py", line 40, in forward
flow_loss += i_weight * (valid[:, None] * i_loss).mean()
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 2
Epoch 0: 0%| | 0/30712 [00:08<?, ?it/s]

Can you please check the code?

Thank you for reporting; I'll check it later.

Just please note that the training stage has not been tested, so there's no guarantee that the trained models will produce good results in the end.

Best,

I'll try to check every stage and will push the results.

But you need to check the flow estimator part.

It would be much appreciated if you could add some documentation about training.

I have tried training with batch_size for craft; it worked.

The PTL Trainer's built-in auto batch size scaling (auto_scale_batch_size) is also not working.
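For reference, the sketch below shows roughly how I would expect to enable Lightning's batch-size finder; it assumes the LightningModule exposes a batch_size attribute, and I am not sure this matches how ptlflow's train.py is wired:

from pytorch_lightning import Trainer

# Lightning 1.x batch-size finder: the LightningModule (or its hparams)
# needs a `batch_size` attribute that the tuner can scale.
# `model` stands for the ptlflow model instance created in train.py.
trainer = Trainer(auto_scale_batch_size="power")
trainer.tune(model)  # runs the finder; call trainer.fit(model) afterwards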

It would be much appreciated if you could add some documentation about training.

There is documentation at https://ptlflow.readthedocs.io/en/latest/starting/training.html.

Is there anything specific that you think is missing?

I have pushed a fix for the losses in those models you mentioned.

I hope it is working now, but if not, let me know.
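For context, the error in the original traceback is the typical symptom of multiplying a valid mask and a flow error that do not broadcast, e.g. when a prediction is at a different spatial size than the ground truth and mask. The sketch below only illustrates that kind of multi-prediction loss; the names and shapes are assumptions, not the actual ptlflow code:

import torch

def sequence_flow_loss(flow_preds, flow_gt, valid, gamma=0.8):
    # flow_preds: list of (B, 2, H, W) predictions, coarse to fine
    # flow_gt:    (B, 2, H, W) ground-truth flow
    # valid:      (B, H, W) mask of valid ground-truth pixels
    flow_loss = 0.0
    n_preds = len(flow_preds)
    for i, pred in enumerate(flow_preds):
        i_weight = gamma ** (n_preds - i - 1)
        # The per-pixel error has shape (B, 2, H, W); valid[:, None] makes the
        # mask (B, 1, H, W) so it broadcasts over the flow channels.
        # If pred and flow_gt/valid live at different resolutions, they must be
        # resized to a common size first, otherwise the multiplication fails
        # with a size-mismatch error like the one reported above.
        i_loss = (pred - flow_gt).abs()
        flow_loss = flow_loss + i_weight * (valid[:, None] * i_loss).mean()
    return flow_loss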

Best,

While resuming the training, make_grid is throwing an error:

img_grid = self._make_image_grid(self.train_images)

File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero

And one more error occurs while resuming the vcn model, because of an optimizer state issue:

(ptlflow) anil@anil-gpu2:/media/anil/New Volume1/Nihal/ptlflow$ python3 train.py vcn --logger --enable_checkpointing --gpus 2 --log_every_n_steps 100 --enable_progress_bar True --max_steps 100000 --train_batch_size 1 --train_dataset chairs-train --val_dataset chairs-val --accelerator gpu --strategy ddp_sharded --resume_from_checkpoint "/media/anil/New Volume1/Nihal/ptlflow/ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_last_epoch=6_step=77812.ckpt"
05/04/2023 13:30:49 - INFO: Loading faiss with AVX2 support.
05/04/2023 13:30:49 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/04/2023 13:30:49 - INFO: Created a temporary directory at /tmp/tmps09nib89
05/04/2023 13:30:49 - INFO: Writing /tmp/tmps09nib89/_remote_module_non_scriptable.py
05/04/2023 13:31:09 - INFO: Loading 640 samples from FlyingChairs dataset.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
05/04/2023 13:31:12 - INFO: Loading faiss with AVX2 support.
05/04/2023 13:31:12 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/04/2023 13:31:12 - INFO: Created a temporary directory at /tmp/tmp_bzotlcs
05/04/2023 13:31:12 - INFO: Writing /tmp/tmp_bzotlcs/_remote_module_non_scriptable.py
05/04/2023 13:31:35 - INFO: Loading 640 samples from FlyingChairs dataset.
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
05/04/2023 13:31:35 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
05/04/2023 13:31:35 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
05/04/2023 13:31:35 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/04/2023 13:31:35 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Restoring states from the checkpoint path at /media/anil/New Volume1/Nihal/ptlflow/ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_last_epoch=6_step=77812.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

| Name | Type | Params

0 | loss_fn | VCNLoss | 0
1 | train_metrics | FlowMetrics | 0
2 | val_metrics | FlowMetrics | 0
3 | pspnet | pspnet | 1.8 M
4 | f6 | butterfly4D | 49.4 K
5 | p6 | sepConv4d | 4.6 K
6 | f5 | butterfly4D | 49.4 K
7 | p5 | sepConv4d | 4.6 K
8 | f4 | butterfly4D | 49.4 K
9 | p4 | sepConv4d | 4.6 K
10 | f3 | butterfly4D | 48.4 K
11 | p3 | sepConv4d | 4.6 K
12 | flow_reg64 | flow_reg | 0
13 | flow_reg32 | flow_reg | 0
14 | flow_reg16 | flow_reg | 0
15 | flow_reg8 | flow_reg | 0
16 | warp5 | WarpModule | 0
17 | warp4 | WarpModule | 0
18 | warp3 | WarpModule | 0
19 | warpx | WarpModule | 0
20 | dc6_conv1 | Sequential | 221 K
21 | dc6_conv2 | Sequential | 147 K
22 | dc6_conv3 | Sequential | 147 K
23 | dc6_conv4 | Sequential | 110 K
24 | dc6_conv5 | Sequential | 55.5 K
25 | dc6_conv6 | Sequential | 18.5 K
26 | dc6_conv7 | Conv2d | 9.2 K
27 | dc5_conv1 | Sequential | 295 K
28 | dc5_conv2 | Sequential | 147 K
29 | dc5_conv3 | Sequential | 147 K
30 | dc5_conv4 | Sequential | 110 K
31 | dc5_conv5 | Sequential | 55.5 K
32 | dc5_conv6 | Sequential | 18.5 K
33 | dc5_conv7 | Conv2d | 18.5 K
34 | dc4_conv1 | Sequential | 369 K
35 | dc4_conv2 | Sequential | 147 K
36 | dc4_conv3 | Sequential | 147 K
37 | dc4_conv4 | Sequential | 110 K
38 | dc4_conv5 | Sequential | 55.5 K
39 | dc4_conv6 | Sequential | 18.5 K
40 | dc4_conv7 | Conv2d | 27.7 K
41 | dc3_conv1 | Sequential | 369 K
42 | dc3_conv2 | Sequential | 147 K
43 | dc3_conv3 | Sequential | 147 K
44 | dc3_conv4 | Sequential | 110 K
45 | dc3_conv5 | Sequential | 55.5 K
46 | dc3_conv6 | Sequential | 18.5 K
47 | dc3_conv7 | Conv2d | 37.0 K
48 | dc6_convo | Sequential | 702 K
49 | dc5_convo | Sequential | 776 K
50 | dc4_convo | Sequential | 849 K
51 | dc3_convo | Sequential | 849 K
52 | f2 | butterfly4D | 27.9 K
53 | p2 | sepConv4d | 2.6 K
54 | flow_reg4 | flow_reg | 0
55 | warp2 | WarpModule | 0
56 | dc2_conv1 | Sequential | 424 K
57 | dc2_conv2 | Sequential | 147 K
58 | dc2_conv3 | Sequential | 147 K
59 | dc2_conv4 | Sequential | 110 K
60 | dc2_conv5 | Sequential | 55.5 K
61 | dc2_conv6 | Sequential | 18.5 K
62 | dc2_conv7 | Conv2d | 43.9 K
63 | dc2_convo | Sequential | 905 K

10.3 M Trainable params
0 Non-trainable params
10.3 M Total params
41.243 Total estimated model params size (MB)
Traceback (most recent call last):
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in
train(args)
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1233, in _run
self._checkpoint_connector.restore_training_state()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 204, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 306, in restore_optimizers_and_schedulers
raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'
Traceback (most recent call last):
File "train.py", line 153, in
train(args)
File "train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1233, in _run
self._checkpoint_connector.restore_training_state()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 204, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 306, in restore_optimizers_and_schedulers
raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'

Thank you. I'll take a look at make_grid later.

The resuming problem happens because your command is trying to resume from the "last" checkpoint, which does not contain the training states. To solve it, you should resume from the "train" checkpoint instead.

Hope it helps.
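For context, the two kinds of checkpoints correspond to Lightning's ModelCheckpoint callback and its save_weights_only flag. The sketch below only illustrates the difference; the callback names and filename patterns are my shorthand, not necessarily what ptlflow configures internally:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# "last"-style checkpoint: weights only, so it cannot restore the optimizer,
# schedulers, or loop counters when resuming.
last_ckpt = ModelCheckpoint(filename="model_last_{epoch}_{step}", save_weights_only=True)

# "train"-style checkpoint: full training state, suitable for
# --resume_from_checkpoint.
train_ckpt = ModelCheckpoint(filename="model_train_{epoch}_{step}", save_weights_only=False)

trainer = Trainer(callbacks=[last_ckpt, train_ckpt])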

Thanks for the quick reply.

I'm still facing the issue while resuming training for all models.

The error comes from the make_grid function, so I tried to add an exception to handle it, but then got more errors. Can you look into this?

(ptlflow) anil@anil-gpu2:/media/anil/New Volume1/Nihal/ptlflow$ python3 train.py vcn --logger --enable_checkpointing --gpus 2 --log_every_n_steps 1000 --enable_progress_bar True --max_steps 100000 --train_batch_size 2 --train_dataset chairs-train --val_dataset chairs-val --accelerator gpu --strategy ddp_sharded --resume_from_checkpoint "ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt"
05/05/2023 19:25:24 - INFO: Loading faiss with AVX2 support.
05/05/2023 19:25:24 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/05/2023 19:25:24 - INFO: Created a temporary directory at /tmp/tmpyl682fg1
05/05/2023 19:25:24 - INFO: Writing /tmp/tmpyl682fg1/_remote_module_non_scriptable.py
05/05/2023 19:25:43 - INFO: Loading 640 samples from FlyingChairs dataset.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
05/05/2023 19:25:46 - INFO: Loading faiss with AVX2 support.
05/05/2023 19:25:46 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/05/2023 19:25:46 - INFO: Created a temporary directory at /tmp/tmpwrz_g7s4
05/05/2023 19:25:46 - INFO: Writing /tmp/tmpwrz_g7s4/_remote_module_non_scriptable.py
05/05/2023 19:26:04 - INFO: Loading 640 samples from FlyingChairs dataset.
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
05/05/2023 19:26:04 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
05/05/2023 19:26:04 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/05/2023 19:26:04 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
05/05/2023 19:26:04 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Restoring states from the checkpoint path at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

| Name | Type | Params

0 | loss_fn | VCNLoss | 0
1 | train_metrics | FlowMetrics | 0
2 | val_metrics | FlowMetrics | 0
3 | pspnet | pspnet | 1.8 M
4 | f6 | butterfly4D | 49.4 K
5 | p6 | sepConv4d | 4.6 K
6 | f5 | butterfly4D | 49.4 K
7 | p5 | sepConv4d | 4.6 K
8 | f4 | butterfly4D | 49.4 K
9 | p4 | sepConv4d | 4.6 K
10 | f3 | butterfly4D | 48.4 K
11 | p3 | sepConv4d | 4.6 K
12 | flow_reg64 | flow_reg | 0
13 | flow_reg32 | flow_reg | 0
14 | flow_reg16 | flow_reg | 0
15 | flow_reg8 | flow_reg | 0
16 | warp5 | WarpModule | 0
17 | warp4 | WarpModule | 0
18 | warp3 | WarpModule | 0
19 | warpx | WarpModule | 0
20 | dc6_conv1 | Sequential | 221 K
21 | dc6_conv2 | Sequential | 147 K
22 | dc6_conv3 | Sequential | 147 K
23 | dc6_conv4 | Sequential | 110 K
24 | dc6_conv5 | Sequential | 55.5 K
25 | dc6_conv6 | Sequential | 18.5 K
26 | dc6_conv7 | Conv2d | 9.2 K
27 | dc5_conv1 | Sequential | 295 K
28 | dc5_conv2 | Sequential | 147 K
29 | dc5_conv3 | Sequential | 147 K
30 | dc5_conv4 | Sequential | 110 K
31 | dc5_conv5 | Sequential | 55.5 K
32 | dc5_conv6 | Sequential | 18.5 K
33 | dc5_conv7 | Conv2d | 18.5 K
34 | dc4_conv1 | Sequential | 369 K
35 | dc4_conv2 | Sequential | 147 K
36 | dc4_conv3 | Sequential | 147 K
37 | dc4_conv4 | Sequential | 110 K
38 | dc4_conv5 | Sequential | 55.5 K
39 | dc4_conv6 | Sequential | 18.5 K
40 | dc4_conv7 | Conv2d | 27.7 K
41 | dc3_conv1 | Sequential | 369 K
42 | dc3_conv2 | Sequential | 147 K
43 | dc3_conv3 | Sequential | 147 K
44 | dc3_conv4 | Sequential | 110 K
45 | dc3_conv5 | Sequential | 55.5 K
46 | dc3_conv6 | Sequential | 18.5 K
47 | dc3_conv7 | Conv2d | 37.0 K
48 | dc6_convo | Sequential | 702 K
49 | dc5_convo | Sequential | 776 K
50 | dc4_convo | Sequential | 849 K
51 | dc3_convo | Sequential | 849 K
52 | f2 | butterfly4D | 27.9 K
53 | p2 | sepConv4d | 2.6 K
54 | flow_reg4 | flow_reg | 0
55 | warp2 | WarpModule | 0
56 | dc2_conv1 | Sequential | 424 K
57 | dc2_conv2 | Sequential | 147 K
58 | dc2_conv3 | Sequential | 147 K
59 | dc2_conv4 | Sequential | 110 K
60 | dc2_conv5 | Sequential | 55.5 K
61 | dc2_conv6 | Sequential | 18.5 K
62 | dc2_conv7 | Conv2d | 43.9 K
63 | dc2_convo | Sequential | 905 K

10.3 M Trainable params
0 Non-trainable params
10.3 M Total params
41.243 Total estimated model params size (MB)
Restored all states from the checkpoint file at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:27:02 - INFO: Loading 640 samples from FlyingChairs dataset.
05/05/2023 19:27:03 - INFO: Loading 640 samples from FlyingChairs dataset.
Epoch 10: 95%|█████████▊| 5558/5878 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 153, in
train(args)
File "train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1637, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
img_grid = self._make_image_grid(self.train_images)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Epoch 10: 95%|█████████▊| 5558/5878 [00:27<?, ?it/s]
Traceback (most recent call last):
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in
train(args)
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1637, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
img_grid = self._make_image_grid(self.train_images)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero

Which versions of pytorch and pytorch-lightning are you using?

pytorch-lightning 1.6.0
torch 1.12.0
torch-scatter 2.1.1
torchmetrics 0.9.0
torchvision 0.13.0

Could you upgrade pytorch-lightning to version 1.7.7 and try to resume again?

As you can see from the error, it is trying to resume from the end of an epoch, instead of the beginning. If I remember correctly, this was related to the lightning version.

However, do not try to install the latest pytorch-lightning either, as I have not tested with newer versions yet.

I upgraded pytorch-lightning to 1.7.7; now training starts from the beginning of the epoch, but the error is still the same.

Epoch 10: 0%| | 0/5878 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 153, in
train(args)
File "train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
img_grid = self._make_image_grid(self.train_images)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in
train(args)
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
img_grid = self._make_image_grid(self.train_images)
File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Epoch 10: 0%| | 0/5878 [00:21<?, ?it/s]

I pushed a fix in #49 to check if the outputs are empty; it should solve this problem.

Please pull the new version and try again.
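For reference, the ZeroDivisionError happens when the logger callback has not collected any images yet (which can happen right after resuming), so len(dl_images) is zero at on_train_epoch_end. The guard below is only a sketch of the idea behind the fix, not the actual patch in #49:

from torchvision.utils import make_grid

def safe_make_image_grid(imgs, dl_images):
    # When resuming, the image buffers can still be empty at the first
    # on_train_epoch_end, making the division by len(dl_images) fail.
    if len(imgs) == 0 or len(dl_images) == 0:
        return None
    nrow = max(1, len(imgs) // len(dl_images))
    return make_grid(imgs, nrow=nrow)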

Traceback (most recent call last):
File "train.py", line 167, in
train(args)
File "train.py", line 113, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 289, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/fairscale/optim/oss.py", line 232, in step
loss = self.optim.step(closure=closure, **kwargs) # type: ignore
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
adamw(params_with_grad,
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
func(params,
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
Traceback (most recent call last):
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 167, in
train(args)
File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 113, in train
trainer.fit(model)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 289, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/fairscale/optim/oss.py", line 232, in step
loss = self.optim.step(closure=closure, **kwargs) # type: ignore
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
adamw(params_with_grad,
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
func(params,
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
Epoch 11: 0%| | 0/5878 [00:23<?, ?it/s, loss=nan, v_num=4]

This error originates only when we load checkpoints to resume training or for fine-tuning.

It seems that this is a problem with pytorch 1.12.0; the solution appears to be to upgrade to 1.12.1. See more here:

pytorch/pytorch#80809
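For reference, that issue describes AdamW's step counters being restored as CUDA tensors, which trips the capturable=False assert in torch 1.12.0. Upgrading to 1.12.1 is the proper fix; if you have to stay on 1.12.0, the workaround commonly reported in that thread is to flip the capturable flag on the restored optimizer, roughly as below (this is not something ptlflow does for you):

import torch

# Stand-in model/optimizer; in practice `optimizer` is the AdamW instance
# whose state was just restored from the checkpoint.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Workaround for torch 1.12.0 only; prefer upgrading to torch >= 1.12.1.
for group in optimizer.param_groups:
    group["capturable"] = True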