drprojects/superpoint_transformer

When I trained the 11G S3DIS config, there was an error

Closed · 6 comments

Traceback (most recent call last):
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/utils/utils.py", line 45, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "src/train.py", line 115, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1028, in _run_stage
    self._run_sanity_check()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1057, in _run_sanity_check
    val_loop.run()
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 370, in _evaluation_step
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 277, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 359, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("on_after_batch_transfer", batch, dataloader_idx)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 347, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 181, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/datamodules/base.py", line 333, in on_after_batch_transfer
    return on_device_transform(nag)
  File "/home/wcj/anaconda3/envs/spt7/lib/python3.8/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/transforms/transforms.py", line 23, in __call__
    return self._process(x)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/transforms/graph.py", line 1359, in _process
    nag[i_level].node_size = nag.get_sub_size(i_level, low=self.low)
  File "/media/wcj/A4D4C4CFD4C4A4C01/zl/superpoint_transformer-master/src/data/nag.py", line 58, in get_sub_size
    sub_sizes = self[low + 1].sub.sizes
AttributeError: 'list' object has no attribute 'sizes'

I printed the related attributes in nag.py like this:

print(self[low+1])
print(self[low + 1].sub)
print(type(self[low + 1].sub))
log_size=[19731, 1], log_surface=[19731, 1], log_volume=[19731, 1], normal=[19731, 3], super_index=[19731],
sub=[1], batch=[19731], ptr=[2])
[Cluster(num_clusters=19731, num_points=660524, device=cuda:0)]
<class 'list'>
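
For clarity, this is what the traceback boils down to: the transform expects sub to be a single Cluster-like object exposing .sizes, but it receives a plain Python list wrapping one. A minimal, standalone illustration (the Cluster stub below is hypothetical, not the class from src/data):

class Cluster:
    # Stand-in for src.data.Cluster, illustration only
    @property
    def sizes(self):
        return [1, 2, 3]

sub = [Cluster()]       # what self[low + 1].sub actually holds: a list
print(sub[0].sizes)     # the wrapped Cluster does expose .sizes
try:
    sub.sizes           # what nag.get_sub_size() effectively attempts
except AttributeError as e:
    print(e)            # 'list' object has no attribute 'sizes'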

Thanks for all the help you provide.

It seems self[low + 1].sub is a List(Cluster) instead of simply being a Cluster. This is the first time I have seen this issue; I am not sure how it appeared yet. Have you made any modifications to the code, even minor ones? Can you please share the exact bash command you are running?

If you ❤️ or use this project, don't forget to give it a ⭐, it means a lot to us!

To make sure I had not made any modifications, I re-extracted the ZIP and only put the dataset in. When running, I often encountered a pipeline error because of insufficient memory, but after re-running I successfully generated the S3DIS data, and then hit the above error.
By the way, I noticed a warning during preprocessing:
[screenshot of the processing warning]
Is there any problem with my setup?

I have the same problem when training on the DALES dataset, without any modification to the code. The only difference is that I ran in a Python venv instead of a conda environment (but I don't think that really matters). Here are the logs I got:

output.log

and when I print self[low + 1].sub, the output is: [Cluster(num_clusters=42880, num_points=1324840, device=cuda:0), Cluster(num_clusters=35140, num_points=1080737, device=cuda:0), Cluster(num_clusters=30092, num_points=959261, device=cuda:0), Cluster(num_clusters=36576, num_points=1147082, device=cuda:0)]

I guess the issue lies in the process of packing data into batches. In my case, batch_size was 4, which resulted in a batch of 4 Clusters held in a List; the whole List was then pushed through the transform pipeline instead of a single collated Cluster, which may be the cause of this problem.
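
This matches PyG's default collation behavior: an attribute whose type the collate function does not know how to merge is simply gathered into a Python list, one entry per item in the batch. A minimal, standalone sketch (the Cluster stub is hypothetical; only the torch_geometric calls are real):

import torch
from torch_geometric.data import Data, Batch

class Cluster:
    # Hypothetical stand-in for the project's Cluster class
    def __init__(self, n):
        self.n = n

d1 = Data(x=torch.rand(3, 4), sub=Cluster(3))
d2 = Data(x=torch.rand(2, 4), sub=Cluster(2))

batch = Batch.from_data_list([d1, d2])
print(type(batch.sub))  # <class 'list'>: PyG leaves unknown types as a list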

I have solved this issue by editing line 933 of src/data/data.py: deleting "and isinstance(batch.sub, Cluster)" from the condition works. After checking the previous version of this file, I found that isinstance(batch.sub, Cluster) always returns False at that point, so batch.sub stays stuck as a List instead of being converted to a ClusterBatch.

[screenshot of the edited line in src/data/data.py]
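
For reference, here is the logic of that fix in a self-contained form (class names and surrounding code are illustrative; the real context at line 933 of src/data/data.py may differ):

class Cluster:
    # Hypothetical stand-in for the project's Cluster class
    def __init__(self, n):
        self.n = n

class ClusterBatch(Cluster):
    @classmethod
    def from_list(cls, clusters):
        # Hypothetical collation: merge a list of Cluster into one batch
        return cls(sum(c.n for c in clusters))

sub = [Cluster(3), Cluster(2)]   # what Batch.from_data_list() produces

# Buggy guard: sub is a list here, never a Cluster, so the collation
# branch is dead code and sub stays a plain list.
if isinstance(sub, Cluster):
    sub = ClusterBatch.from_list(sub)
print(type(sub).__name__)        # list -> later AttributeError on .sizes

# With the isinstance check removed, the list is collated as intended.
sub = [Cluster(3), Cluster(2)]
sub = ClusterBatch.from_list(sub)
print(type(sub).__name__)        # ClusterBatch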

It works! Thanks!

Good catch @va-kiet! There was indeed an error there: by default, PyG's Batch.from_data_list() returns the sub attribute as a List(Cluster). Your fix was the correct one; I integrated it in the latest commit.