Error finding on the fly horizontal features in custom dataset training
Closed this issue · 9 comments
We're training on a large custom dataset (~70GB, 2493270064 points, 80/10/10 train/test/val split). We've got some preprocessing in place upstream of superpoint-transformer -- primarily, using lastile and lasground_new to pre-tile the set so that each tile contains between 74934 and 6674441 points and is flattened, and rejecting LAS files that are too small or otherwise malformed.
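The rejection step is roughly the following (an illustrative sketch rather than our exact script; it assumes laspy and only checks that a file opens and that its header reports enough points):

# Illustrative sketch of the LAS file rejection pass, assuming laspy >= 2.x
from pathlib import Path
import laspy

MIN_POINTS = 74_934  # smallest tile we keep

def usable_las_files(tile_dir):
    """Yield LAS/LAZ paths that open cleanly and meet the minimum point count."""
    for path in sorted(Path(tile_dir).glob("*.la[sz]")):
        try:
            with laspy.open(path) as f:  # header-only read, no point data loaded
                n_points = f.header.point_count
        except Exception:
            continue  # malformed or unreadable file: reject
        if n_points >= MIN_POINTS:
            yield path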
Custom dataset yaml for datamodule and experiment as follows:
datamodule:
defaults:
- semantic/default.yaml
_target_: src.datamodules.custom_dataset.CUSTOM_DATASET_DataModule
# These parameters are not actually used by the DataModule, but are used
# here to facilitate model parameterization with config interpolation
num_classes: 7
trainval: False
val_on_test: True
xy_tiling: 3
load_full_res_idx: True
# Features that will be computed, saved, loaded for points and segments
error_folder_path: ""
original_ground_index: 1
ID2TRAINID: [7, 0, 1, 1, 1, 0, 0, 6, 2, 7, 1, 7, 7, 7, 7, 7, 3, 7, 7, 7, 4, 7, 5]
log_every_n_steps: 20
max_intensity: 5000
# point features used for the partition
partition_hf:
- 'linearity'
- 'planarity'
- 'scattering'
- 'elevation'
# point features used for training
point_hf:
- 'intensity'
- 'linearity'
- 'planarity'
- 'scattering'
- 'verticality'
- 'elevation'
# segment-wise features computed at preprocessing
segment_base_hf: []
# segment features computed as the mean of point feature in each
# segment, saved with "mean_" prefix
segment_mean_hf: []
# segment features computed as the std of point feature in each segment,
# saved with "std_" prefix
segment_std_hf: []
# horizontal edge features used for training
edge_hf:
- 'mean_off'
- 'std_off'
- 'mean_dist'
- 'angle_source'
- 'angle_target'
- 'centroid_dir'
- 'centroid_dist'
- 'normal_angle'
- 'log_length'
- 'log_surface'
- 'log_volume'
- 'log_size'
v_edge_hf: [] # vertical edge features used for training
# Parameters declared here to facilitate tuning configs without copying
# all the pre_transforms
voxel: 0.1
knn: 25
knn_r: 10
knn_step: -1
knn_min_search: 10
ground_threshold: 5
ground_scale: 20
pcp_regularization: [0.1, 0.2, 0.3]
pcp_spatial_weight: [1e-1, 1e-2, 1e-3]
pcp_cutoff: [10, 30, 100]
pcp_k_adjacency: 10
pcp_w_adjacency: 1
pcp_iterations: 15
graph_k_min: 1
graph_k_max: 30
graph_gap: [5, 30, 30]
graph_se_ratio: 0.3
graph_se_min: 20
graph_cycles: 3
graph_margin: 0.5
graph_chunk: [1e6, 1e5, 1e5] # reduce if CUDA memory errors
# Batch construction parameterization
sample_segment_ratio: 0.2
sample_segment_by_size: True
sample_segment_by_class: False
sample_point_min: 32
sample_point_max: 128
sample_graph_r: 50 # set to r<=0 to skip SampleRadiusSubgraphs
sample_graph_k: 4
sample_graph_disjoint: True
sample_edge_n_min: -1 # [5, 5, 15]
sample_edge_n_max: -1 # [10, 15, 25]
# Augmentations parameterization
pos_jitter: 0.05
tilt_n_rotate_phi: 0.1
tilt_n_rotate_theta: 180
anisotropic_scaling: 0.2
node_feat_jitter: 0
h_edge_feat_jitter: 0
v_edge_feat_jitter: 0
node_feat_drop: 0
h_edge_feat_drop: 0.3
v_edge_feat_drop: 0
node_row_drop: 0
h_edge_row_drop: 0
v_edge_row_drop: 0
drop_to_mean: False
# Preprocessing
pre_transform:
- transform: SaveNodeIndex
params:
key: 'sub'
- transform: DataTo
params:
device: 'cuda'
- transform: GridSampling3D
params:
size: ${datamodule.voxel}
hist_key: 'y'
hist_size: ${eval:'${datamodule.num_classes} + 1'}
- transform: KNN
params:
k: ${datamodule.knn}
r_max: ${datamodule.knn_r}
verbose: False
- transform: DataTo
params:
device: 'cpu'
- transform: GroundElevation
params:
threshold: ${datamodule.ground_threshold}
scale: ${datamodule.ground_scale}
- transform: PointFeatures
params:
keys: ${datamodule.point_hf_preprocess}
k_min: 1
k_step: ${datamodule.knn_step}
k_min_search: ${datamodule.knn_min_search}
- transform: DataTo
params:
device: 'cuda'
- transform: AdjacencyGraph
params:
k: ${datamodule.pcp_k_adjacency}
w: ${datamodule.pcp_w_adjacency}
- transform: ConnectIsolated
params:
k: 1
- transform: DataTo
params:
device: 'cpu'
- transform: AddKeysTo # move some features to 'x' to be used for partition
params:
keys: ${datamodule.partition_hf}
to: 'x'
delete_after: False
- transform: CutPursuitPartition
params:
regularization: ${datamodule.pcp_regularization}
spatial_weight: ${datamodule.pcp_spatial_weight}
k_adjacency: ${datamodule.pcp_k_adjacency}
cutoff: ${datamodule.pcp_cutoff}
iterations: ${datamodule.pcp_iterations}
parallel: True
verbose: False
num_classes: ${datamodule.num_classes}
- transform: NAGRemoveKeys # remove 'x' used for partition (features are still preserved under their respective Data attributes)
params:
level: 'all'
keys: 'x'
- transform: NAGTo
params:
device: 'cuda'
- transform: SegmentFeatures
params:
n_min: 32
n_max: 128
keys: ${datamodule.segment_base_hf_preprocess}
mean_keys: ${datamodule.segment_mean_hf_preprocess}
std_keys: ${datamodule.segment_std_hf_preprocess}
strict: False # will not raise error if a mean or std key is missing
- transform: RadiusHorizontalGraph
params:
k_min: ${datamodule.graph_k_min}
k_max: ${datamodule.graph_k_max}
gap: ${datamodule.graph_gap}
se_ratio: ${datamodule.graph_se_ratio}
se_min: ${datamodule.graph_se_min}
cycles: ${datamodule.graph_cycles}
margin: ${datamodule.graph_margin}
chunk_size: ${datamodule.graph_chunk}
halfspace_filter: True
bbox_filter: True
target_pc_flip: True
source_pc_sort: False
keys: ['mean_off', 'std_off', 'mean_dist' ]
- transform: NAGTo
params:
device: 'cpu'
# CPU-based train transforms
train_transform: null
# CPU-based val transforms
val_transform: ${datamodule.train_transform}
# CPU-based test transforms
test_transform: ${datamodule.val_transform}
# GPU-based train transforms
on_device_train_transform:
# Apply sampling transforms first to reduce the number of nodes and
# edges. These operations are compute-intensive and are the reason
# why these transforms are not performed on CPU
- transform: SampleSubNodes
params:
low: 0
high: 1
n_min: ${datamodule.sample_point_min}
n_max: ${datamodule.sample_point_max}
- transform: SampleRadiusSubgraphs
params:
r: ${datamodule.sample_graph_r}
k: ${datamodule.sample_graph_k}
i_level: 1
by_size: False
by_class: False
disjoint: ${datamodule.sample_graph_disjoint}
- transform: SampleSegments
params:
ratio: ${datamodule.sample_segment_ratio}
by_size: ${datamodule.sample_segment_by_size}
by_class: ${datamodule.sample_segment_by_class}
- transform: NAGRestrictSize
params:
level: '1+'
num_nodes: ${datamodule.max_num_nodes}
# Cast all attributes to either float or long. Doing this only now
# allows speeding up disk I/O and CPU->GPU transfer
- transform: NAGCast
# Apply geometric transforms affecting position, offsets, normals
# before calling transforms relying on those, such as on-the-fly
# edge features computation
- transform: NAGJitterKey
params:
key: 'pos'
sigma: ${datamodule.pos_jitter}
trunc: ${datamodule.voxel}
- transform: RandomTiltAndRotate
params:
phi: ${datamodule.tilt_n_rotate_phi}
theta: ${datamodule.tilt_n_rotate_theta}
- transform: RandomAnisotropicScale
params:
delta: ${datamodule.anisotropic_scaling}
- transform: RandomAxisFlip
params:
p: 0.5
# Compute some horizontal and vertical edges on-the-fly. Those are
# only computed now since they can be deduced from point and node
# attributes. Besides, the OnTheFlyHorizontalEdgeFeatures transform
# takes a trimmed graph as input and doubles its size, creating j->i
# for each input i->j edge
- transform: OnTheFlyHorizontalEdgeFeatures
params:
keys: ${datamodule.edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
- transform: OnTheFlyVerticalEdgeFeatures
params:
keys: ${datamodule.v_edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
# Edge sampling is only performed after the horizontal graph is
# untrimmed by OnTheFlyHorizontalEdgeFeatures
- transform: SampleEdges
params:
level: '1+'
n_min: ${datamodule.sample_edge_n_min}
n_max: ${datamodule.sample_edge_n_max}
- transform: NAGRestrictSize
params:
level: '1+'
num_edges: ${datamodule.max_num_edges}
# Move all point and segment features to 'x'
- transform: NAGAddKeysTo
params:
level: 0
keys: ${eval:'ListConfig(${datamodule.point_hf})'}
to: 'x'
- transform: NAGAddKeysTo
params:
level: '1+'
keys: ${eval:'ListConfig(${datamodule.segment_hf})'}
to: 'x'
# Add some noise and randomly some point, node and edge features
- transform: NAGJitterKey
params:
key: 'x'
sigma: ${datamodule.node_feat_jitter}
trunc: ${eval:'2 * ${datamodule.node_feat_jitter}'}
- transform: NAGJitterKey
params:
key: 'edge_attr'
sigma: ${datamodule.h_edge_feat_jitter}
trunc: ${eval:'2 * ${datamodule.h_edge_feat_jitter}'}
- transform: NAGJitterKey
params:
key: 'v_edge_attr'
sigma: ${datamodule.v_edge_feat_jitter}
trunc: ${eval:'2 * ${datamodule.v_edge_feat_jitter}'}
- transform: NAGDropoutColumns
params:
p: ${datamodule.node_feat_drop}
key: 'x'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutColumns
params:
p: ${datamodule.h_edge_feat_drop}
key: 'edge_attr'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutColumns
params:
p: ${datamodule.v_edge_feat_drop}
key: 'v_edge_attr'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.node_row_drop}
key: 'x'
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.h_edge_row_drop}
key: 'edge_attr'
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.v_edge_row_drop}
key: 'v_edge_attr'
to_mean: ${datamodule.drop_to_mean}
# Add self-loops in the horizontal graph
- transform: NAGAddSelfLoops
# Add a `node_size` attribute to all segments, this is needed for
# segment-wise position normalization with UnitSphereNorm
- transform: NodeSize
# GPU-based val transforms
on_device_val_transform:
# # According to @drprojects, removing this transform will allow for exact point inferencing
# # Apply sampling transforms first to reduce the number of nodes and
# # edges. These operations are compute-intensive and are the reason
# # why these transforms are not performed on CPU
# - transform: SampleSubNodes
# params:
# low: 0
# high: 1
# n_min: 128
# n_max: 256
# Cast all attributes to either float or long. Doing this only now
# allows speeding up disk I/O and CPU->GPU transfer
- transform: NAGCast
# Compute some horizontal and vertical edges on-the-fly. Those are
# only computed now since they can be deduced from point and node
# attributes. Besides, the OnTheFlyHorizontalEdgeFeatures transform
# takes a trimmed graph as input and doubles its size, creating j->i
# for each input i->j edge
- transform: OnTheFlyHorizontalEdgeFeatures
params:
keys: ${datamodule.edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
- transform: OnTheFlyVerticalEdgeFeatures
params:
keys: ${datamodule.v_edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
# Move all point and segment features to 'x'
- transform: NAGAddKeysTo
params:
level: 0
keys: ${eval:'ListConfig(${datamodule.point_hf})'}
to: 'x'
- transform: NAGAddKeysTo
params:
level: '1+'
keys: ${eval:'ListConfig(${datamodule.segment_hf})'}
to: 'x'
# Add self-loops in the horizontal graph
- transform: NAGAddSelfLoops
# Add a `node_size` attribute to all segments, this is needed for
# segment-wise position normalization with UnitSphereNorm
- transform: NodeSize
# GPU-based test transforms
on_device_test_transform: ${datamodule.on_device_val_transform}
experiment:
# @package _global_
# to execute this experiment run:
# python train.py experiment=dales
defaults:
- override /datamodule: custom_dataset.yaml
- override /model: semantic/spt-2.yaml
# - override /model: nano-2.yaml # When inferencing OR training a nano-2 model, this must be set when both training and inferencing
- override /trainer: gpu.yaml
# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters
datamodule:
# xy_tiling: [2, 2] # split each cloud into a regular 2x2 XY grid (xy_tiling²=4 tiles). Reduces preprocessing- and inference-time GPU memory
xy_tiling: 1
sample_graph_k: 2 # 2 spherical samples in each batch instead of 4. Reduces train-time GPU memory
callbacks:
gradient_accumulator:
scheduling:
0:
2 # accumulate gradient every 2 batches, to make up for reduced batch size
trainer:
# max_epochs: 288 # to keep same nb of steps: 25/9x more tiles, 2-step gradient accumulation -> epochs * 2 * 9 / 25
max_epochs: 800 # to keep same nb of steps: 25/9x more tiles, 2-step gradient accumulation -> epochs * 2 * 9 / 9
model:
optimizer:
lr: 0.01
weight_decay: 1e-4
logger:
wandb:
project: "spt_custom"
name: "SPT-64"
When training with this dataset, we get between 100 and 200 epochs in and then hit the following crash:
Error executing job with overrides: ['experiment=custom_dataset', 'datamodule.data_dir=/home/gdi-user/datasets/first_energy/pre_tiled_dataset', 'datamodule.max_intensity=4113.66996555']
Traceback (most recent call last):
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/train.py", line 139, in main
    metric_dict, _ = train(cfg)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/utils.py", line 48, in wrap
    raise ex
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/utils.py", line 45, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/train.py", line 114, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 215, in advance
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=0)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 278, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 348, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("on_after_batch_transfer", batch, dataloader_idx)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 336, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/datamodules/base.py", line 344, in on_after_batch_transfer
    return on_device_transform(nag)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/transforms.py", line 23, in __call__
    return self._process(x)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/graph.py", line 990, in _process
    nag._list[i_level] = _on_the_fly_horizontal_edge_features(
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/graph.py", line 1011, in _on_the_fly_horizontal_edge_features
    assert is_trimmed(se), \
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/graph.py", line 439, in is_trimmed
    edge_index_trimmed = to_trimmed(edge_index)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/graph.py", line 408, in to_trimmed
    s_larger_t = edge_index[0] > edge_index[1]
TypeError: 'NoneType' object is not subscriptable
So at some point in _on_the_fly_horizontal_edge_features or above, data.edge_index is None.
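For what it's worth, the failing statement reproduces trivially outside the project (a standalone illustration, not the repo's code):

# Standalone illustration: when a graph level has no edges, edge_index is left
# as None, and indexing it the way to_trimmed() does raises the same TypeError
edge_index = None  # what a single-node / empty superpoint graph carries
try:
    s_larger_t = edge_index[0] > edge_index[1]  # mirrors src/utils/graph.py:408
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable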
Hi @gvoysey, this might occur if the sampled batch contains a very small cloud. Typically, this could be because:
- some of your tiles contain very few points
- some of your tiles contain outlying points or outlying small superpoints
Both of these situations may lead to spurious graphs with only 1 node after calling SampleSubNodes and SampleRadiusSubgraphs. This can be problematic at any level of the partition (level 0 excluded), because as of now the code is not robust to these edge cases where single-node or empty graphs may be passed.
For deeper investigation, I suggest you save the NAG to disk in OnTheFlyHorizontalEdgeFeatures if one of the partition levels has no edge_index (obviously enough, you need to do so before the error-prone call to _on_the_fly_horizontal_edge_features). Even better, try to capture which cloud it comes from, to be able to reproduce this error consistently and investigate the problem more deeply.
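As a rough sketch (hypothetical helper name and output path, and assuming NAG.save(path) serializes the structure to HDF5), such a guard could be called right before the error-prone step:

def dump_nag_if_missing_edges(nag, out_path='/tmp/broken_nag.h5'):
    """Save the NAG to disk if any level above 0 has no horizontal edges."""
    for i in range(1, nag.num_levels):
        se = nag[i].edge_index
        if se is None or se.numel() == 0:
            nag.save(out_path)  # assumed HDF5 serialization of the whole NAG
            raise RuntimeError(
                f"Level {i} has no edge_index, NAG dumped to {out_path}")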
We'll add that instrumentation and update. Is there a reliable way to detect this on the fly? And if it happens during preprocessing, I'm curious why it doesn't arise every epoch.
This is likely a stochastic event happening at training time only, linked to the fact that SampleSubNodes and OnTheFlyHorizontalEdgeFeatures are designed for randomized batch construction. I suspect the issue is the conjunction of one of these and a specific cloud with outlying superpoints.
A reliable way of detecting it at train time is what I mentioned above. I suggest you try this first and find the cloud(s)/tile(s) from which this error occurred. Once isolated, it will be easier to investigate the issue.
Sounds good. My plan is to walk the preprocessed tree of *.h5 files with src.data.nag:NAG.load(...) -- does this approach seem like it has any gotchas? I remember there are some subtleties in handling src.data.Data w/r/t which keys get reinflated.
I do not think this will return any NAG objects with empty data.edge_index. As mentioned above, the issue arises only at training time, due to the conjunction of some unfortunate samplings and superpoint graphs with few nodes / with few neighbors. This is proven by the fact that you train for many epochs before the error randomly occurs.
So, use datamodule.train_dataloader() to get a dataloader and loop over it multiple times instead.
# Assumed setup: `datamodule` and `dataset` have been instantiated from the
# project's configs, and NAGBatch is assumed importable from src.data
from src.data import NAGBatch

num_trial_epochs = 100

# Loop over as many epochs as you'd like, until you encounter the error
for _ in range(num_trial_epochs):
    # Reset the dataloader at each epoch
    dataloader = datamodule.train_dataloader()
    for nag_list in dataloader:
        # Need to do this manually here because we are not using
        # lightning's training loop syntax
        nag = NAGBatch.from_nag_list([nag.cuda() for nag in nag_list])
        nag = dataset.on_device_transform(nag)
        # Test whatever
        for i in range(1, nag.num_levels):
            if nag[i].edge_index is None or nag[i].edge_index.shape[1] == 0:
                # Do something to store the data somewhere.
                # Ideally, you would be able to recover which
                # preprocessed file it came from, but I leave this up to you
                print(f"Level {i} has a missing or empty edge_index")
Ah! OK, that looks like a closer replication of the error environment, while still being heaps faster!
Hi @gvoysey, have you solved this issue? May I close it?
I think you can close this. We weren't able to add safeguards in the superpoint code to catch and continue when data.edge_index == None, but we were able to successfully train a model to completion after adjusting the gradient accumulator scheduling and our preprocessing tiling strategy.
I'm not amazingly confident in this approach, since it's very much a rough heuristic that doesn't account for scene complexity, but we used a combination of rejecting small files and tiling large ones, such that each LiDAR file in our train, val, and test sets contained a number of points in the closed interval [150_000, 2_000_000].
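In rough terms, the per-file decision looked like the following (illustrative only; the actual splitting was done with lastile, and the laspy-based point counting is an assumption):

# Illustrative keep / reject / retile decision for a single LAS/LAZ file
import laspy

MIN_POINTS = 150_000
MAX_POINTS = 2_000_000

def tiling_decision(path):
    """Return ('reject', 0), ('keep', 1) or ('retile', n_tiles)."""
    with laspy.open(path) as f:  # header-only read
        n = f.header.point_count
    if n < MIN_POINTS:
        return 'reject', 0
    if n <= MAX_POINTS:
        return 'keep', 1
    # Enough tiles so that each lands at or below MAX_POINTS, assuming a
    # roughly uniform point density across the tile footprint
    n_tiles = -(-n // MAX_POINTS)  # ceil division
    return 'retile', n_tiles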
That let us train for ~800 epochs. Quality of results is TBD and depends on many factors, but it didn't crash!
I see, thanks for the feedback! Well, that circumvented the problem. If you ever happen to isolate a problematic file, I can try to look into it. In the meantime, I am closing this issue.