Error finding on the fly horizontal features in custom dataset training
Closed this issue · 9 comments
We're training on a large custom dataset (~70GB, 2493270064 points, 80/10/10 train/test/val split). We've got some preprocessing in place upstream of superpoint-transformer -- primarily, using lastile and lasground_new to pre-tile the set so that each tile contains between 74934 and 6674441 points and is flattened, and rejecting LAS files that are too small or otherwise malformed.
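The rejection step is roughly the following (an illustrative sketch rather than our exact script; it assumes laspy and only checks that a file opens and that its header reports enough points):

# Illustrative sketch of the LAS file rejection pass, assuming laspy >= 2.x
from pathlib import Path
import laspy

MIN_POINTS = 74_934  # smallest tile we keep

def usable_las_files(tile_dir):
    """Yield LAS/LAZ paths that open cleanly and meet the minimum point count."""
    for path in sorted(Path(tile_dir).glob("*.la[sz]")):
        try:
            with laspy.open(path) as f:  # header-only read, no point data loaded
                n_points = f.header.point_count
        except Exception:
            continue  # malformed or unreadable file: reject
        if n_points >= MIN_POINTS:
            yield path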
Custom dataset yaml for datamodule and experiment as follows:
datamodule:
defaults:
- semantic/default.yaml
_target_: src.datamodules.custom_dataset.CUSTOM_DATASET_DataModule
# These parameters are not actually used by the DataModule, but are used
# here to facilitate model parameterization with config interpolation
num_classes: 7
trainval: False
val_on_test: True
xy_tiling: 3
load_full_res_idx: True
# Features that will be computed, saved, loaded for points and segments
error_folder_path: ""
original_ground_index: 1
ID2TRAINID: [7, 0, 1, 1, 1, 0, 0, 6, 2, 7, 1, 7, 7, 7, 7, 7, 3, 7, 7, 7, 4, 7, 5]
log_every_n_steps: 20
max_intensity: 5000
# point features used for the partition
partition_hf:
- 'linearity'
- 'planarity'
- 'scattering'
- 'elevation'
# point features used for training
point_hf:
- 'intensity'
- 'linearity'
- 'planarity'
- 'scattering'
- 'verticality'
- 'elevation'
# segment-wise features computed at preprocessing
segment_base_hf: []
# segment features computed as the mean of point feature in each
# segment, saved with "mean_" prefix
segment_mean_hf: []
# segment features computed as the std of point feature in each segment,
# saved with "std_" prefix
segment_std_hf: []
# horizontal edge features used for training
edge_hf:
- 'mean_off'
- 'std_off'
- 'mean_dist'
- 'angle_source'
- 'angle_target'
- 'centroid_dir'
- 'centroid_dist'
- 'normal_angle'
- 'log_length'
- 'log_surface'
- 'log_volume'
- 'log_size'
v_edge_hf: [] # vertical edge features used for training
# Parameters declared here to facilitate tuning configs without copying
# all the pre_transforms
voxel: 0.1
knn: 25
knn_r: 10
knn_step: -1
knn_min_search: 10
ground_threshold: 5
ground_scale: 20
pcp_regularization: [0.1, 0.2, 0.3]
pcp_spatial_weight: [1e-1, 1e-2, 1e-3]
pcp_cutoff: [10, 30, 100]
pcp_k_adjacency: 10
pcp_w_adjacency: 1
pcp_iterations: 15
graph_k_min: 1
graph_k_max: 30
graph_gap: [5, 30, 30]
graph_se_ratio: 0.3
graph_se_min: 20
graph_cycles: 3
graph_margin: 0.5
graph_chunk: [1e6, 1e5, 1e5] # reduce if CUDA memory errors
# Batch construction parameterization
sample_segment_ratio: 0.2
sample_segment_by_size: True
sample_segment_by_class: False
sample_point_min: 32
sample_point_max: 128
sample_graph_r: 50 # set to r<=0 to skip SampleRadiusSubgraphs
sample_graph_k: 4
sample_graph_disjoint: True
sample_edge_n_min: -1 # [5, 5, 15]
sample_edge_n_max: -1 # [10, 15, 25]
# Augmentations parameterization
pos_jitter: 0.05
tilt_n_rotate_phi: 0.1
tilt_n_rotate_theta: 180
anisotropic_scaling: 0.2
node_feat_jitter: 0
h_edge_feat_jitter: 0
v_edge_feat_jitter: 0
node_feat_drop: 0
h_edge_feat_drop: 0.3
v_edge_feat_drop: 0
node_row_drop: 0
h_edge_row_drop: 0
v_edge_row_drop: 0
drop_to_mean: False
# Preprocessing
pre_transform:
- transform: SaveNodeIndex
params:
key: 'sub'
- transform: DataTo
params:
device: 'cuda'
- transform: GridSampling3D
params:
size: ${datamodule.voxel}
hist_key: 'y'
hist_size: ${eval:'${datamodule.num_classes} + 1'}
- transform: KNN
params:
k: ${datamodule.knn}
r_max: ${datamodule.knn_r}
verbose: False
- transform: DataTo
params:
device: 'cpu'
- transform: GroundElevation
params:
threshold: ${datamodule.ground_threshold}
scale: ${datamodule.ground_scale}
- transform: PointFeatures
params:
keys: ${datamodule.point_hf_preprocess}
k_min: 1
k_step: ${datamodule.knn_step}
k_min_search: ${datamodule.knn_min_search}
- transform: DataTo
params:
device: 'cuda'
- transform: AdjacencyGraph
params:
k: ${datamodule.pcp_k_adjacency}
w: ${datamodule.pcp_w_adjacency}
- transform: ConnectIsolated
params:
k: 1
- transform: DataTo
params:
device: 'cpu'
- transform: AddKeysTo # move some features to 'x' to be used for partition
params:
keys: ${datamodule.partition_hf}
to: 'x'
delete_after: False
- transform: CutPursuitPartition
params:
regularization: ${datamodule.pcp_regularization}
spatial_weight: ${datamodule.pcp_spatial_weight}
k_adjacency: ${datamodule.pcp_k_adjacency}
cutoff: ${datamodule.pcp_cutoff}
iterations: ${datamodule.pcp_iterations}
parallel: True
verbose: False
num_classes: ${datamodule.num_classes}
- transform: NAGRemoveKeys # remove 'x' used for partition (features are still preserved under their respective Data attributes)
params:
level: 'all'
keys: 'x'
- transform: NAGTo
params:
device: 'cuda'
- transform: SegmentFeatures
params:
n_min: 32
n_max: 128
keys: ${datamodule.segment_base_hf_preprocess}
mean_keys: ${datamodule.segment_mean_hf_preprocess}
std_keys: ${datamodule.segment_std_hf_preprocess}
strict: False # will not raise error if a mean or std key is missing
- transform: RadiusHorizontalGraph
params:
k_min: ${datamodule.graph_k_min}
k_max: ${datamodule.graph_k_max}
gap: ${datamodule.graph_gap}
se_ratio: ${datamodule.graph_se_ratio}
se_min: ${datamodule.graph_se_min}
cycles: ${datamodule.graph_cycles}
margin: ${datamodule.graph_margin}
chunk_size: ${datamodule.graph_chunk}
halfspace_filter: True
bbox_filter: True
target_pc_flip: True
source_pc_sort: False
keys: ['mean_off', 'std_off', 'mean_dist' ]
- transform: NAGTo
params:
device: 'cpu'
# CPU-based train transforms
train_transform: null
# CPU-based val transforms
val_transform: ${datamodule.train_transform}
# CPU-based test transforms
test_transform: ${datamodule.val_transform}
# GPU-based train transforms
on_device_train_transform:
# Apply sampling transforms first to reduce the number of nodes and
# edges. These operations are compute-intensive and are the reason
# why these transforms are not performed on CPU
- transform: SampleSubNodes
params:
low: 0
high: 1
n_min: ${datamodule.sample_point_min}
n_max: ${datamodule.sample_point_max}
- transform: SampleRadiusSubgraphs
params:
r: ${datamodule.sample_graph_r}
k: ${datamodule.sample_graph_k}
i_level: 1
by_size: False
by_class: False
disjoint: ${datamodule.sample_graph_disjoint}
- transform: SampleSegments
params:
ratio: ${datamodule.sample_segment_ratio}
by_size: ${datamodule.sample_segment_by_size}
by_class: ${datamodule.sample_segment_by_class}
- transform: NAGRestrictSize
params:
level: '1+'
num_nodes: ${datamodule.max_num_nodes}
# Cast all attributes to either float or long. Doing this only now
# allows speeding up disk I/O and CPU->GPU transfer
- transform: NAGCast
# Apply geometric transforms affecting position, offsets, normals
# before calling transforms relying on those, such as on-the-fly
# edge features computation
- transform: NAGJitterKey
params:
key: 'pos'
sigma: ${datamodule.pos_jitter}
trunc: ${datamodule.voxel}
- transform: RandomTiltAndRotate
params:
phi: ${datamodule.tilt_n_rotate_phi}
theta: ${datamodule.tilt_n_rotate_theta}
- transform: RandomAnisotropicScale
params:
delta: ${datamodule.anisotropic_scaling}
- transform: RandomAxisFlip
params:
p: 0.5
# Compute some horizontal and vertical edges on-the-fly. Those are
# only computed now since they can be deduced from point and node
# attributes. Besides, the OnTheFlyHorizontalEdgeFeatures transform
# takes a trimmed graph as input and doubles its size, creating j->i
# for each input i->j edge
- transform: OnTheFlyHorizontalEdgeFeatures
params:
keys: ${datamodule.edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
- transform: OnTheFlyVerticalEdgeFeatures
params:
keys: ${datamodule.v_edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
# Edge sampling is only performed after the horizontal graph is
# untrimmed by OnTheFlyHorizontalEdgeFeatures
- transform: SampleEdges
params:
level: '1+'
n_min: ${datamodule.sample_edge_n_min}
n_max: ${datamodule.sample_edge_n_max}
- transform: NAGRestrictSize
params:
level: '1+'
num_edges: ${datamodule.max_num_edges}
# Move all point and segment features to 'x'
- transform: NAGAddKeysTo
params:
level: 0
keys: ${eval:'ListConfig(${datamodule.point_hf})'}
to: 'x'
- transform: NAGAddKeysTo
params:
level: '1+'
keys: ${eval:'ListConfig(${datamodule.segment_hf})'}
to: 'x'
# Add some noise and randomly some point, node and edge features
- transform: NAGJitterKey
params:
key: 'x'
sigma: ${datamodule.node_feat_jitter}
trunc: ${eval:'2 * ${datamodule.node_feat_jitter}'}
- transform: NAGJitterKey
params:
key: 'edge_attr'
sigma: ${datamodule.h_edge_feat_jitter}
trunc: ${eval:'2 * ${datamodule.h_edge_feat_jitter}'}
- transform: NAGJitterKey
params:
key: 'v_edge_attr'
sigma: ${datamodule.v_edge_feat_jitter}
trunc: ${eval:'2 * ${datamodule.v_edge_feat_jitter}'}
- transform: NAGDropoutColumns
params:
p: ${datamodule.node_feat_drop}
key: 'x'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutColumns
params:
p: ${datamodule.h_edge_feat_drop}
key: 'edge_attr'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutColumns
params:
p: ${datamodule.v_edge_feat_drop}
key: 'v_edge_attr'
inplace: True
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.node_row_drop}
key: 'x'
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.h_edge_row_drop}
key: 'edge_attr'
to_mean: ${datamodule.drop_to_mean}
- transform: NAGDropoutRows
params:
p: ${datamodule.v_edge_row_drop}
key: 'v_edge_attr'
to_mean: ${datamodule.drop_to_mean}
# Add self-loops in the horizontal graph
- transform: NAGAddSelfLoops
# Add a `node_size` attribute to all segments, this is needed for
# segment-wise position normalization with UnitSphereNorm
- transform: NodeSize
# GPU-based val transforms
on_device_val_transform:
# # According to @drprojects, removing this transform will allow for exact point inferencing
# # Apply sampling transforms first to reduce the number of nodes and
# # edges. These operations are compute-intensive and are the reason
# # why these transforms are not performed on CPU
# - transform: SampleSubNodes
# params:
# low: 0
# high: 1
# n_min: 128
# n_max: 256
# Cast all attributes to either float or long. Doing this only now
# allows speeding up disk I/O and CPU->GPU transfer
- transform: NAGCast
# Compute some horizontal and vertical edges on-the-fly. Those are
# only computed now since they can be deduced from point and node
# attributes. Besides, the OnTheFlyHorizontalEdgeFeatures transform
# takes a trimmed graph as input and doubles its size, creating j->i
# for each input i->j edge
- transform: OnTheFlyHorizontalEdgeFeatures
params:
keys: ${datamodule.edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
- transform: OnTheFlyVerticalEdgeFeatures
params:
keys: ${datamodule.v_edge_hf}
use_mean_normal: ${eval:'"normal" in ${datamodule.segment_mean_hf}'}
# Move all point and segment features to 'x'
- transform: NAGAddKeysTo
params:
level: 0
keys: ${eval:'ListConfig(${datamodule.point_hf})'}
to: 'x'
- transform: NAGAddKeysTo
params:
level: '1+'
keys: ${eval:'ListConfig(${datamodule.segment_hf})'}
to: 'x'
# Add self-loops in the horizontal graph
- transform: NAGAddSelfLoops
# Add a `node_size` attribute to all segments, this is needed for
# segment-wise position normalization with UnitSphereNorm
- transform: NodeSize
# GPU-based test transforms
on_device_test_transform: ${datamodule.on_device_val_transform}
experiment:
# @package _global_
# to execute this experiment run:
# python train.py experiment=dales
defaults:
- override /datamodule: custom_dataset.yaml
- override /model: semantic/spt-2.yaml
# - override /model: nano-2.yaml # When inferencing OR training a nano-2 model, this must be set when both training and inferencing
- override /trainer: gpu.yaml
# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters
datamodule:
# xy_tiling: [2, 2] # split each cloud into a regular 2x2 XY grid (xy_tiling²=4 tiles). Reduces preprocessing- and inference-time GPU memory
xy_tiling: 1
sample_graph_k: 2 # 2 spherical samples in each batch instead of 4. Reduces train-time GPU memory
callbacks:
gradient_accumulator:
scheduling:
0:
2 # accumulate gradient every 2 batches, to make up for reduced batch size
trainer:
# max_epochs: 288 # to keep same nb of steps: 25/9x more tiles, 2-step gradient accumulation -> epochs * 2 * 9 / 25
max_epochs: 800 # to keep same nb of steps: 25/9x more tiles, 2-step gradient accumulation -> epochs * 2 * 9 / 9
model:
optimizer:
lr: 0.01
weight_decay: 1e-4
logger:
wandb:
project: "spt_custom"
name: "SPT-64"
When training with this dataset, we get between 100 and 200 epochs in and then hit the following crash:
Error executing job with overrides: ['experiment=custom_dataset', 'datamodule.data_dir=/home/gdi-user/datasets/first_energy/pre_tiled_dataset', 'datamodule.max_intensity=4113.66996555']
Traceback (most recent call last):
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/train.py", line 139, in main
    metric_dict, _ = train(cfg)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/utils.py", line 48, in wrap
    raise ex
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/utils.py", line 45, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/train.py", line 114, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 215, in advance
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=0)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 278, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 348, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("on_after_batch_transfer", batch, dataloader_idx)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 336, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/datamodules/base.py", line 344, in on_after_batch_transfer
    return on_device_transform(nag)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/venv/lib/python3.10/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/transforms.py", line 23, in __call__
    return self._process(x)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/graph.py", line 990, in _process
    nag._list[i_level] = _on_the_fly_horizontal_edge_features(
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/transforms/graph.py", line 1011, in _on_the_fly_horizontal_edge_features
    assert is_trimmed(se), \
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/graph.py", line 439, in is_trimmed
    edge_index_trimmed = to_trimmed(edge_index)
  File "/home/gdi-user/researcher/Projects/superpoint_transformer/src/utils/graph.py", line 408, in to_trimmed
    s_larger_t = edge_index[0] > edge_index[1]
TypeError: 'NoneType' object is not subscriptable
So at some point in _on_the_fly_horizontal_edge_features or above, data.edge_index is None.
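For what it's worth, the failing statement reproduces trivially outside the project (a standalone illustration, not the repo's code):

# Standalone illustration: when a graph level has no edges, edge_index is left
# as None, and indexing it the way to_trimmed() does raises the same TypeError
edge_index = None  # what a single-node / empty superpoint graph carries
try:
    s_larger_t = edge_index[0] > edge_index[1]  # mirrors src/utils/graph.py:408
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable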
Hi @gvoysey, this might occur if the sampled batch contains a very small cloud. Typically, this could be because:
- some of your tiles contain very few points
- some of your tiles contain outlying points or outlying small superpoints
Both of these situations may lead to spurious graphs with only 1 node after calling SampleSubNodes and SampleRadiusSubgraphs. This can be problematic at any level of the partition (level 0 excluded), because as of now the code is not robust to these edge cases where single-node or empty graphs may be passed.
For deeper investigation, I suggest you save the NAG to disk in OnTheFlyHorizontalEdgeFeatures if one of the partition levels has no edge_index (obviously enough, you need to do so before the error-prone call to _on_the_fly_horizontal_edge_features). Even better, try to capture which cloud it comes from, to be able to reproduce this error consistently and investigate the problem more deeply.
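As a rough sketch (hypothetical helper name and output path, and assuming NAG.save(path) serializes the structure to HDF5), such a guard could be called right before the error-prone step:

def dump_nag_if_missing_edges(nag, out_path='/tmp/broken_nag.h5'):
    """Save the NAG to disk if any level above 0 has no horizontal edges."""
    for i in range(1, nag.num_levels):
        se = nag[i].edge_index
        if se is None or se.numel() == 0:
            nag.save(out_path)  # assumed HDF5 serialization of the whole NAG
            raise RuntimeError(
                f"Level {i} has no edge_index, NAG dumped to {out_path}")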
We'll add that instrumentation and update. Is there a reliable way to detect this on the fly? And if it happens during preprocessing, I'm curious why it doesn't arise every epoch.
This is likely a stochastic event happening at training time only, linked to the fact that SampleSubNodes and OnTheFlyHorizontalEdgeFeatures are designed for randomized batch construction. I suspect the issue is the conjunction of one of these and a specific cloud with outlying superpoints.
A reliable way of detecting it at train time is what I mentioned above. I suggest you try this first and find the cloud(s)/tile(s) from which this error occurred. Once isolated, it will be easier to investigate the issue.
Sounds good. My plan is to walk the preprocessed tree of *.h5 files with src.data.nag:NAG.load(...) -- does this approach seem like it has any gotchas? I remember there are some subtleties in handling src.data.Data w/r/t which keys get reinflated.
I do not think this will return any NAG objects with empty data.edge_index. As mentioned above, the issue arises only at training time, due to the conjunction of some unfortunate samplings and superpoint graphs with few nodes / with few neighbors. This is proven by the fact that you train for many epochs before the error randomly occurs.
So, use datamodule.train_dataloader() to get a dataloader and loop over it multiple times instead.
# Assumed setup: `datamodule` and `dataset` have been instantiated from the
# project's configs, and NAGBatch is assumed importable from src.data
from src.data import NAGBatch

num_trial_epochs = 100

# Loop over as many epochs as you'd like, until you encounter the error
for _ in range(num_trial_epochs):
    # Reset the dataloader at each epoch
    dataloader = datamodule.train_dataloader()
    for nag_list in dataloader:
        # Need to do this manually here because we are not using
        # lightning's training loop syntax
        nag = NAGBatch.from_nag_list([nag.cuda() for nag in nag_list])
        nag = dataset.on_device_transform(nag)
        # Test whatever
        for i in range(1, nag.num_levels):
            if nag[i].edge_index is None or nag[i].edge_index.shape[1] == 0:
                # Do something to store the data somewhere.
                # Ideally, you would be able to recover which
                # preprocessed file it came from, but I leave this up to you
                print(f"Level {i} has a missing or empty edge_index")
Ah! OK, that looks like a closer replication of the error environment, while still being heaps faster!
Hi @gvoysey, have you solved this issue? May I close it?
I think you can close this. We weren't able to add safeguards in the superpoint code to catch and continue when data.edge_index == None, but we were able to successfully train a model to completion after adjusting the gradient accumulator scheduling and our preprocessing tiling strategy.
I'm not amazingly confident in this approach, since it's very much a rough heuristic that doesn't account for scene complexity, but we used a combination of rejecting small files and tiling large ones, such that each LiDAR file in our train, val, and test sets contained a number of points in the closed interval [150_000, 2_000_000].
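In rough terms, the per-file decision looked like the following (illustrative only; the actual splitting was done with lastile, and the laspy-based point counting is an assumption):

# Illustrative keep / reject / retile decision for a single LAS/LAZ file
import laspy

MIN_POINTS = 150_000
MAX_POINTS = 2_000_000

def tiling_decision(path):
    """Return ('reject', 0), ('keep', 1) or ('retile', n_tiles)."""
    with laspy.open(path) as f:  # header-only read
        n = f.header.point_count
    if n < MIN_POINTS:
        return 'reject', 0
    if n <= MAX_POINTS:
        return 'keep', 1
    # Enough tiles so that each lands at or below MAX_POINTS, assuming a
    # roughly uniform point density across the tile footprint
    n_tiles = -(-n // MAX_POINTS)  # ceil division
    return 'retile', n_tiles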
That let us train for ~800 epochs. Quality of results is TBD and depends on many factors, but it didn't crash!
I see, thanks for the feedback! Well, that circumvented the problem. If you ever happen to isolate a problematic file, I can try to look into it. In the meantime, I am closing this issue.