CUDA assertion failure when trying to resume splatfacto training
Describe the bug
I trained a splatfacto model for 7000 iterations. When I then try to resume training from the checkpoint, to continue to 30000 iterations, training crashes.
To Reproduce
Steps to reproduce the behavior:
- Train a splatfacto model for 7000 iterations, something like
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 7000 colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4
- Then try to resume training to 30000 iterations
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4
- Observe a crash, which appears to be caused by numerous index-out-of-bounds assertion failures in a CUDA kernel
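The error text in the logs below notes that CUDA kernel errors may be reported asynchronously, so the Python frame in the traceback (`torch.where` in gsplat's `split`) is not necessarily where the bad index was produced. A minimal sketch of rerunning the resume command with `CUDA_LAUNCH_BLOCKING=1`, as that message suggests. The helper names `blocking_env` and `rerun` are hypothetical, and the command is whatever `ns-train` invocation is being debugged:

```python
import os
import subprocess


def blocking_env(base=None):
    """Copy an environment mapping and enable synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    # With this set, a failing kernel is reported at the call that launched it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(cmd):
    """Run cmd (a list of arguments) with the modified environment.

    Returns the exit code of the child process.
    """
    return subprocess.run(list(cmd), env=blocking_env(), check=False).returncode
```

Here `cmd` would be the resume command from the step above, split into a list of arguments. Synchronous launches are slower, so this is only meant for a single debugging run.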
Logs from crash
(venv) abroun@Desktop-22:/src$ ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4 --colmap-path sparse/0
/src/nerfstudio/field_components/activations.py:32: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float32)
/src/nerfstudio/field_components/activations.py:39: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, g):
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
_target=<class 'nerfstudio.engine.trainer.Trainer'>,
output_dir=PosixPath('outputs'),
method_name='splatfacto',
experiment_name=None,
project_name='nerfstudio-project',
timestamp='2024-12-19_170734',
machine=MachineConfig(seed=42, num_devices=1, num_machines=1, machine_rank=0, dist_url='auto', device_type='cuda'),
logging=LoggingConfig(
relative_log_dir=PosixPath('.'),
steps_per_log=10,
max_buffer_size=20,
local_writer=LocalWriterConfig(
_target=<class 'nerfstudio.utils.writer.LocalWriter'>,
enable=True,
stats_to_track=(
<EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
<EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
<EventName.CURR_TEST_PSNR: 'Test PSNR'>,
<EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
<EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
<EventName.ETA: 'ETA (time)'>
),
max_log_size=10
),
profiler='basic'
),
viewer=ViewerConfig(
relative_log_filename='viewer_log_filename.txt',
websocket_port=None,
websocket_port_default=7007,
websocket_host='0.0.0.0',
num_rays_per_chunk=32768,
max_num_display_images=512,
quit_on_train_completion=False,
image_format='jpeg',
jpeg_quality=75,
make_share_url=False,
camera_frustum_scale=0.1,
default_composite_depth=True
),
pipeline=VanillaPipelineConfig(
_target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
datamanager=FullImageDatamanagerConfig(
_target=<class 'nerfstudio.data.datamanagers.full_images_datamanager.FullImageDatamanager'>,
data=None,
masks_on_gpu=False,
images_on_gpu=False,
dataparser=ColmapDataParserConfig(
_target=<class 'nerfstudio.data.dataparsers.colmap_dataparser.ColmapDataParser'>,
data=PosixPath('/media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap'),
scale_factor=1.0,
downscale_factor=4,
downscale_rounding_mode='floor',
scene_scale=1.0,
orientation_method='up',
center_method='poses',
auto_scale_poses=True,
assume_colmap_world_coordinate_convention=True,
eval_mode='interval',
train_split_fraction=0.9,
eval_interval=8,
depth_unit_scale_factor=0.001,
images_path=PosixPath('images'),
masks_path=None,
depths_path=None,
colmap_path=PosixPath('sparse/0'),
load_3D_points=True,
max_2D_matches_per_3D_point=0
),
camera_res_scale_factor=1.0,
eval_num_images_to_sample_from=-1,
eval_num_times_to_repeat_images=-1,
eval_image_indices=(0,),
cache_images='gpu',
cache_images_type='uint8',
max_thread_workers=None,
train_cameras_sampling_strategy='random',
train_cameras_sampling_seed=42,
fps_reset_every=100
),
model=SplatfactoModelConfig(
_target=<class 'nerfstudio.models.splatfacto.SplatfactoModel'>,
enable_collider=True,
collider_params={'near_plane': 2.0, 'far_plane': 6.0},
loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
eval_num_rays_per_chunk=4096,
prompt=None,
warmup_length=500,
refine_every=100,
resolution_schedule=3000,
background_color='random',
num_downscales=2,
cull_alpha_thresh=0.1,
cull_scale_thresh=0.5,
reset_alpha_every=30,
densify_grad_thresh=0.0008,
use_absgrad=True,
densify_size_thresh=0.01,
n_split_samples=2,
sh_degree_interval=1000,
cull_screen_size=0.15,
split_screen_size=0.05,
stop_screen_size_at=4000,
random_init=False,
num_random=50000,
random_scale=10.0,
ssim_lambda=0.2,
stop_split_at=15000,
sh_degree=3,
use_scale_regularization=False,
max_gauss_ratio=10.0,
output_depth_during_training=False,
rasterize_mode='classic',
camera_optimizer=CameraOptimizerConfig(
_target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
mode='off',
trans_l2_penalty=0.01,
rot_l2_penalty=0.001,
optimizer=None,
scheduler=None
),
use_bilateral_grid=False,
grid_shape=(16, 16, 8),
color_corrected_metrics=False
)
),
optimizers={
'means': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.00016,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=1e-08,
lr_final=1.6e-06,
warmup_steps=0,
max_steps=30000,
ramp='cosine'
)
},
'features_dc': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.0025,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'features_rest': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.000125,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'opacities': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.05,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'scales': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.005,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'quats': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.001,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'camera_opt': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.0001,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=0,
lr_final=5e-07,
warmup_steps=1000,
max_steps=30000,
ramp='cosine'
)
},
'bilateral_grid': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.002,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=0,
lr_final=0.0001,
warmup_steps=1000,
max_steps=30000,
ramp='cosine'
)
}
},
vis='viewer+tensorboard',
data=None,
prompt=None,
relative_model_dir=PosixPath('nerfstudio_models'),
load_scheduler=True,
steps_per_save=2000,
steps_per_eval_batch=0,
steps_per_eval_image=100,
steps_per_eval_all_images=1000,
max_num_iterations=30000,
mixed_precision=False,
use_grad_scaler=False,
save_only_latest_checkpoint=True,
load_dir=PosixPath('outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models'),
load_step=None,
load_config=None,
load_checkpoint=None,
log_gradients=False,
gradient_accumulation_steps={},
start_paused=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[17:07:34] Saving config to: outputs/unnamed/splatfacto/2024-12-19_170734/config.yml experiment_config.py:136
/src/nerfstudio/engine/trainer.py:137: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.grad_scaler = GradScaler(enabled=self.use_grad_scaler)
Saving checkpoints to: outputs/unnamed/splatfacto/2024-12-19_170734/nerfstudio_models trainer.py:142
Train dataset has over 500 images, overriding cache_images to cpu
/src/venv/lib/python3.10/site-packages/torchmetrics/functional/image/lpips.py:325: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
self.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)
╭─────────────── viser ───────────────╮
│ ╷ │
│ HTTP │ http://0.0.0.0:7007 │
│ Websocket │ ws://0.0.0.0:7007 │
│ ╵ │
╰─────────────────────────────────────╯
[17:07:48] Caching / undistorting eval images full_images_datamanager.py:230
Loading latest Nerfstudio checkpoint from load_dir...
/src/nerfstudio/engine/trainer.py:432: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
loaded_state = torch.load(load_path, map_location="cpu")
Done loading Nerfstudio checkpoint from
outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models/step-000006999.ckpt
logging events to: outputs/unnamed/splatfacto/2024-12-19_170734
[17:08:08] Caching / undistorting train images full_images_datamanager.py:230
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [34,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [35,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [36,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [37,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [38,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [39,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [40,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [41,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [42,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [43,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [44,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [45,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [46,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [47,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [48,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [49,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [50,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [51,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [52,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [53,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [54,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [55,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [56,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [57,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [58,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [59,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [60,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
snip
...
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [21,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [104,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [105,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [106,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [107,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [108,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [109,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [110,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [111,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [112,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [113,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [114,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [115,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [116,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [117,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [118,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [119,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [120,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 14.4561
VanillaPipeline.get_train_loss_dict: 14.1312
Traceback (most recent call last):
File "/src/venv/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/src/nerfstudio/scripts/train.py", line 272, in entrypoint
main(
File "/src/nerfstudio/scripts/train.py", line 257, in main
launch(
File "/src/nerfstudio/scripts/train.py", line 190, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/src/nerfstudio/scripts/train.py", line 101, in train_loop
trainer.train()
File "/src/nerfstudio/engine/trainer.py", line 270, in train
callback.run_callback_at_location(
File "/src/nerfstudio/engine/callbacks.py", line 116, in run_callback_at_location
self.run_callback(step=step)
File "/src/nerfstudio/engine/callbacks.py", line 106, in run_callback
self.func(*self.args, **self.kwargs, step=step)
File "/src/nerfstudio/models/splatfacto.py", line 341, in step_post_backward
self.strategy.step_post_backward(
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/default.py", line 173, in step_post_backward
n_dupli, n_split = self._grow_gs(params, optimizers, state, step)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/default.py", line 303, in _grow_gs
split(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/ops.py", line 135, in split
sel = torch.where(mask)[0]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Expected behavior
Ideally, I'd like to be able to resume training so that I can experiment with adjusting learning rates, optimisation parameters, etc.
Additional context
My goal here is to experiment with different learning rates, optimisation parameters, etc.: find some parameters that work well for a small number of iterations, and then adjust them further from that baseline. I think it would be useful to be able to resume training from 7000 iterations so that I can explore adjusting parameters over time more quickly (rather than having to start from scratch each time). I'm not particularly familiar with the Nerfstudio project, though, so if there's a better way of achieving this goal I'd be very grateful for some pointers. Cheers.
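In case it helps with triage, here is a minimal sketch for inspecting the checkpoint offline. It assumes the `.ckpt` file is a nested dict of tensors (the `torch.load(load_path, map_location="cpu")` call in the log suggests this, but the exact layout is an assumption), and the helper names are hypothetical. It groups every stored tensor by its first dimension, so a per-Gaussian tensor whose length disagrees with the others would stand out. It only sees what is stored in the file, not any state that is rebuilt when training resumes.

```python
from collections.abc import Mapping


def leading_dims(state, prefix=""):
    """Map dotted key -> first dimension for every tensor-like value.

    Walks nested mappings; values without a non-empty `.shape`
    (ints, lists, 0-dim tensors) are skipped.
    """
    dims = {}
    for key, value in state.items():
        name = f"{prefix}{key}"
        if isinstance(value, Mapping):
            dims.update(leading_dims(value, prefix=f"{name}."))
            continue
        shape = getattr(value, "shape", None)
        if shape is not None and len(shape) > 0:
            dims[name] = int(shape[0])
    return dims


def group_by_size(dims):
    """Invert {name: size} into {size: [names]} to make outliers visible."""
    groups = {}
    for name, size in dims.items():
        groups.setdefault(size, []).append(name)
    return groups


def report(ckpt_path):
    """Print how many stored tensors share each leading dimension."""
    import torch  # imported lazily so the helpers above work without it

    # Mirrors the load call shown in the log (trainer.py:432).
    ckpt = torch.load(ckpt_path, map_location="cpu")
    for size, names in sorted(group_by_size(leading_dims(ckpt)).items()):
        print(f"{size:>10}  {len(names):>4} tensors, e.g. {names[0]}")
```

Calling `report("outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models/step-000006999.ckpt")` should list the sizes found. If the Gaussian parameters and their optimizer state all share one size, the mismatch is more likely introduced at load time than stored in the file.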