Error when running 'scannet_val.sh' for training.
Opened this issue · 5 comments
I have adjusted the 'scannet_val.sh' file as follows:
#!/bin/bash
export OMP_NUM_THREADS=3 # speeds up MinkowskiEngine
CURR_DBSCAN=0.95
CURR_TOPK=500
CURR_QUERY=150
# TRAIN
python main_instance_segmentation.py \
general.experiment_name="validation" \
general.eval_on_segments=true \
general.train_on_segments=true \
data.voxel_size=0.02 \
data.batch_size=1
# # TEST
# python main_instance_segmentation.py \
# general.experiment_name="validation_query_${CURR_QUERY}_topk_${CURR_TOPK}_dbscan_${CURR_DBSCAN}" \
# general.project_name="scannet_eval" \
# general.checkpoint='checkpoints/scannet/scannet_val.ckpt' \
# general.train_mode=false \
# general.eval_on_segments=true \
# general.train_on_segments=true \
# model.num_queries=${CURR_QUERY} \
# general.topk_per_image=${CURR_TOPK} \
# general.use_dbscan=true \
# general.dbscan_eps=${CURR_DBSCAN}
I got NaN results when I ran 'scannet_val.sh'. In addition, an error about the batch size being too large is raised. Do you have any idea how to fix this? (I'm using a 3090 GPU with 24 GB of memory.) Thank you.
/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/utilities/seed.py:55: UserWarning: No seed found, seed set to 1277561742
rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 1277561742
{'_target_': 'pytorch_lightning.loggers.WandbLogger', 'project': '${general.project_name}', 'name': '${general.experiment_name}', 'save_dir': '${general.save_dir}', 'entity': 'trandangtrungduc', 'resume': 'allow', 'id': '${general.experiment_name}'}
wandb: Currently logged in as: trandangtrungduc. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.0
wandb: Run data is saved locally in saved/validation/wandb/run-20230718_105041-validation
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run validation
wandb: ⭐ View project at https://wandb.ai/trandangtrungduc/scannet
wandb: 🚀 View run at https://wandb.ai/trandangtrungduc/scannet/runs/validation
[2023-07-18 10:50:44,698][__main__][INFO] - {'general_train_mode': True, 'general_task': 'instance_segmentation', 'general_seed': None, 'general_checkpoint': None, 'general_backbone_checkpoint': None, 'general_freeze_backbone': False, 'general_linear_probing_backbone': False, 'general_train_on_segments': True, 'general_eval_on_segments': True, 'general_filter_out_instances': False, 'general_save_visualizations': False, 'general_visualization_point_size': 20, 'general_decoder_id': -1, 'general_export': False, 'general_use_dbscan': False, 'general_ignore_class_threshold': 100, 'general_project_name': 'scannet', 'general_workspace': 'jonasschult', 'general_experiment_name': 'validation', 'general_num_targets': 19, 'general_add_instance': True, 'general_dbscan_eps': 0.95, 'general_dbscan_min_points': 1, 'general_export_threshold': 0.0001, 'general_reps_per_epoch': 1, 'general_on_crops': False, 'general_scores_threshold': 0.0, 'general_iou_threshold': 1.0, 'general_area': 5, 'general_eval_inner_core': -1, 'general_topk_per_image': 100, 'general_ignore_mask_idx': [], 'general_max_batch_size': 8, 'general_save_dir': 'saved/validation', 'general_gpus': 1, 'data_train_mode': 'train', 'data_validation_mode': 'validation', 'data_test_mode': 'validation', 'data_ignore_label': 255, 'data_add_raw_coordinates': True, 'data_add_colors': True, 'data_add_normals': False, 'data_in_channels': 3, 'data_num_labels': 20, 'data_add_instance': True, 'data_task': 'instance_segmentation', 'data_pin_memory': False, 'data_num_workers': 4, 'data_batch_size': 1, 'data_test_batch_size': 1, 'data_cache_data': False, 'data_voxel_size': 0.02, 'data_reps_per_epoch': 1, 'data_cropping': False, 'data_cropping_args_min_points': 30000, 'data_cropping_args_aspect': 0.8, 'data_cropping_args_min_crop': 0.5, 'data_cropping_args_max_crop': 1.0, 'data_crop_min_size': 20000, 'data_crop_length': 6.0, 'data_cropping_v1': True, 'data_train_dataloader__target_': 'torch.utils.data.DataLoader', 
'data_train_dataloader_shuffle': True, 'data_train_dataloader_pin_memory': False, 'data_train_dataloader_num_workers': 4, 'data_train_dataloader_batch_size': 1, 'data_validation_dataloader__target_': 'torch.utils.data.DataLoader', 'data_validation_dataloader_shuffle': False, 'data_validation_dataloader_pin_memory': False, 'data_validation_dataloader_num_workers': 4, 'data_validation_dataloader_batch_size': 1, 'data_test_dataloader__target_': 'torch.utils.data.DataLoader', 'data_test_dataloader_shuffle': False, 'data_test_dataloader_pin_memory': False, 'data_test_dataloader_num_workers': 4, 'data_test_dataloader_batch_size': 1, 'data_train_dataset__target_': 'datasets.semseg.SemanticSegmentationDataset', 'data_train_dataset_dataset_name': 'scannet', 'data_train_dataset_data_dir': 'data/processed/scannet', 'data_train_dataset_image_augmentations_path': 'conf/augmentation/albumentations_aug.yaml', 'data_train_dataset_volume_augmentations_path': 'conf/augmentation/volumentations_aug.yaml', 'data_train_dataset_label_db_filepath': 'data/processed/scannet/label_database.yaml', 'data_train_dataset_color_mean_std': 'data/processed/scannet/color_mean_std.yaml', 'data_train_dataset_data_percent': 1.0, 'data_train_dataset_mode': 'train', 'data_train_dataset_ignore_label': 255, 'data_train_dataset_num_labels': 20, 'data_train_dataset_add_raw_coordinates': True, 'data_train_dataset_add_colors': True, 'data_train_dataset_add_normals': False, 'data_train_dataset_add_instance': True, 'data_train_dataset_instance_oversampling': 0.0, 'data_train_dataset_place_around_existing': False, 'data_train_dataset_point_per_cut': 0, 'data_train_dataset_max_cut_region': 0, 'data_train_dataset_flip_in_center': False, 'data_train_dataset_noise_rate': 0, 'data_train_dataset_resample_points': 0, 'data_train_dataset_add_unlabeled_pc': False, 'data_train_dataset_cropping': False, 'data_train_dataset_cropping_args_min_points': 30000, 'data_train_dataset_cropping_args_aspect': 0.8, 
'data_train_dataset_cropping_args_min_crop': 0.5, 'data_train_dataset_cropping_args_max_crop': 1.0, 'data_train_dataset_is_tta': False, 'data_train_dataset_crop_min_size': 20000, 'data_train_dataset_crop_length': 6.0, 'data_train_dataset_filter_out_classes': [0, 1], 'data_train_dataset_label_offset': 2, 'data_validation_dataset__target_': 'datasets.semseg.SemanticSegmentationDataset', 'data_validation_dataset_dataset_name': 'scannet', 'data_validation_dataset_data_dir': 'data/processed/scannet', 'data_validation_dataset_image_augmentations_path': None, 'data_validation_dataset_volume_augmentations_path': None, 'data_validation_dataset_label_db_filepath': 'data/processed/scannet/label_database.yaml', 'data_validation_dataset_color_mean_std': 'data/processed/scannet/color_mean_std.yaml', 'data_validation_dataset_data_percent': 1.0, 'data_validation_dataset_mode': 'validation', 'data_validation_dataset_ignore_label': 255, 'data_validation_dataset_num_labels': 20, 'data_validation_dataset_add_raw_coordinates': True, 'data_validation_dataset_add_colors': True, 'data_validation_dataset_add_normals': False, 'data_validation_dataset_add_instance': True, 'data_validation_dataset_cropping': False, 'data_validation_dataset_is_tta': False, 'data_validation_dataset_crop_min_size': 20000, 'data_validation_dataset_crop_length': 6.0, 'data_validation_dataset_filter_out_classes': [0, 1], 'data_validation_dataset_label_offset': 2, 'data_test_dataset__target_': 'datasets.semseg.SemanticSegmentationDataset', 'data_test_dataset_dataset_name': 'scannet', 'data_test_dataset_data_dir': 'data/processed/scannet', 'data_test_dataset_image_augmentations_path': None, 'data_test_dataset_volume_augmentations_path': None, 'data_test_dataset_label_db_filepath': 'data/processed/scannet/label_database.yaml', 'data_test_dataset_color_mean_std': 'data/processed/scannet/color_mean_std.yaml', 'data_test_dataset_data_percent': 1.0, 'data_test_dataset_mode': 'validation', 'data_test_dataset_ignore_label': 
255, 'data_test_dataset_num_labels': 20, 'data_test_dataset_add_raw_coordinates': True, 'data_test_dataset_add_colors': True, 'data_test_dataset_add_normals': False, 'data_test_dataset_add_instance': True, 'data_test_dataset_cropping': False, 'data_test_dataset_is_tta': False, 'data_test_dataset_crop_min_size': 20000, 'data_test_dataset_crop_length': 6.0, 'data_test_dataset_filter_out_classes': [0, 1], 'data_test_dataset_label_offset': 2, 'data_train_collation__target_': 'datasets.utils.VoxelizeCollate', 'data_train_collation_ignore_label': 255, 'data_train_collation_voxel_size': 0.02, 'data_train_collation_mode': 'train', 'data_train_collation_small_crops': False, 'data_train_collation_very_small_crops': False, 'data_train_collation_batch_instance': False, 'data_train_collation_probing': False, 'data_train_collation_task': 'instance_segmentation', 'data_train_collation_ignore_class_threshold': 100, 'data_train_collation_filter_out_classes': [0, 1], 'data_train_collation_label_offset': 2, 'data_train_collation_num_queries': 100, 'data_validation_collation__target_': 'datasets.utils.VoxelizeCollate', 'data_validation_collation_ignore_label': 255, 'data_validation_collation_voxel_size': 0.02, 'data_validation_collation_mode': 'validation', 'data_validation_collation_batch_instance': False, 'data_validation_collation_probing': False, 'data_validation_collation_task': 'instance_segmentation', 'data_validation_collation_ignore_class_threshold': 100, 'data_validation_collation_filter_out_classes': [0, 1], 'data_validation_collation_label_offset': 2, 'data_validation_collation_num_queries': 100, 'data_test_collation__target_': 'datasets.utils.VoxelizeCollate', 'data_test_collation_ignore_label': 255, 'data_test_collation_voxel_size': 0.02, 'data_test_collation_mode': 'validation', 'data_test_collation_batch_instance': False, 'data_test_collation_probing': False, 'data_test_collation_task': 'instance_segmentation', 'data_test_collation_ignore_class_threshold': 100, 
'data_test_collation_filter_out_classes': [0, 1], 'data_test_collation_label_offset': 2, 'data_test_collation_num_queries': 100, 'logging': [{'_target_': 'pytorch_lightning.loggers.WandbLogger', 'project': 'scannet', 'name': 'validation', 'save_dir': 'saved/validation', 'entity': 'trandangtrungduc', 'resume': 'allow', 'id': 'validation'}], 'model__target_': 'models.Mask3D', 'model_hidden_dim': 128, 'model_dim_feedforward': 1024, 'model_num_queries': 100, 'model_num_heads': 8, 'model_num_decoders': 3, 'model_dropout': 0.0, 'model_pre_norm': False, 'model_use_level_embed': False, 'model_normalize_pos_enc': True, 'model_positional_encoding_type': 'fourier', 'model_gauss_scale': 1.0, 'model_hlevels': [0, 1, 2, 3], 'model_non_parametric_queries': True, 'model_random_query_both': False, 'model_random_normal': False, 'model_random_queries': False, 'model_use_np_features': False, 'model_sample_sizes': [200, 800, 3200, 12800, 51200], 'model_max_sample_size': False, 'model_shared_decoder': True, 'model_num_classes': 19, 'model_train_on_segments': True, 'model_scatter_type': 'mean', 'model_voxel_size': 0.02, 'model_config_backbone__target_': 'models.Res16UNet34C', 'model_config_backbone_config_dialations': [1, 1, 1, 1], 'model_config_backbone_config_conv1_kernel_size': 5, 'model_config_backbone_config_bn_momentum': 0.02, 'model_config_backbone_in_channels': 3, 'model_config_backbone_out_channels': 20, 'model_config_backbone_out_fpn': True, 'metrics__target_': 'models.metrics.ConfusionMatrix', 'metrics_num_classes': 20, 'metrics_ignore_label': 255, 'optimizer__target_': 'torch.optim.AdamW', 'optimizer_lr': 0.0001, 'scheduler_scheduler__target_': 'torch.optim.lr_scheduler.OneCycleLR', 'scheduler_scheduler_max_lr': 0.0001, 'scheduler_scheduler_epochs': 601, 'scheduler_scheduler_steps_per_epoch': -1, 'scheduler_pytorch_lightning_params_interval': 'step', 'trainer_deterministic': False, 'trainer_max_epochs': 601, 'trainer_min_epochs': 1, 'trainer_resume_from_checkpoint': None, 
'trainer_check_val_every_n_epoch': 50, 'trainer_num_sanity_val_steps': 2, 'callbacks': [{'_target_': 'pytorch_lightning.callbacks.ModelCheckpoint', 'monitor': 'val_mean_ap_50', 'save_last': True, 'save_top_k': 1, 'mode': 'max', 'dirpath': 'saved/validation', 'filename': '{epoch}-{val_mean_ap_50:.3f}', 'every_n_epochs': 1}, {'_target_': 'pytorch_lightning.callbacks.LearningRateMonitor'}], 'matcher__target_': 'models.matcher.HungarianMatcher', 'matcher_cost_class': 2.0, 'matcher_cost_mask': 5.0, 'matcher_cost_dice': 2.0, 'matcher_num_points': -1, 'loss__target_': 'models.criterion.SetCriterion', 'loss_num_classes': 19, 'loss_eos_coef': 0.1, 'loss_losses': ['labels', 'masks'], 'loss_num_points': -1, 'loss_oversample_ratio': 3.0, 'loss_importance_sample_ratio': 0.75, 'loss_class_weights': -1}
/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:446: LightningDeprecationWarning: Setting `Trainer(gpus=1)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=1)` instead.
rank_zero_deprecation(
/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:57: LightningDeprecationWarning: Setting `Trainer(weights_save_path=)` has been deprecated in v1.6 and will be removed in v1.8. Please pass ``dirpath`` directly to the `ModelCheckpoint` callback
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:616: UserWarning: Checkpoint directory /media/vcl/SSD-DATA/Duc/Project/Mask3D/saved/validation exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
-------------------------------------------
0 | model | Mask3D | 39.6 M
1 | criterion | SetCriterion | 0
-------------------------------------------
39.6 M Trainable params
0 Non-trainable params
39.6 M Total params
158.466 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:08<00:00, 4.04s/it]Computing AP for class: 6
/media/vcl/SSD-DATA/Duc/Project/Mask3D/utils/votenet_utils/eval_det.py:166: RuntimeWarning: invalid value encountered in divide
rec = tp / float(npos)
6 nan
Computing AP for class: 3
3 0.0
Computing AP for class: 7
7 0.0
Computing AP for class: 5
5 0.0
Computing AP for class: 39
39 0.0
Computing AP for class: 24
24 0.0
Computing AP for class: 9
9 0.0
Computing AP for class: 12
12 0.0
Computing AP for class: 34
34 0.0
Computing AP for class: 8
8 0.0
Computing AP for class: 6
6 nan
Computing AP for class: 3
3 0.0
Computing AP for class: 7
7 0.0
Computing AP for class: 5
5 0.0
Computing AP for class: 39
39 0.0
Computing AP for class: 24
24 0.0
Computing AP for class: 9
9 0.0
Computing AP for class: 12
12 0.0
Computing AP for class: 34
34 0.0
Computing AP for class: 8
8 0.0
evaluating 2 scans...
scans processed: 2
################################################################
what : AP AP_50% AP_25%
################################################################
cabinet : 0.000 0.000 0.000
bed : nan nan nan
chair : 0.000 0.000 0.000
sofa : nan nan nan
table : 0.000 0.000 0.000
door : 0.000 0.000 0.000
window : 0.000 0.000 0.000
bookshelf : nan nan nan
picture : nan nan nan
counter : 0.000 0.000 0.000
desk : nan nan nan
curtain : nan nan nan
refrigerator : 0.000 0.000 0.000
shower curtain : nan nan nan
toilet : nan nan nan
sink : 0.000 0.000 0.000
bathtub : nan nan nan
otherfurniture : 0.000 0.000 0.000
----------------------------------------------------------------
average : 0.000 0.000 0.000
Epoch 0: 0%| | 0/1201 [00:00<?, ?it/s]data exceeds threshold
Traceback (most recent call last):
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/core/utils.py", line 128, in run_job
ret.return_value = task_function(task_cfg)
File "/media/vcl/SSD-DATA/Duc/Project/Mask3D/main_instance_segmentation.py", line 108, in main
train(cfg)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/main.py", line 27, in decorated_main
return task_function(cfg_passthrough)
File "/media/vcl/SSD-DATA/Duc/Project/Mask3D/main_instance_segmentation.py", line 84, in train
runner.fit(model)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1673, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/torch/optim/adamw.py", line 119, in step
loss = closure()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
closure_result = closure()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure
step_output = self._step_fn()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 358, in training_step
return self.model.training_step(*args, **kwargs)
File "/media/vcl/SSD-DATA/Duc/Project/Mask3D/trainer/trainer.py", line 133, in training_step
raise RuntimeError("BATCH TOO BIG")
RuntimeError: BATCH TOO BIG
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/vcl/SSD-DATA/Duc/Project/Mask3D/main_instance_segmentation.py", line 114, in <module>
main()
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/home/vcl/anaconda3/envs/mask3d_cuda113/lib/python3.10/site-packages/hydra/_internal/utils.py", line 267, in run_and_report
print_exception(etype=None, value=ex, tb=final_tb) # type: ignore
TypeError: print_exception() got an unexpected keyword argument 'etype'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run validation at: https://wandb.ai/trandangtrungduc/scannet/runs/validation
wandb: Synced 3 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: saved/validation/wandb/run-20230718_105041-validation/logs
./scripts/scannet/scannet_val.sh: line 27: general.dbscan_eps=0.95: command not found
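The nan entries in the AP table above trace back to the RuntimeWarning in eval_det.py: a class with no ground-truth instances in the evaluated scans has npos == 0, so the recall is 0/0. A minimal sketch with hypothetical values (not the project's own code):

```python
import numpy as np

# Hypothetical values: cumulative true positives for a class that has
# no ground-truth instances in the two sanity-check scans.
tp = np.zeros(5)
npos = 0  # no ground-truth instances of this class

with np.errstate(invalid="ignore"):
    rec = tp / float(npos)  # 0/0 -> nan, the "invalid value" warning

print(np.isnan(rec).all())  # True: this nan then propagates into the AP table
```

With only 2 sanity-check scans this is expected: classes absent from those scans report nan rather than 0.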
Hi!
We train on an A40 GPU with 48GB of memory with 2cm voxelization.
You can train on 3090s with 5cm voxelization, i.e. data.voxel_size=0.05. However, performance will be worse.
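As a back-of-envelope estimate (an assumption, not a measured figure), the memory saving from coarser voxels scales roughly quadratically with the voxel edge for surface scans like ScanNet, and cubically for volumetric occupancy:

```python
# Rough scaling only: voxel count goes like (1/edge)^2 for surface
# scans and (1/edge)^3 for volumetric occupancy.
surface_reduction = (5 / 2) ** 2  # 2cm -> 5cm: ~6.25x fewer voxels
volume_reduction = (5 / 2) ** 3   # upper bound: ~15.6x fewer voxels
print(surface_reduction, volume_reduction)
```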
Best,
Jonas
Thank you for your advice. I modified it to data.voxel_size=0.05 and data.batch_size=1, but I still encountered "BATCH TOO BIG". After retracing the code, I think the problem comes from this snippet in trainer.py:
def training_step(self, batch, batch_idx):
    data, target, file_names = batch
    if data.features.shape[0] > self.config.general.max_batch_size:
        print("data exceeds threshold")
        raise RuntimeError("BATCH TOO BIG")
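Note that the comparison is against the number of voxels in the batch (data.features.shape[0]), not the number of scenes. A standalone sketch of the same guard, with a hypothetical voxel count and the max_batch_size=8 reported in the config dump above:

```python
def check_batch(num_voxels: int, max_batch_size: int) -> None:
    # Mirrors the guard in training_step: it compares the voxel count,
    # not the scene count, against max_batch_size.
    if num_voxels > max_batch_size:
        raise RuntimeError("BATCH TOO BIG")

# A single ScanNet scene easily has tens of thousands of voxels
# (hypothetical figure), so max_batch_size=8 rejects every batch:
try:
    check_batch(num_voxels=120_000, max_batch_size=8)
except RuntimeError as e:
    print(e)  # BATCH TOO BIG
```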
With max_batch_size=8, data.features.shape[0] is always a very large number by comparison. Do you think this is a code logic error? I would appreciate any advice.
Moreover, when I comment out the above check, I can train, but the results I get are 0 and nan, as below. Is this a problem, or should I wait a few more epochs?
################################################################
what : AP AP_50% AP_25%
################################################################
cabinet : 0.000 0.000 0.000
bed : nan nan nan
chair : 0.000 0.000 0.000
sofa : nan nan nan
table : 0.000 0.000 0.000
door : 0.000 0.000 0.000
window : 0.000 0.000 0.000
bookshelf : nan nan nan
picture : nan nan nan
counter : 0.000 0.000 0.000
desk : nan nan nan
curtain : nan nan nan
refrigerator : 0.000 0.000 0.000
shower curtain : nan nan nan
toilet : nan nan nan
sink : 0.000 0.000 0.000
bathtub : nan nan nan
otherfurniture : 0.000 0.000 0.000
----------------------------------------------------------------
average : 0.000 0.000 0.000
I'm facing the same problem as yours, like:
################################################################
what : AP AP_50% AP_25%
################################################################
ceiling : 0.000 0.000 0.000
floor : 0.000 0.000 0.000
wall : 0.001 0.006 0.009
beam : 0.000 0.000 0.000
column : nan nan nan
window : 0.000 0.000 0.000
door : 0.000 0.000 0.000
table : 0.101 0.283 0.283
chair : 0.000 0.000 0.000
sofa : 0.000 0.000 0.000
bookcase : nan nan nan
board : nan nan nan
clutter : nan nan nan
average : 0.011 0.032 0.032
Have you already solved the problem or found what caused it? I would appreciate it if you give me some advice.
Hi,
max_batch_size=8 means that you only train on 8 points in total in one training iteration. Therefore all point clouds will be filtered out and your network cannot learn anything. This parameter is very sensitive. Try to set it as large as possible until you run into memory issues.
Let me know if this helps :)
Best,
Jonas
Thanks for your reply!
The parameter max_batch_size is set to a default value of 99999999 in the config file here, and I didn't modify it. But I did set data.batch_size=1 in the .sh file to solve the memory issues. Now, following your advice, I have set it larger and enlarged the parameter data.voxel_size instead; do you think this modification is appropriate?
Furthermore, I encountered another issue: I'm training on my custom dataset and have defined my own list of semantic classes. However, during the validation stage, the semantic classes listed in the AP table are always ceiling, floor, wall, beam (shown in my comment above), which differ from my custom class list. Could you please guide me on how to modify this?
I eagerly await your response!