OPEN-AIR-SUN/mars

Black evaluation images

Closed this issue · 4 comments

Hi,

I'm currently experimenting with the PandaSet data parser based on the fork from @pierremerriaux-leddartech. During training, the evaluation images look good at first and their quality improves gradually. However, around step 13500 the evaluation image turns completely black and stays that way for the rest of training. The accumulation and depth maps also look strange, starting with the evaluation at step 13500. Do you have any idea what could cause this behaviour?
Here are the images from the previous evaluation at step 13000:

img_13000_6b1379e7846e0f1391b4
objects_depth_13000_7204beef165a53577f3a
objects_rgb_13000_5679c1c53e9bbc01c6fa
accumulation_13000_765d50cc9d5ddfa55158
background_13000_d771604c6016ee983686

And here at step 13500:

img_13500_951a44aaebeff865bb72
objects_depth_13500_1752a22f2675141f2f77
objects_rgb_13500_51e457f4207c6361d8d2
accumulation_13500_5783615c36587133a80c
background_13500_51e457f4207c6361d8d2

And here is the config:

!!python/object:nerfstudio.engine.trainer.TrainerConfig
_target: !!python/name:nerfstudio.engine.trainer.Trainer ''
data: &id003 !!python/object/apply:pathlib.PosixPath
- /
- zfs
- penshorn
- master_thesis
- datasets
- raw
- PandaSet
experiment_name: PandaSet
gradient_accumulation_steps: 1
load_checkpoint: null
load_config: null
load_dir: null
load_scheduler: true
load_step: null
log_gradients: true
logging: !!python/object:nerfstudio.configs.base_config.LoggingConfig
local_writer: !!python/object:nerfstudio.configs.base_config.LocalWriterConfig
  _target: !!python/name:nerfstudio.utils.writer.LocalWriter ''
  enable: true
  max_log_size: 10
  stats_to_track: !!python/tuple
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - Train Iter (time)
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - Train Rays / Sec
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - Test PSNR
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - Vis Rays / Sec
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - Test Rays / Sec
  - !!python/object/apply:nerfstudio.utils.writer.EventName
    - ETA (time)
max_buffer_size: 20
profiler: basic
relative_log_dir: !!python/object/apply:pathlib.PosixPath []
steps_per_log: 10
machine: !!python/object:nerfstudio.configs.base_config.MachineConfig
device_type: cuda
dist_url: auto
machine_rank: 0
num_devices: 1
num_machines: 1
seed: 42
max_num_iterations: 600000
method_name: mars-pandaset-nerfacto-object-wise-recon
mixed_precision: false
optimizers:
background_model:
  optimizer: !!python/object:nerfstudio.engine.optimizers.RAdamOptimizerConfig
    _target: &id001 !!python/name:torch.optim.radam.RAdam ''
    eps: 1.0e-15
    lr: 0.001
    max_norm: null
    weight_decay: 0
  scheduler: !!python/object:nerfstudio.engine.schedulers.ExponentialDecaySchedulerConfig
    _target: &id002 !!python/name:nerfstudio.engine.schedulers.ExponentialDecayScheduler ''
    lr_final: 1.0e-05
    lr_pre_warmup: 1.0e-08
    max_steps: 200000
    ramp: cosine
    warmup_steps: 0
learnable_global:
  optimizer: !!python/object:nerfstudio.engine.optimizers.RAdamOptimizerConfig
    _target: *id001
    eps: 1.0e-15
    lr: 0.001
    max_norm: null
    weight_decay: 0
  scheduler: !!python/object:nerfstudio.engine.schedulers.ExponentialDecaySchedulerConfig
    _target: *id002
    lr_final: 1.0e-05
    lr_pre_warmup: 1.0e-08
    max_steps: 200000
    ramp: cosine
    warmup_steps: 0
object_model:
  optimizer: !!python/object:nerfstudio.engine.optimizers.RAdamOptimizerConfig
    _target: *id001
    eps: 1.0e-15
    lr: 0.005
    max_norm: null
    weight_decay: 0
  scheduler: !!python/object:nerfstudio.engine.schedulers.ExponentialDecaySchedulerConfig
    _target: *id002
    lr_final: 1.0e-05
    lr_pre_warmup: 1.0e-08
    max_steps: 200000
    ramp: cosine
    warmup_steps: 0
output_dir: !!python/object/apply:pathlib.PosixPath
- outputs
pipeline: !!python/object:mars.mars_pipeline.MarsPipelineConfig
_target: !!python/name:mars.mars_pipeline.MarsPipeline ''
datamanager: !!python/object:mars.data.mars_datamanager.MarsDataManagerConfig
  _target: !!python/name:mars.data.mars_datamanager.MarsDataManager ''
  camera_optimizer: !!python/object:nerfstudio.cameras.camera_optimizers.CameraOptimizerConfig
    _target: !!python/name:nerfstudio.cameras.camera_optimizers.CameraOptimizer ''
    mode: 'off'
    optimizer: !!python/object:nerfstudio.engine.optimizers.AdamOptimizerConfig
      _target: !!python/name:torch.optim.adam.Adam ''
      eps: 1.0e-15
      lr: 0.0006
      max_norm: null
      weight_decay: 0
    orientation_noise_std: 0.0
    param_group: camera_opt
    position_noise_std: 0.0
    scheduler: !!python/object:nerfstudio.engine.schedulers.ExponentialDecaySchedulerConfig
      _target: *id002
      lr_final: null
      lr_pre_warmup: 1.0e-08
      max_steps: 10000
      ramp: cosine
      warmup_steps: 0
  camera_res_scale_factor: 1.0
  collate_fn: !!python/name:nerfstudio.data.utils.nerfstudio_collate.nerfstudio_collate ''
  data: *id003
  dataparser: !!python/object:mars.data.mars_pandaset_dataparser.MarsPandasetDataParserConfig
    _target: !!python/name:mars.data.mars_pandaset_dataparser.MarsPandasetParser ''
    add_input_rows: -1
    alpha_color: white
    bckg_only: false
    box_scale: 1.1
    cameras_name_list:
    - front_camera
    car_nerf_state_dict_path: !!python/object/apply:pathlib.PosixPath
    - /
    - home
    - pierre.merriaux
    - data
    - mars-nerf
    - latents
    - KITTI-MOT
    - car-nerf-state-dict
    - epoch_670.ckpt
    car_object_latents_path: !!python/object/apply:pathlib.PosixPath
    - /
    - home
    - pierre.merriaux
    - project
    - mars-refact
    - mars
    - pandaset_init_seq109.pt
    chunk: 32768
    data: !!python/object/apply:pathlib.PosixPath
    - data
    - pandaset
    dataset_type: pandaset
    far_plane: 150.0
    first_frame: 0
    last_frame: 40
    max_input_objects: -1
    near_plane: 0.5
    netchunk: 65536
    novel_view: left
    obj_only: false
    obj_opaque: true
    object_setting: 0
    render_only: false
    scale_factor: 0.01
    scene_scale: 1.0
    semantic_mask_classes: []
    semantic_path: !!python/object/apply:pathlib.PosixPath []
    seq_name: '011'
    split_setting: reconstruction
    use_car_latents: false
    use_depth: false
    use_obj: true
    use_object_properties: true
    use_semantic: false
  eval_image_indices: !!python/tuple
  - 0
  eval_num_images_to_sample_from: -1
  eval_num_rays_per_batch: 4096
  eval_num_times_to_repeat_images: -1
  images_on_gpu: false
  masks_on_gpu: false
  patch_size: 1
  pixel_sampler: !!python/object:nerfstudio.data.pixel_samplers.PixelSamplerConfig
    _target: !!python/name:nerfstudio.data.pixel_samplers.PixelSampler ''
    is_equirectangular: false
    keep_full_image: false
    num_rays_per_batch: 4096
  train_num_images_to_sample_from: -1
  train_num_rays_per_batch: 4096
  train_num_times_to_repeat_images: -1
model: !!python/object:mars.models.scene_graph.SceneGraphModelConfig
  _target: !!python/name:mars.models.scene_graph.SceneGraphModel ''
  background_color: black
  background_model: !!python/object:mars.models.nerfacto.NerfactoModelConfig
    _target: &id004 !!python/name:mars.models.nerfacto.NerfactoModel ''
    appearance_embed_dim: 32
    background_color: black
    base_res: 16
    collider_params:
      far_plane: 6.0
      near_plane: 2.0
    disable_scene_contraction: false
    distortion_loss_mult: 0.002
    enable_collider: true
    eval_num_rays_per_chunk: 4096
    far_plane: 150.0
    features_per_level: 2
    hidden_dim: 64
    hidden_dim_color: 64
    hidden_dim_transient: 64
    implementation: tcnn
    interlevel_loss_mult: 1.0
    log2_hashmap_size: 19
    loss_coefficients:
      rgb_loss_coarse: 1.0
      rgb_loss_fine: 1.0
    max_res: 2048
    near_plane: 0.05
    num_levels: 16
    num_nerf_samples_per_ray: 97
    num_proposal_iterations: 2
    num_proposal_samples_per_ray: &id005 !!python/tuple
    - 256
    - 128
    obj_feat_dim: 0
    orientation_loss_mult: 0.0001
    pred_normal_loss_mult: 0.001
    predict_normals: false
    prompt: null
    proposal_initial_sampler: piecewise
    proposal_net_args_list:
    - hidden_dim: 16
      log2_hashmap_size: 17
      max_res: 128
      num_levels: 5
      use_linear: false
    - hidden_dim: 16
      log2_hashmap_size: 17
      max_res: 256
      num_levels: 5
      use_linear: false
    proposal_update_every: 5
    proposal_warmup: 5000
    proposal_weights_anneal_max_num_iters: 1000
    proposal_weights_anneal_slope: 10.0
    use_average_appearance_embedding: true
    use_gradient_scaling: false
    use_proposal_weight_anneal: true
    use_same_proposal_network: false
    use_single_jitter: true
  collider_params:
    far_plane: 6.0
    near_plane: 2.0
  debug_object_pose: false
  depth_loss_mult: 0
  depth_loss_type: !!python/object/apply:nerfstudio.model_components.losses.DepthLossType
  - 1
  depth_sigma: 0.05
  enable_collider: true
  eval_num_rays_per_chunk: 4096
  far_plane: 1000.0
  interlevel_loss_mult: 1.0
  is_euclidean_depth: false
  latent_size: 256
  loss_coefficients:
    rgb_loss_coarse: 1.0
    rgb_loss_fine: 1.0
  max_num_obj: -1
  mono_depth_loss_mult: 0.01
  near_plane: 0.05
  object_model_template: !!python/object:mars.models.nerfacto.NerfactoModelConfig
    _target: *id004
    appearance_embed_dim: 32
    background_color: black
    base_res: 16
    collider_params:
      far_plane: 6.0
      near_plane: 2.0
    disable_scene_contraction: false
    distortion_loss_mult: 0.002
    enable_collider: true
    eval_num_rays_per_chunk: 4096
    far_plane: 150.0
    features_per_level: 2
    hidden_dim: 64
    hidden_dim_color: 64
    hidden_dim_transient: 64
    implementation: tcnn
    interlevel_loss_mult: 1.0
    log2_hashmap_size: 19
    loss_coefficients:
      rgb_loss_coarse: 1.0
      rgb_loss_fine: 1.0
    max_res: 2048
    near_plane: 0.05
    num_levels: 16
    num_nerf_samples_per_ray: 97
    num_proposal_iterations: 2
    num_proposal_samples_per_ray: *id005
    obj_feat_dim: 0
    orientation_loss_mult: 0.0001
    pred_normal_loss_mult: 0.001
    predict_normals: false
    prompt: null
    proposal_initial_sampler: piecewise
    proposal_net_args_list:
    - hidden_dim: 16
      log2_hashmap_size: 17
      max_res: 128
      num_levels: 5
      use_linear: false
    - hidden_dim: 16
      log2_hashmap_size: 17
      max_res: 256
      num_levels: 5
      use_linear: false
    proposal_update_every: 5
    proposal_warmup: 5000
    proposal_weights_anneal_max_num_iters: 1000
    proposal_weights_anneal_slope: 10.0
    use_average_appearance_embedding: true
    use_gradient_scaling: false
    use_proposal_weight_anneal: true
    use_same_proposal_network: false
    use_single_jitter: true
  object_ray_sample_strategy: remove-bg
  object_representation: object-wise
  object_warmup_steps: 1000
  orientation_loss_mult: 0.0001
  pred_normal_loss_mult: 0.001
  predict_normals: false
  prompt: null
  ray_add_input_rows: -1
  semantic_loss_mult: 1.0
  should_decay_sigma: false
  sigma_decay_rate: 0.9998
  sky_model: !!python/object:mars.models.sky_model.SkyModelConfig
    _target: !!python/name:mars.models.sky_model.SkyModel ''
    collider_params:
      far_plane: 6.0
      near_plane: 2.0
    enable_collider: true
    eval_num_rays_per_chunk: 4096
    hidden_dim: 128
    loss_coefficients:
      rgb_loss_coarse: 1.0
      rgb_loss_fine: 1.0
    num_layers: 5
    prompt: null
  starting_depth_sigma: 4.0
  use_interlevel_loss: true
  use_sky_model: false
project_name: nerfstudio-project
prompt: null
relative_model_dir: !!python/object/apply:pathlib.PosixPath
- nerfstudio_models
save_only_latest_checkpoint: true
steps_per_eval_all_images: 5000
steps_per_eval_batch: 500
steps_per_eval_image: 500
steps_per_save: 2000
timestamp: 2024-01-12_222029
use_grad_scaler: true
viewer: !!python/object:nerfstudio.configs.base_config.ViewerConfig
image_format: jpeg
jpeg_quality: 90
make_share_url: false
max_num_display_images: 512
num_rays_per_chunk: 32768
quit_on_train_completion: true
relative_log_filename: viewer_log_filename.txt
websocket_host: 0.0.0.0
websocket_port: null
websocket_port_default: 7007
vis: wandb

I have run into the same problem. Check whether rgb_loss is NaN. Decreasing the learning rate usually mitigates it.
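
One way to run that check is to scan the logged loss values for the first step at which any term stops being finite. The sketch below is framework-agnostic and uses only the standard library. The function `first_non_finite_step` and the shape of `history` (a list of `(step, loss_dict)` pairs) are assumptions made for illustration, not part of nerfstudio or MARS, and the numbers are made up.

```python
import math


def first_non_finite_step(history):
    """Return (step, loss_name) for the first NaN/inf loss value, or None."""
    for step, losses in history:
        for name, value in losses.items():
            if not math.isfinite(value):
                return step, name
    return None


# Hypothetical logged values; real ones would come from the training logs.
history = [
    (100, {"rgb_loss": 0.012}),
    (200, {"rgb_loss": float("nan")}),
    (300, {"rgb_loss": float("nan")}),
]
print(first_non_finite_step(history))  # (200, 'rgb_loss')
```

Knowing the first bad step narrows down which checkpoint to resume from when retrying with a lower learning rate.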

Hi! @j-pens

Maybe you can check the gradient?
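
A minimal sketch of such a gradient check, assuming the gradients of each optimizer group have already been flattened into plain lists of floats (in PyTorch, for example, via `p.grad.flatten().tolist()`). The helper `non_finite_grad_groups` is hypothetical and the values are made up. The group names are the ones listed under `optimizers` in the config above.

```python
import math


def non_finite_grad_groups(grads_by_group):
    """Return the sorted names of groups containing a NaN or inf gradient value."""
    return sorted(
        name
        for name, values in grads_by_group.items()
        if any(not math.isfinite(v) for v in values)
    )


# Hypothetical flattened gradients, keyed by optimizer group.
grads = {
    "background_model": [0.01, -0.02],
    "learnable_global": [0.0, 0.003],
    "object_model": [float("nan"), 0.1],
}
print(non_finite_grad_groups(grads))  # ['object_model']
```

Running this once per step, right after the backward pass, shows which group goes non-finite first.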

I reran the training with use_grad_scaler=False and that seems to have fixed it. I noticed that this option is enabled for all other methods. Could there be anything related to this option that might lead to the previous issue?

Edit: The gradient also seems to be NaN.
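
As background for the question above: a gradient scaler multiplies the loss by a scale factor before the backward pass. When it finds inf or NaN gradients, it skips that optimizer step and lowers the scale. The sketch below models only that update rule in plain Python. The class `ToyGradScaler` is hypothetical, and its default constants are assumed to mirror the documented defaults of `torch.cuda.amp.GradScaler`. It does not show what happens inside MARS or explain why this run diverged.

```python
import math


class ToyGradScaler:
    """Toy model of dynamic loss scaling (not the PyTorch implementation)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients if all are finite, else None (step skipped)."""
        if all(math.isfinite(g) for g in scaled_grads):
            grads = [g / self.scale for g in scaled_grads]
            self._good_steps += 1
            # Grow the scale after a run of consecutive finite steps.
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
            return grads
        # Non-finite gradient: skip the update and back off the scale.
        self.scale *= self.backoff_factor
        self._good_steps = 0
        return None


scaler = ToyGradScaler()
print(scaler.step([65536.0, 131072.0]))  # [1.0, 2.0]
print(scaler.step([float("inf"), 1.0]))  # None
print(scaler.scale)                      # 32768.0
```

Note that the posted config combines `use_grad_scaler: true` with `mixed_precision: false`.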

Hi, on my system it crashes with use_grad_scaler=True, so I always deactivate it.