Problem of Training code

Question

Closed this issue a year ago · 3 comments

Firstly, I want to congratulate you on the presentation of your paper at ICCV. I found the paper quite interesting and I'm trying to understand the algorithm using the provided code. The evaluation seems to work without any issues, but I'm encountering errors with the training code.

I tried running the following code, as suggested:

python train_net.py \
--config-file configs/cityscapes/semantic-segmentation/swin/single_decoder_layer/maskformer2_swin_base_IN21k_384_bs16_90k_1dl.yaml \
--num-gpus 4 \
OUTPUT_DIR  model_logs/swin_b_1dl/

And I got the following error:

  File "/home/kimin/PT-2.0/RbA/mask2former/maskformer_model.py", line 276, in forward
    targets = self.prepare_targets(gt_instances, images)
  File "/home/kimin/PT-2.0/RbA/mask2former/maskformer_model.py", line 373, in prepare_targets
    "labels": targets_per_image.gt_classes,  # erors here
  File "/home/kimin/PT-2.0/detectron2/detectron2/structures/instances.py", line 66, in __getattr__
    raise AttributeError("Cannot find field '{}' in the given Instances!".format(name))
AttributeError: Cannot find field 'gt_classes' in the given Instances!
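
The AttributeError in the traceback above is raised whenever a field is read from an Instances object before anything has assigned it. The sketch below mimics that behaviour with a simplified stand-in class; InstancesSketch is a hypothetical name, it is not detectron2's actual implementation, and plain Python lists replace the tensors a real dataset mapper would store.

```python
class InstancesSketch:
    """Simplified stand-in for detectron2's Instances (illustration only)."""

    def __init__(self, image_size):
        # Store internals directly so our own __setattr__ is bypassed.
        object.__setattr__(self, "_image_size", image_size)
        object.__setattr__(self, "_fields", {})

    def __setattr__(self, name, value):
        # Every public attribute assignment becomes a named field.
        self._fields[name] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name == "_fields" or name not in self._fields:
            raise AttributeError(
                "Cannot find field '{}' in the given Instances!".format(name)
            )
        return self._fields[name]


inst = InstancesSketch((512, 1024))
inst.gt_masks = [[0, 1], [1, 0]]  # the mapper set masks but not classes

try:
    inst.gt_classes  # read before assignment, as in prepare_targets
    message = None
except AttributeError as err:
    message = str(err)

inst.gt_classes = [0, 1]  # assigning the field makes the later read succeed
```

In this sketch the read fails until some earlier step assigns gt_classes, which is why the model code cannot recover from a mapper that omits the assignment.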

Looking at the original Mask2Former code, it seems that there is a line in the mask_former_semantic_dataset_mapper.py file around line 183:

instances.gt_classes = torch.tensor(classes, dtype=torch.int64)

which is missing in the version I am using and appears to be the cause of the error. After adding it back, I got a new error, a RuntimeError caused by a NaN value, which prevents training from proceeding:

    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'ToCopyBackward0' returned nan values in its 0th output.

From what I understood from the paper, the Cityscapes Inlier Training with Mask2Former config is a baseline trained on Cityscapes, and the RbA + COCO Outlier Supervision config is the method proposed for performance improvement.

When trying to run only the RbA + COCO Outlier Supervision training config, I get an error about missing initial weights at WEIGHTS: model_logs/mask2former_dec_layers_2_res5_only/model_final.pth. I tried using the model_final from the Cityscapes Inlier Training, but I still get an error during training.

While I recognize that this could be an issue with the way I have set up Mask2Former, and I plan to try training Mask2Former again, I would appreciate any help you could provide in resolving this issue or making the training code operational.

Answer 1 · 2023-07-26T23:52:46.000Z

Thank you for updating the code. If I understand correctly, the training command in the current README performs semantic segmentation training with Mask2Former, not the method proposed in your paper. This would be the way to generate the Cityscapes Inlier Training model in the RbA Model Zoo.

It seems that this model is the basis for further training with the method proposed in your paper. The configs reference the following checkpoint:

model_logs/mask2former_dec_layers_2_res5_only/model_final.pth

This appears to be the model produced by the Cityscapes Inlier Training, so that is the one that should be used.

Also, the modified code still gives a NaN error. Upon further inspection, the error seems to be tied to the following context manager, which the original Mask2Former training code does not use:

with torch.autograd.set_detect_anomaly(True):

If there's anything I've misunderstood, I'd appreciate it if you could correct me. I'm also planning to try training with the modules proposed in your paper.

Answer 2 · 2023-07-27T07:54:53.000Z

Hi @kimin-yun, thank you for your interest in our work.

The issues you encountered result from our code cleanup and the renaming of model names for clarity, and we apologize for the inconvenience. We have made the following changes to address them:

Small modifications to the data mapper.
Removal of the with torch.autograd.set_detect_anomaly(True): block in train_net.py, as you suggested. It was used at some point for debugging but is no longer needed in the main training code.
Updated the config files to point to the newly renamed model checkpoints.
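
To illustrate the effect of the anomaly-detection wrapper mentioned in the second item: with anomaly mode enabled, PyTorch raises a RuntimeError as soon as a backward function returns NaN, while with it disabled the NaN is written into the gradient without an error. The sketch below is a toy example, not the RbA training loop; backward_with_nan is a hypothetical helper, and the sqrt-at-zero expression is chosen only because its gradient is 0/0.

```python
import torch


def backward_with_nan(detect_anomaly):
    """Run a tiny forward/backward pass whose gradient is NaN.

    Returns x.grad when no error is raised.
    """
    with torch.autograd.set_detect_anomaly(detect_anomaly):
        x = torch.zeros(1, requires_grad=True)
        # d/dx sqrt(x) at x = 0 is infinite; multiplying by 0 makes the
        # upstream gradient 0, so the sqrt backward computes 0 / 0 = NaN.
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
    return x.grad


# Anomaly mode off: backward completes and the gradient silently holds NaN.
silent_grad = backward_with_nan(False)

# Anomaly mode on: the same computation raises a RuntimeError in backward.
try:
    backward_with_nan(True)
    error_text = None
except RuntimeError as err:
    error_text = str(err)
```

Removing the wrapper therefore stops the exception but does not by itself remove NaN gradients, so it can still be worth checking gradients or the loss if training diverges.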

As for the training code, the config files we provide cover both inlier training and RbA fine-tuning with OoD data. The checkpoint model_logs/mask2former_dec_layers_2_res5_only/model_final.pth is essentially the same as ./ckpts/swin_b_1dl/model_final.pth. We renamed it for clarity, but the old path remained in some of the config files; all config files now point to the new path. ./ckpts/swin_l_1dl/config.yaml is also used for inlier training, but with the Swin L backbone.
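
A stale checkpoint path like the one described above can be caught before launching training by scanning a config file for its WEIGHTS entries and checking that each file exists. The following is a minimal sketch under stated assumptions: missing_weights is a hypothetical helper, it uses a plain regular expression instead of a YAML parser, and it assumes each entry is written as WEIGHTS: path on a single line.

```python
import re
from pathlib import Path

# Matches lines such as "  WEIGHTS: some/path/model_final.pth" (optionally quoted).
WEIGHTS_RE = re.compile(r"^[ \t]*WEIGHTS:[ \t]*[\"']?([^\"'#\s]+)", re.MULTILINE)


def missing_weights(config_path, root="."):
    """Return the WEIGHTS paths in config_path that do not exist under root."""
    text = Path(config_path).read_text()
    missing = []
    for entry in WEIGHTS_RE.findall(text):
        # Skip URL-style entries; only local files are checked here.
        if "://" in entry:
            continue
        if not (Path(root) / entry).is_file():
            missing.append(entry)
    return missing
```

For example, calling missing_weights("./ckpts/swin_b_1dl_rba_ood_coco/config.yaml") from the repository root would return an empty list if every local checkpoint the config refers to is present.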

For outlier fine-tuning with RbA, you can use the following configs, which should work now that the NaN issue is fixed:

./ckpts/swin_b_1dl_rba_ood_coco/config.yaml
./ckpts/swin_b_1dl_rba_ood_map_coco/config.yaml
./ckpts/swin_l_1dl_rba_ood_map_coco/config.yaml

Thank you again for your interest, and please do not hesitate to report any issues you encounter. We will try to address them as soon as possible.

Answer 3 · 2023-07-27T08:44:07.000Z

Thank you for your helpful responses and updates.