vidit09/domaingen

Why can't I achieve the experimental effect in the paper?

ssunguotu opened this issue · 8 comments

Thanks for your work!
In the Daytime Clear test scenario I achieved an mAP of only 46.4%, which differs from the 51.3% reported in the paper. I used two NVIDIA RTX 3090 GPUs for training and made some modifications to train.py to enable multi-GPU training:

 if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
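With this change, training on two GPUs is launched in the usual detectron2 style, e.g. python train.py --num-gpus 2 --config-file configs/diverse_weather.yaml (the --num-gpus flag is provided by default_argument_parser).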

Additionally, in the run_step function, I made the following changes:

    opt_phase = False
    # Offset-optimization phase; note the extra .module, needed because DDP wraps the model.
    if len(self.off_opt_interval) and self.off_opt_interval[0] <= self.iter < self.off_opt_interval[0] + self.off_opt_iters:
        if self.iter == self.off_opt_interval[0]:
            self.model.module.offsets.data = torch.zeros(self.model.module.offsets.shape).cuda()
        loss_dict_s = self.model.module.opt_offsets(data_s)
        opt_phase = True
        if self.iter + 1 == self.off_opt_interval[0] + self.off_opt_iters:
            self.off_opt_interval.pop(0)
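
For reference, a common way to keep run_step working on both single-GPU and multi-GPU runs is to unwrap the model conditionally, since .module only exists once DistributedDataParallel wraps the network. A minimal sketch (unwrap_model is a hypothetical helper, not part of the repo):

    from torch.nn.parallel import DistributedDataParallel

    def unwrap_model(model):
        # DDP wraps the original network; reach it via .module only when wrapped.
        return model.module if isinstance(model, DistributedDataParallel) else model

    # The snippet above would then read:
    #     net = unwrap_model(self.model)
    #     net.offsets.data = torch.zeros(net.offsets.shape).cuda()
    #     loss_dict_s = net.opt_offsets(data_s)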

Looking forward to your reply, thank you!

Oh, I found that I get the same result when I use the model you posted. Am I testing it in the wrong way?
I modified the config like this, in configs/diverse_weather.yaml:

    MODEL:
      BACKBONE:
        NAME: ClipRN101
      WEIGHTS: "/code/domaingen/diverse-weights.pth"

and run
python train.py --eval-only --config-file configs/diverse_weather.yaml
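
Assuming train.py keeps detectron2's default argument handling (where trailing key-value pairs are merged into the config), the weights path can also be passed on the command line instead of editing the YAML:

python train.py --eval-only --config-file configs/diverse_weather.yaml MODEL.WEIGHTS /code/domaingen/diverse-weights.pth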

Hello @ssunguotu, thank you for your interest.

The evaluation command you mentioned is correct. Could you please verify that the checkpoint returns the reported mAP@50 without your code modifications?

Thank you for your reply!
I have verified that the checkpoint returns the same (lower) mAP@50 even without my code modifications.
Here is my log from testing the model. I found it strange that my dataset seems to contain only 8289 images at test time, while the dataset actually has 8313 images. I have checked /daytime_clear/ImageSets/Main/test.txt, /daytime_clear/Annotations, and /daytime_clear/JPEGImages, and the counts are all correct.
I don't know whether this is the cause of the result difference.
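
For reference, a quick sanity check for where the 24 images go missing is to verify that every id in test.txt has a matching annotation and image file. A minimal sketch, assuming a VOC-style layout (the dataset root is a placeholder):

    import os

    root = "/path/to/daytime_clear"  # placeholder path, adjust to your setup
    with open(os.path.join(root, "ImageSets/Main/test.txt")) as f:
        ids = [line.strip() for line in f if line.strip()]

    no_ann = [i for i in ids if not os.path.exists(os.path.join(root, "Annotations", i + ".xml"))]
    no_img = [i for i in ids if not os.path.exists(os.path.join(root, "JPEGImages", i + ".jpg"))]
    print(len(ids), "ids;", len(no_ann), "missing annotations;", len(no_img), "missing images")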

[08/29 10:40:34 detectron2]: Full config saved to all_outs/diverse_weather/origin_v2/config.yaml
[08/29 10:40:34 d2.utils.env]: Using a generated random seed 34803406
['bus', 'bike', 'car', 'motor', 'person', 'rider', 'truck']
[08/29 10:41:03 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /code/domaingen/diverse-weights.pth ...
[08/29 10:41:03 fvcore.common.checkpoint]: [Checkpointer] Loading from /code/domaingen/diverse-weights.pth ...
[08/29 10:41:09 d2.data.build]: Distribution of instances among all 7 categories:
|  category  | #instances   |  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|:----------:|:-------------|
|    bus     | 1738         |    bike    | 1046         |    car     | 95339        |
|   motor    | 537          |   person   | 12309        |   rider    | 787          |
|   truck    | 5029         |            |              |            |              |
|   total    | 116785       |            |              |            |              |
[08/29 10:41:09 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(600, 600), max_size=1333, sample_style='choice')]
[08/29 10:41:09 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
[08/29 10:41:09 d2.data.common]: Serializing 8289 elements to byte tensors and concatenating them all ...
[08/29 10:41:09 d2.data.common]: Serialized dataset takes 7.92 MiB
[08/29 10:41:09 d2.evaluation.evaluator]: Start inference on 8289 batches
/miniconda3/envs/frcnn/lib/python3.10/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[08/29 10:41:16 d2.evaluation.evaluator]: Inference done 11/8289. Dataloading: 0.0016 s/iter. Inference: 0.2062 s/iter. Eval: 0.0007 s/iter. Total: 0.2085 s/iter. ETA=0:28:45
[08/29 10:41:21 d2.evaluation.evaluator]: Inference done 36/8289. Dataloading: 0.0025 s/iter. Inference: 0.2000 s/iter. Eval: 0.0007 s/iter. Total: 0.2033 s/iter. ETA=0:27:57
[08/29 10:41:26 d2.evaluation.evaluator]: Inference done 62/8289. Dataloading: 0.0022 s/iter. Inference: 0.1971 s/iter. Eval: 0.0007 s/iter. Total: 0.2001 s/iter. ETA=0:27:25
..........
[08/29 11:08:47 d2.evaluation.evaluator]: Inference done 8284/8289. Dataloading: 0.0014 s/iter. Inference: 0.1975 s/iter. Eval: 0.0006 s/iter. Total: 0.1995 s/iter. ETA=0:00:00
[08/29 11:08:48 d2.evaluation.evaluator]: Total inference time: 0:27:32.976390 (0.199538 s / iter per device, on 1 devices)
[08/29 11:08:48 d2.evaluation.evaluator]: Total inference pure compute time: 0:27:16 (0.197507 s / iter per device, on 1 devices)
[08/29 11:08:48 d2.evaluation.pascal_voc_evaluation]: Evaluating daytime_clear_test using 2007 metric. Note that results do not use the official Matlab API.
[08/29 11:11:11 d2.evaluation.pascal_voc_evaluation]: classwise ap 53.27,42.05,57.16,39.37,39.86,40.24,52.46
[08/29 11:11:11 detectron2]: Evaluation results for daytime_clear_test in csv format:
[08/29 11:11:11 d2.evaluation.testing]: copypaste: Task: bbox
[08/29 11:11:11 d2.evaluation.testing]: copypaste: AP,AP50,AP75
[08/29 11:11:11 d2.evaluation.testing]: copypaste: 22.8409,46.3442,18.9491

I re-ran the evaluation code with the provided checkpoint and was able to get the 51.3 mAP.

[08/30 13:32:03 detectron2]: Full config saved to all_outs/diverse_weather/config.yaml
[08/30 13:32:03 d2.utils.env]: Using a generated random seed 3127000
['bus', 'bike', 'car', 'motor', 'person', 'rider', 'truck']
[08/30 13:32:18 fvcore.common.checkpoint]: [Checkpointer] Loading from diverse-weights.pth ...
[08/30 13:32:22 d2.data.build]: Distribution of instances among all 7 categories:
|  category  | #instances   |  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|:----------:|:-------------|
|    bus     | 1738         |    bike    | 1046         |    car     | 95339        |
|   motor    | 537          |   person   | 12309        |   rider    | 787          |
|   truck    | 5029         |            |              |            |              |
|   total    | 116785       |            |              |            |              |
[08/30 13:32:22 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(600, 600), max_size=1333, sample_style='choice')]
[08/30 13:32:22 d2.data.common]: Serializing 8289 elements to byte tensors and concatenating them all ...
[08/30 13:32:22 d2.data.common]: Serialized dataset takes 8.26 MiB
[08/30 13:32:22 d2.evaluation.evaluator]: Start inference on 8289 batches
[08/30 13:32:24 d2.evaluation.evaluator]: Inference done 11/8289. Dataloading: 0.0012 s/iter. Inference: 0.0842 s/iter. Eval: 0.0004 s/iter. Total: 0.0858 s/iter. ETA=0:11:50 

[08/30 13:45:06 d2.evaluation.evaluator]: Inference done 8253/8289. Dataloading: 0.0011 s/iter. Inference: 0.0907 s/iter. Eval: 0.0006 s/iter. Total: 0.0924 s/iter. ETA=0:00:03
[08/30 13:45:10 d2.evaluation.evaluator]: Total inference time: 0:12:46.125840 (0.092483 s / iter per device, on 1 devices)
[08/30 13:45:10 d2.evaluation.evaluator]: Total inference pure compute time: 0:12:31 (0.090753 s / iter per device, on 1 devices)
[08/30 13:45:10 d2.evaluation.pascal_voc_evaluation]: Evaluating daytime_clear_test using 2007 metric. Note that results do not use the official Matlab API.
[08/30 13:47:31 d2.evaluation.pascal_voc_evaluation]: classwise ap 54.90,46.23,66.08,45.19,47.45,44.63,54.33
[08/30 13:47:31 detectron2]: Evaluation results for daytime_clear_test in csv format:
[08/30 13:47:31 d2.evaluation.testing]: copypaste: Task: bbox
[08/30 13:47:31 d2.evaluation.testing]: copypaste: AP,AP50,AP75
[08/30 13:47:31 d2.evaluation.testing]: copypaste: 27.3545,51.2566,24.1571

Could you please verify that the requirements are properly set up? detectron2 0.6 and torch 1.10, both with CUDA 11.3.

The mAP result changed after I rebuilt the environment, but it still differs from 51.3; in fact it is even lower now.
I am really confused... why would the results change?
Here is my environment:
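
(For reference, this table can be reproduced with detectron2's environment dump; a minimal sketch, assuming the collect_env module shipped with detectron2 0.6:)

    # Prints detectron2/PyTorch/CUDA environment info.
    from detectron2.utils.collect_env import collect_env_info
    print(collect_env_info())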

sys.platform            linux
Python                  3.9.17 (main, Jul  5 2023, 20:41:20) [GCC 11.2.0]
numpy                   1.21.5
detectron2              0.6 @/code/detectron2/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 11.3
detectron2 arch flags   8.6
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.0 @/miniconda3/envs/detectron/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 NVIDIA GeForce RTX 3090 (arch=8.6)
Driver version          470.103.01
CUDA_HOME               /usr/local/cuda
Pillow                  8.2.0
torchvision             0.11.0 @/miniconda3/envs/detectron/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20221221
iopath                  0.1.9
cv2                     4.5.2

OK... It seems I found the reason. The daytime_clear dataset on my server differs from the official version. After I replaced the old dataset with the latest version, the result came out fine. The mAP is 52.3, even better than the reported result.

[08/31 18:44:54 d2.evaluation.pascal_voc_evaluation]: Evaluating daytime_clear_test using 2007 metric. Note that results do not use the official Matlab API.
[08/31 18:46:57 d2.evaluation.pascal_voc_evaluation]: classwise ap 54.71,47.70,67.45,45.87,48.96,46.73,54.82
[08/31 18:46:57 detectron2]: Evaluation results for daytime_clear_test in csv format:
[08/31 18:46:57 d2.evaluation.testing]: copypaste: Task: bbox
[08/31 18:46:57 d2.evaluation.testing]: copypaste: AP,AP50,AP75
[08/31 18:46:57 d2.evaluation.testing]: copypaste: 29.3687,52.3192,27.2935

thanks, closing this issue now.

I also have 8289 images when testing. You said "the daytime_clear dataset on my server differs from the official version". What is the official version?