No predictions from model
normster opened this issue · 3 comments
Hi,
I'm trying to train a detection model with the plain ViT backbone on 8 GPUs, using the 100-epoch config with both the batch size and the learning rate scaled down 4x. Training seems to progress nicely until evaluation, at which point I get the following log output:
[05/25 13:32:36 fvcore.common.checkpoint]: Saving checkpoint to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/model_0005531.pth
[05/25 13:32:40 d2.data.datasets.coco]: Loaded 5000 images in COCO format from datasets/coco/annotations/instances_val2017.json
[05/25 13:32:41 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(1024, 1024), max_size=1024), FixedSizeCrop(crop_size=[1024, 1024])]
[05/25 13:32:41 d2.data.common]: Serializing 5000 elements to byte tensors and concatenating them all ...
[05/25 13:32:41 d2.data.common]: Serialized dataset takes 19.10 MiB
[05/25 13:32:41 d2.evaluation.evaluator]: Start inference on 625 images
[05/25 13:32:58 d2.evaluation.evaluator]: Inference done 11/625. 0.3023 s / img. ETA=0:03:10
[05/25 13:33:03 d2.evaluation.evaluator]: Inference done 28/625. 0.3014 s / img. ETA=0:03:03
[05/25 13:33:08 d2.evaluation.evaluator]: Inference done 45/625. 0.2988 s / img. ETA=0:02:56
[05/25 13:33:13 d2.evaluation.evaluator]: Inference done 61/625. 0.3007 s / img. ETA=0:02:53
[05/25 13:33:19 d2.evaluation.evaluator]: Inference done 78/625. 0.3015 s / img. ETA=0:02:48
[05/25 13:33:24 d2.evaluation.evaluator]: Inference done 94/625. 0.3023 s / img. ETA=0:02:44
[05/25 13:33:29 d2.evaluation.evaluator]: Inference done 110/625. 0.3029 s / img. ETA=0:02:39
[05/25 13:33:34 d2.evaluation.evaluator]: Inference done 127/625. 0.3021 s / img. ETA=0:02:34
[05/25 13:33:39 d2.evaluation.evaluator]: Inference done 144/625. 0.3020 s / img. ETA=0:02:28
[05/25 13:33:44 d2.evaluation.evaluator]: Inference done 161/625. 0.3016 s / img. ETA=0:02:23
[05/25 13:33:49 d2.evaluation.evaluator]: Inference done 177/625. 0.3020 s / img. ETA=0:02:18
[05/25 13:33:54 d2.evaluation.evaluator]: Inference done 193/625. 0.3030 s / img. ETA=0:02:14
[05/25 13:34:00 d2.evaluation.evaluator]: Inference done 210/625. 0.3029 s / img. ETA=0:02:08
[05/25 13:34:05 d2.evaluation.evaluator]: Inference done 226/625. 0.3032 s / img. ETA=0:02:04
[05/25 13:34:10 d2.evaluation.evaluator]: Inference done 242/625. 0.3033 s / img. ETA=0:01:59
[05/25 13:34:15 d2.evaluation.evaluator]: Inference done 259/625. 0.3029 s / img. ETA=0:01:53
[05/25 13:34:20 d2.evaluation.evaluator]: Inference done 275/625. 0.3031 s / img. ETA=0:01:48
[05/25 13:34:25 d2.evaluation.evaluator]: Inference done 292/625. 0.3029 s / img. ETA=0:01:43
[05/25 13:34:31 d2.evaluation.evaluator]: Inference done 309/625. 0.3028 s / img. ETA=0:01:38
[05/25 13:34:36 d2.evaluation.evaluator]: Inference done 326/625. 0.3027 s / img. ETA=0:01:32
[05/25 13:34:41 d2.evaluation.evaluator]: Inference done 342/625. 0.3028 s / img. ETA=0:01:27
[05/25 13:34:46 d2.evaluation.evaluator]: Inference done 359/625. 0.3026 s / img. ETA=0:01:22
[05/25 13:34:51 d2.evaluation.evaluator]: Inference done 376/625. 0.3022 s / img. ETA=0:01:17
[05/25 13:34:56 d2.evaluation.evaluator]: Inference done 393/625. 0.3021 s / img. ETA=0:01:11
[05/25 13:35:02 d2.evaluation.evaluator]: Inference done 410/625. 0.3022 s / img. ETA=0:01:06
[05/25 13:35:07 d2.evaluation.evaluator]: Inference done 426/625. 0.3024 s / img. ETA=0:01:01
[05/25 13:35:12 d2.evaluation.evaluator]: Inference done 443/625. 0.3018 s / img. ETA=0:00:56
[05/25 13:35:17 d2.evaluation.evaluator]: Inference done 460/625. 0.3018 s / img. ETA=0:00:51
[05/25 13:35:22 d2.evaluation.evaluator]: Inference done 477/625. 0.3016 s / img. ETA=0:00:45
[05/25 13:35:27 d2.evaluation.evaluator]: Inference done 493/625. 0.3020 s / img. ETA=0:00:40
[05/25 13:35:33 d2.evaluation.evaluator]: Inference done 510/625. 0.3020 s / img. ETA=0:00:35
[05/25 13:35:38 d2.evaluation.evaluator]: Inference done 527/625. 0.3019 s / img. ETA=0:00:30
[05/25 13:35:43 d2.evaluation.evaluator]: Inference done 543/625. 0.3022 s / img. ETA=0:00:25
[05/25 13:35:48 d2.evaluation.evaluator]: Inference done 560/625. 0.3021 s / img. ETA=0:00:20
[05/25 13:35:53 d2.evaluation.evaluator]: Inference done 577/625. 0.3018 s / img. ETA=0:00:14
[05/25 13:35:58 d2.evaluation.evaluator]: Inference done 593/625. 0.3019 s / img. ETA=0:00:09
[05/25 13:36:03 d2.evaluation.evaluator]: Inference done 610/625. 0.3018 s / img. ETA=0:00:04
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference time: 0:03:12.073198 (0.309795 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.evaluator]: Total inference pure compute time: 0:03:06 (0.301541 s / img per device, on 8 devices)
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Saving results to output/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae/coco_instances_results.json
[05/25 13:36:08 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
WARNING [05/25 13:36:08 d2.evaluation.coco_evaluation]: No predictions from the model!
[05/25 13:36:08 d2.evaluation.testing]: copypaste: Task: bbox
[05/25 13:36:08 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[05/25 13:36:08 d2.evaluation.testing]: copypaste: nan,nan,nan,nan,nan,nan
[05/25 13:36:15 d2.utils.events]: eta: 1 day, 16:30:37 iter: 5539 total_loss: 1.138 loss_cls: 0.2928 loss_box_reg: 0.2585 loss_mask: 0.3725 loss_rpn_cls: 0.06554 loss_rpn_loc: 0.1448 time: 0.8175 data_time: 0.0220 lr: 1.9955e-05 max_mem: 26732M
[05/25 13:36:31 d2.utils.events]: eta: 1 day, 16:30:20 iter: 5559 total_loss: 1.207 loss_cls: 0.3106 loss_box_reg: 0.2719 loss_mask: 0.3847 loss_rpn_cls: 0.06758 loss_rpn_loc: 0.1353 time: 0.8175 data_time: 0.0225 lr: 1.9955e-05 max_mem: 26732M
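For reference, the scaling I applied amounts to roughly the following. This is only a sketch: the config keys `dataloader.train.total_batch_size` and `optimizer.lr` are assumed from detectron2's LazyConfig conventions, and the baseline values are illustrative, so they may not match this repo exactly.

```python
# Sketch only: assumed LazyConfig keys and illustrative baseline values.
from detectron2.config import LazyConfig

cfg = LazyConfig.load("configs/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae.py")
cfg.dataloader.train.total_batch_size //= 4  # e.g. 64 -> 16 to fit on 8 GPUs
cfg.optimizer.lr /= 4                        # linear scaling: lr shrinks with the batch
```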
Has anyone else seen this before? Training continues without any apparent problems after eval, so it doesn't look like a divergence issue.
Thanks!
Hi, @normster. Thanks for your interest in our work.
I suggest scaling the batch size down by 4x but the learning rate down by only 2x.
I also suggest aligning your environment with ours; please see SysCV/transfiner#17 (comment).
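For concreteness, that lr suggestion matches the square-root scaling rule rather than the linear rule. A quick sketch of the arithmetic (the baseline batch size and learning rate below are illustrative placeholders, not values from this repo's config):

```python
# Two common lr-scaling rules when shrinking the batch.
# Baseline numbers are illustrative placeholders only.
base_batch, base_lr = 64, 1e-4
new_batch = base_batch // 4                          # batch scaled down 4x
linear_lr = base_lr * (new_batch / base_batch)       # linear rule: lr / 4
sqrt_lr = base_lr * (new_batch / base_batch) ** 0.5  # square-root rule: lr / 2
```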
What environment should I use? The environment in the comment you linked differs from the one suggested in this repo's README.
I don't think the torch/d2 versions are the cause: running evaluation on the downloaded weights produces predictions, and the results are in line with the reported numbers.
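That eval-only check looked roughly like this; a minimal sketch assuming this repo uses detectron2's LazyConfig tooling, with the config and checkpoint paths as placeholders:

```python
# Minimal eval-only sketch; assumes detectron2's LazyConfig system.
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate
from detectron2.evaluation import inference_on_dataset

cfg = LazyConfig.load("configs/benchmarking_mask_rcnn_base_FPN_100ep_LSJ_mae.py")
model = instantiate(cfg.model).cuda().eval()
DetectionCheckpointer(model).load("/path/to/downloaded_weights.pth")  # placeholder path

results = inference_on_dataset(
    model,
    instantiate(cfg.dataloader.test),
    instantiate(cfg.dataloader.evaluator),
)
print(results)  # should report bbox/segm AP if the model produces predictions
```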