pytorch/vision

Cannot reproduce performance of Keypoint R-CNN with ResNet-50

yoshitomo-matsubara opened this issue · 7 comments

I tried two approaches to reproduce the reported performance of Keypoint R-CNN with ResNet-50 (box AP = 54.6, keypoint AP = 65.0):
a) use pretrained Keypoint R-CNN with train.py
b) train Keypoint R-CNN by myself with train.py

Neither of them reproduced the reported performance. As for a), my guess is that I need to set some parameters besides the pretrained flag.
Could you please help me reproduce the performance, ideally for both a) and b)? More details about my results are given below.
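
For reference, a minimal sketch of loading the same pretrained weights directly through the torchvision API, outside of train.py (the dummy input below is only illustrative):

import torch
import torchvision

# Load the published Keypoint R-CNN checkpoint (the same weights train.py uses with --pretrained).
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Detection models take a list of 3D float tensors in [0, 1]; the image size here is arbitrary.
dummy_image = [torch.rand(3, 800, 800)]
with torch.no_grad():
    prediction = model(dummy_image)[0]

# The output dict should contain 'boxes', 'labels', 'scores', 'keypoints' and 'keypoints_scores'.
print({k: v.shape for k, v in prediction.items()})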

Environment

  • 3 GPUs
  • Ubuntu 18.04 LTS
  • Python 3.6.8
  • torch==1.3.1
  • torchvision==0.4.2

Details

a) use pretrained Keypoint R-CNN with train.py

command: pipenv run python train.py --data-path ./coco2017/ --dataset coco_kp --model keypointrcnn_resnet50_fpn --test-only --pretrained
log

Not using distributed mode
Namespace(aspect_ratio_group_factor=0, batch_size=2, data_path='./coco2017/', dataset='coco_kp', device='cuda', dist_url='env://', distributed=False, epochs=13, lr=0.02, lr_gamma=0.1, lr_step_size=8, lr_steps=[8, 11], model='keypointrcnn_resnet50_fpn', momentum=0.9, output_dir='.', pretrained=True, print_freq=20, resume='', test_only=True, weight_decay=0.0001, workers=4, world_size=1)
Loading data
loading annotations into memory...
Done (t=6.30s)
creating index...
index created!
loading annotations into memory...
Done (t=0.74s)
creating index...
index created!
Creating data loaders
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [12345 35717]
Creating model
Test:  [   0/5000]  eta: 0:44:23  model_time: 0.2646 (0.2646)  evaluator_time: 0.0069 (0.0069)  time: 0.5326  data: 0.2532  max mem: 624
Test:  [ 100/5000]  eta: 0:07:54  model_time: 0.0764 (0.0810)  evaluator_time: 0.0037 (0.0089)  time: 0.0880  data: 0.0019  max mem: 712
Test:  [ 200/5000]  eta: 0:07:25  model_time: 0.0719 (0.0785)  evaluator_time: 0.0031 (0.0088)  time: 0.0899  data: 0.0018  max mem: 795
Test:  [ 300/5000]  eta: 0:07:08  model_time: 0.0733 (0.0779)  evaluator_time: 0.0040 (0.0082)  time: 0.0933  data: 0.0019  max mem: 817
Test:  [ 400/5000]  eta: 0:06:57  model_time: 0.0720 (0.0780)  evaluator_time: 0.0035 (0.0081)  time: 0.0824  data: 0.0017  max mem: 820
Test:  [ 500/5000]  eta: 0:06:43  model_time: 0.0656 (0.0772)  evaluator_time: 0.0032 (0.0077)  time: 0.0851  data: 0.0019  max mem: 820
Test:  [ 600/5000]  eta: 0:06:38  model_time: 0.0693 (0.0780)  evaluator_time: 0.0033 (0.0082)  time: 0.0793  data: 0.0018  max mem: 846
Test:  [ 700/5000]  eta: 0:06:32  model_time: 0.0678 (0.0783)  evaluator_time: 0.0034 (0.0085)  time: 0.0820  data: 0.0018  max mem: 853
Test:  [ 800/5000]  eta: 0:06:21  model_time: 0.0731 (0.0782)  evaluator_time: 0.0032 (0.0083)  time: 0.0805  data: 0.0017  max mem: 853
Test:  [ 900/5000]  eta: 0:06:12  model_time: 0.0748 (0.0782)  evaluator_time: 0.0029 (0.0084)  time: 0.0851  data: 0.0015  max mem: 858
Test:  [1000/5000]  eta: 0:06:01  model_time: 0.0713 (0.0779)  evaluator_time: 0.0030 (0.0082)  time: 0.0884  data: 0.0019  max mem: 858
Test:  [1100/5000]  eta: 0:05:52  model_time: 0.0713 (0.0778)  evaluator_time: 0.0040 (0.0082)  time: 0.0859  data: 0.0018  max mem: 858
Test:  [1200/5000]  eta: 0:05:43  model_time: 0.0715 (0.0780)  evaluator_time: 0.0031 (0.0082)  time: 0.0941  data: 0.0018  max mem: 872
Test:  [1300/5000]  eta: 0:05:36  model_time: 0.0725 (0.0783)  evaluator_time: 0.0033 (0.0085)  time: 0.0847  data: 0.0017  max mem: 872
Test:  [1400/5000]  eta: 0:05:28  model_time: 0.0780 (0.0785)  evaluator_time: 0.0042 (0.0086)  time: 0.1081  data: 0.0020  max mem: 872
Test:  [1500/5000]  eta: 0:05:18  model_time: 0.0718 (0.0782)  evaluator_time: 0.0033 (0.0085)  time: 0.0884  data: 0.0017  max mem: 872
Test:  [1600/5000]  eta: 0:05:08  model_time: 0.0752 (0.0782)  evaluator_time: 0.0047 (0.0084)  time: 0.1013  data: 0.0020  max mem: 872
Test:  [1700/5000]  eta: 0:05:00  model_time: 0.0687 (0.0784)  evaluator_time: 0.0032 (0.0085)  time: 0.0954  data: 0.0019  max mem: 884
Test:  [1800/5000]  eta: 0:04:50  model_time: 0.0665 (0.0782)  evaluator_time: 0.0028 (0.0084)  time: 0.0767  data: 0.0016  max mem: 884
Test:  [1900/5000]  eta: 0:04:41  model_time: 0.0689 (0.0782)  evaluator_time: 0.0027 (0.0085)  time: 0.0863  data: 0.0014  max mem: 888
Test:  [2000/5000]  eta: 0:04:32  model_time: 0.0712 (0.0781)  evaluator_time: 0.0032 (0.0084)  time: 0.0873  data: 0.0017  max mem: 888
Test:  [2100/5000]  eta: 0:04:22  model_time: 0.0720 (0.0781)  evaluator_time: 0.0028 (0.0084)  time: 0.0955  data: 0.0017  max mem: 888
Test:  [2200/5000]  eta: 0:04:13  model_time: 0.0734 (0.0780)  evaluator_time: 0.0039 (0.0083)  time: 0.0938  data: 0.0019  max mem: 888
Test:  [2300/5000]  eta: 0:04:04  model_time: 0.0688 (0.0781)  evaluator_time: 0.0027 (0.0083)  time: 0.0816  data: 0.0015  max mem: 894
Test:  [2400/5000]  eta: 0:03:55  model_time: 0.0777 (0.0781)  evaluator_time: 0.0032 (0.0083)  time: 0.0898  data: 0.0017  max mem: 895
Test:  [2500/5000]  eta: 0:03:46  model_time: 0.0704 (0.0783)  evaluator_time: 0.0034 (0.0084)  time: 0.0905  data: 0.0018  max mem: 895
Test:  [2600/5000]  eta: 0:03:37  model_time: 0.0723 (0.0783)  evaluator_time: 0.0030 (0.0083)  time: 0.0892  data: 0.0015  max mem: 895
Test:  [2700/5000]  eta: 0:03:28  model_time: 0.0708 (0.0783)  evaluator_time: 0.0029 (0.0084)  time: 0.0847  data: 0.0016  max mem: 896
Test:  [2800/5000]  eta: 0:03:19  model_time: 0.0719 (0.0782)  evaluator_time: 0.0032 (0.0083)  time: 0.0906  data: 0.0017  max mem: 896
Test:  [2900/5000]  eta: 0:03:10  model_time: 0.0741 (0.0782)  evaluator_time: 0.0037 (0.0083)  time: 0.0879  data: 0.0019  max mem: 896
Test:  [3000/5000]  eta: 0:03:01  model_time: 0.0756 (0.0783)  evaluator_time: 0.0042 (0.0083)  time: 0.0950  data: 0.0018  max mem: 900
Test:  [3100/5000]  eta: 0:02:51  model_time: 0.0709 (0.0782)  evaluator_time: 0.0029 (0.0082)  time: 0.0834  data: 0.0017  max mem: 900
Test:  [3200/5000]  eta: 0:02:42  model_time: 0.0734 (0.0782)  evaluator_time: 0.0035 (0.0082)  time: 0.0858  data: 0.0017  max mem: 900
Test:  [3300/5000]  eta: 0:02:34  model_time: 0.0726 (0.0783)  evaluator_time: 0.0029 (0.0083)  time: 0.0946  data: 0.0017  max mem: 903
Test:  [3400/5000]  eta: 0:02:24  model_time: 0.0687 (0.0782)  evaluator_time: 0.0032 (0.0082)  time: 0.0788  data: 0.0017  max mem: 903
Test:  [3500/5000]  eta: 0:02:15  model_time: 0.0685 (0.0782)  evaluator_time: 0.0030 (0.0082)  time: 0.0822  data: 0.0017  max mem: 903
Test:  [3600/5000]  eta: 0:02:06  model_time: 0.0764 (0.0783)  evaluator_time: 0.0029 (0.0082)  time: 0.0878  data: 0.0016  max mem: 903
Test:  [3700/5000]  eta: 0:01:57  model_time: 0.0739 (0.0783)  evaluator_time: 0.0043 (0.0082)  time: 0.0979  data: 0.0020  max mem: 903
Test:  [3800/5000]  eta: 0:01:48  model_time: 0.0790 (0.0783)  evaluator_time: 0.0047 (0.0083)  time: 0.1088  data: 0.0021  max mem: 906
Test:  [3900/5000]  eta: 0:01:39  model_time: 0.0701 (0.0782)  evaluator_time: 0.0029 (0.0082)  time: 0.0775  data: 0.0016  max mem: 906
Test:  [4000/5000]  eta: 0:01:30  model_time: 0.0720 (0.0782)  evaluator_time: 0.0035 (0.0081)  time: 0.0886  data: 0.0016  max mem: 906
Test:  [4100/5000]  eta: 0:01:21  model_time: 0.0739 (0.0782)  evaluator_time: 0.0037 (0.0082)  time: 0.0856  data: 0.0019  max mem: 906
Test:  [4200/5000]  eta: 0:01:12  model_time: 0.0745 (0.0781)  evaluator_time: 0.0032 (0.0081)  time: 0.0894  data: 0.0018  max mem: 906
Test:  [4300/5000]  eta: 0:01:03  model_time: 0.0754 (0.0781)  evaluator_time: 0.0039 (0.0081)  time: 0.0880  data: 0.0018  max mem: 906
Test:  [4400/5000]  eta: 0:00:54  model_time: 0.0709 (0.0780)  evaluator_time: 0.0032 (0.0081)  time: 0.0966  data: 0.0017  max mem: 906
Test:  [4500/5000]  eta: 0:00:45  model_time: 0.0742 (0.0780)  evaluator_time: 0.0033 (0.0081)  time: 0.0984  data: 0.0017  max mem: 906
Test:  [4600/5000]  eta: 0:00:36  model_time: 0.0746 (0.0779)  evaluator_time: 0.0034 (0.0080)  time: 0.0879  data: 0.0018  max mem: 906
Test:  [4700/5000]  eta: 0:00:27  model_time: 0.0749 (0.0780)  evaluator_time: 0.0036 (0.0080)  time: 0.0969  data: 0.0017  max mem: 906
Test:  [4800/5000]  eta: 0:00:18  model_time: 0.0732 (0.0780)  evaluator_time: 0.0037 (0.0080)  time: 0.1013  data: 0.0017  max mem: 906
Test:  [4900/5000]  eta: 0:00:09  model_time: 0.0785 (0.0780)  evaluator_time: 0.0056 (0.0080)  time: 0.0949  data: 0.0019  max mem: 906
Test:  [4999/5000]  eta: 0:00:00  model_time: 0.0710 (0.0780)  evaluator_time: 0.0031 (0.0080)  time: 0.0817  data: 0.0017  max mem: 906
Test: Total time: 0:07:30 (0.0901 s / it)
Averaged stats: model_time: 0.0710 (0.0780)  evaluator_time: 0.0031 (0.0080)
Accumulating evaluation results...
DONE (t=1.05s).
Accumulating evaluation results...
DONE (t=0.30s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.502
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.796
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.545
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.341
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.176
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.519
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.603
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.460
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.669
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.738
IoU metric: keypoints
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.599
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.834
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.650
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.553
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.675
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.672
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.889
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.721
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.623
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.741

b) train Keypoint R-CNN by myself with train.py

command: pipenv run python -m torch.distributed.launch --nproc_per_node=3 --use_env train.py --data-path ./coco2017/ --dataset coco_kp --model keypointrcnn_resnet50_fpn --world-size 3 --lr 0.0075

The learning rate lr was set following this suggestion in train.py:

If you use different number of gpus, the learning rate should be changed to 0.02/8*$NGPU.

box AP = 50.6 (Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ])
keypoint AP = 61.1 (Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ])
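
For reference, the linear scaling rule quoted above works out as follows for this 3-GPU setup (a quick sketch, not part of train.py):

# Linear LR scaling: the default lr of 0.02 assumes 8 GPUs, so scale by the GPU count actually used.
base_lr, reference_gpus, num_gpus = 0.02, 8, 3
scaled_lr = base_lr / reference_gpus * num_gpus
print(scaled_lr)  # 0.0075, the value passed via --lr above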

Thank you!

Hi @yoshitomo-matsubara

Thanks a lot for opening this issue!
I just tried running the pre-trained model myself, and I obtained the same performance as you.
This indicates that there is a regression either in torchvision or in PyTorch (or both).

I'm having a look at it.

Some follow-up information: both fasterrcnn_resnet50_fpn and maskrcnn_resnet50_fpn reproduce the expected results, so the problem is only with the keypointrcnn_resnet50_fpn codepath.

Ok, I think I found the issue.

I mistakenly took the wrong model checkpoint when uploading the model for Keypoint R-CNN: I uploaded the checkpoint for epoch 29 instead of the one for epoch 45...

Because the person-keypoint subset of COCO contains fewer images, the number of epochs has to be adapted so that training runs for the same number of iterations.
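
To illustrate that iteration-matching argument with a rough calculation (the image counts below are assumed ballpark figures, not numbers from this thread):

# Keep the total number of training iterations roughly constant when the dataset shrinks.
full_coco_images = 118_287        # approximate size of COCO train2017 (assumed here)
keypoint_subset_images = 64_000   # rough size of the person-keypoint training subset (assumed here)
detector_epochs = 26              # the 2x schedule used for Faster/Mask R-CNN below

equivalent_epochs = detector_epochs * full_coco_images / keypoint_subset_images
print(round(equivalent_epochs))   # ~48 with these assumed counts, in the ballpark of the 46 epochs used for Keypoint R-CNN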

I'll upload the correct weights soon and let you know, thanks a lot for opening the issue!

@yoshitomo-matsubara should be fixed in #1609

Hi @fmassa
Thank you so much for the quick responses and updates!
I just tried your updated weights, and they gave me box AP = 0.546 and keypoint AP = 0.650 :)

One more quick question about my case b) above before you close this issue:
To achieve this performance, did you set --epochs 29 for Keypoint R-CNN and --epochs 45 for Faster and Mask R-CNN when you trained the detectors?
I'm asking because the provided train.py uses epochs=13 by default.

It would be greatly appreciated if you could provide the hyperparameters (as a document or as comments like this) used to train each of the models, so that we have a better idea when training them ourselves, e.g., with different datasets, models, etc.

@yoshitomo-matsubara I'll open a PR with the training hyperparameters.
In the meantime, here are the ones I used, which correspond roughly to the 2x schedule:

Faster R-CNN

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model fasterrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3

Mask R-CNN

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3

Keypoint R-CNN

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco_kp --model keypointrcnn_resnet50_fpn --epochs 46 --lr-steps 36 43 --aspect-ratio-group-factor 3
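
On a smaller setup, the same schedule can be combined with the linear scaling rule quoted earlier; for example, an illustrative 3-GPU variant of the Keypoint R-CNN command (not a command from this thread) would be:

python -m torch.distributed.launch --nproc_per_node=3 --use_env train.py --dataset coco_kp --model keypointrcnn_resnet50_fpn --epochs 46 --lr-steps 36 43 --aspect-ratio-group-factor 3 --lr 0.0075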

Also, if you could send a PR with those schedules, that would be great!

@fmassa
Sure! I just sent a PR with the schedules #1611
Thanks!