Cannot reproduce performance of Keypoint R-CNN with ResNet-50
yoshitomo-matsubara opened this issue · 7 comments
I tried two approaches to reproduce the performance of Keypoint R-CNN with ResNet-50 (box AP = 54.6, keypoint AP = 65.0):
a) evaluate the pretrained Keypoint R-CNN with train.py
b) train Keypoint R-CNN myself with train.py
Neither reproduced the reported performance. As for a), my guess is that I need to set some parameters besides the pretrained flag.
Could you please help me reproduce the performance, hopefully for both a) and b)? More details about my results are given below.
Environment
- 3 GPUs
- Ubuntu 18.04 LTS
- Python 3.6.8
- torch==1.3.1
- torchvision==0.4.2
Details
a) use pretrained Keypoint R-CNN with train.py
command: pipenv run python train.py --data-path ./coco2017/ --dataset coco_kp --model keypointrcnn_resnet50_fpn --test-only --pretrained
log
Not using distributed mode
Namespace(aspect_ratio_group_factor=0, batch_size=2, data_path='./coco2017/', dataset='coco_kp', device='cuda', dist_url='env://', distributed=False, epochs=13, lr=0.02, lr_gamma=0.1, lr_step_size=8, lr_steps=[8, 11], model='keypointrcnn_resnet50_fpn', momentum=0.9, output_dir='.', pretrained=True, print_freq=20, resume='', test_only=True, weight_decay=0.0001, workers=4, world_size=1)
Loading data
loading annotations into memory...
Done (t=6.30s)
creating index...
index created!
loading annotations into memory...
Done (t=0.74s)
creating index...
index created!
Creating data loaders
Using [0, 1.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [12345 35717]
Creating model
Test: [ 0/5000] eta: 0:44:23 model_time: 0.2646 (0.2646) evaluator_time: 0.0069 (0.0069) time: 0.5326 data: 0.2532 max mem: 624
Test: [ 100/5000] eta: 0:07:54 model_time: 0.0764 (0.0810) evaluator_time: 0.0037 (0.0089) time: 0.0880 data: 0.0019 max mem: 712
Test: [ 200/5000] eta: 0:07:25 model_time: 0.0719 (0.0785) evaluator_time: 0.0031 (0.0088) time: 0.0899 data: 0.0018 max mem: 795
Test: [ 300/5000] eta: 0:07:08 model_time: 0.0733 (0.0779) evaluator_time: 0.0040 (0.0082) time: 0.0933 data: 0.0019 max mem: 817
Test: [ 400/5000] eta: 0:06:57 model_time: 0.0720 (0.0780) evaluator_time: 0.0035 (0.0081) time: 0.0824 data: 0.0017 max mem: 820
Test: [ 500/5000] eta: 0:06:43 model_time: 0.0656 (0.0772) evaluator_time: 0.0032 (0.0077) time: 0.0851 data: 0.0019 max mem: 820
Test: [ 600/5000] eta: 0:06:38 model_time: 0.0693 (0.0780) evaluator_time: 0.0033 (0.0082) time: 0.0793 data: 0.0018 max mem: 846
Test: [ 700/5000] eta: 0:06:32 model_time: 0.0678 (0.0783) evaluator_time: 0.0034 (0.0085) time: 0.0820 data: 0.0018 max mem: 853
Test: [ 800/5000] eta: 0:06:21 model_time: 0.0731 (0.0782) evaluator_time: 0.0032 (0.0083) time: 0.0805 data: 0.0017 max mem: 853
Test: [ 900/5000] eta: 0:06:12 model_time: 0.0748 (0.0782) evaluator_time: 0.0029 (0.0084) time: 0.0851 data: 0.0015 max mem: 858
Test: [1000/5000] eta: 0:06:01 model_time: 0.0713 (0.0779) evaluator_time: 0.0030 (0.0082) time: 0.0884 data: 0.0019 max mem: 858
Test: [1100/5000] eta: 0:05:52 model_time: 0.0713 (0.0778) evaluator_time: 0.0040 (0.0082) time: 0.0859 data: 0.0018 max mem: 858
Test: [1200/5000] eta: 0:05:43 model_time: 0.0715 (0.0780) evaluator_time: 0.0031 (0.0082) time: 0.0941 data: 0.0018 max mem: 872
Test: [1300/5000] eta: 0:05:36 model_time: 0.0725 (0.0783) evaluator_time: 0.0033 (0.0085) time: 0.0847 data: 0.0017 max mem: 872
Test: [1400/5000] eta: 0:05:28 model_time: 0.0780 (0.0785) evaluator_time: 0.0042 (0.0086) time: 0.1081 data: 0.0020 max mem: 872
Test: [1500/5000] eta: 0:05:18 model_time: 0.0718 (0.0782) evaluator_time: 0.0033 (0.0085) time: 0.0884 data: 0.0017 max mem: 872
Test: [1600/5000] eta: 0:05:08 model_time: 0.0752 (0.0782) evaluator_time: 0.0047 (0.0084) time: 0.1013 data: 0.0020 max mem: 872
Test: [1700/5000] eta: 0:05:00 model_time: 0.0687 (0.0784) evaluator_time: 0.0032 (0.0085) time: 0.0954 data: 0.0019 max mem: 884
Test: [1800/5000] eta: 0:04:50 model_time: 0.0665 (0.0782) evaluator_time: 0.0028 (0.0084) time: 0.0767 data: 0.0016 max mem: 884
Test: [1900/5000] eta: 0:04:41 model_time: 0.0689 (0.0782) evaluator_time: 0.0027 (0.0085) time: 0.0863 data: 0.0014 max mem: 888
Test: [2000/5000] eta: 0:04:32 model_time: 0.0712 (0.0781) evaluator_time: 0.0032 (0.0084) time: 0.0873 data: 0.0017 max mem: 888
Test: [2100/5000] eta: 0:04:22 model_time: 0.0720 (0.0781) evaluator_time: 0.0028 (0.0084) time: 0.0955 data: 0.0017 max mem: 888
Test: [2200/5000] eta: 0:04:13 model_time: 0.0734 (0.0780) evaluator_time: 0.0039 (0.0083) time: 0.0938 data: 0.0019 max mem: 888
Test: [2300/5000] eta: 0:04:04 model_time: 0.0688 (0.0781) evaluator_time: 0.0027 (0.0083) time: 0.0816 data: 0.0015 max mem: 894
Test: [2400/5000] eta: 0:03:55 model_time: 0.0777 (0.0781) evaluator_time: 0.0032 (0.0083) time: 0.0898 data: 0.0017 max mem: 895
Test: [2500/5000] eta: 0:03:46 model_time: 0.0704 (0.0783) evaluator_time: 0.0034 (0.0084) time: 0.0905 data: 0.0018 max mem: 895
Test: [2600/5000] eta: 0:03:37 model_time: 0.0723 (0.0783) evaluator_time: 0.0030 (0.0083) time: 0.0892 data: 0.0015 max mem: 895
Test: [2700/5000] eta: 0:03:28 model_time: 0.0708 (0.0783) evaluator_time: 0.0029 (0.0084) time: 0.0847 data: 0.0016 max mem: 896
Test: [2800/5000] eta: 0:03:19 model_time: 0.0719 (0.0782) evaluator_time: 0.0032 (0.0083) time: 0.0906 data: 0.0017 max mem: 896
Test: [2900/5000] eta: 0:03:10 model_time: 0.0741 (0.0782) evaluator_time: 0.0037 (0.0083) time: 0.0879 data: 0.0019 max mem: 896
Test: [3000/5000] eta: 0:03:01 model_time: 0.0756 (0.0783) evaluator_time: 0.0042 (0.0083) time: 0.0950 data: 0.0018 max mem: 900
Test: [3100/5000] eta: 0:02:51 model_time: 0.0709 (0.0782) evaluator_time: 0.0029 (0.0082) time: 0.0834 data: 0.0017 max mem: 900
Test: [3200/5000] eta: 0:02:42 model_time: 0.0734 (0.0782) evaluator_time: 0.0035 (0.0082) time: 0.0858 data: 0.0017 max mem: 900
Test: [3300/5000] eta: 0:02:34 model_time: 0.0726 (0.0783) evaluator_time: 0.0029 (0.0083) time: 0.0946 data: 0.0017 max mem: 903
Test: [3400/5000] eta: 0:02:24 model_time: 0.0687 (0.0782) evaluator_time: 0.0032 (0.0082) time: 0.0788 data: 0.0017 max mem: 903
Test: [3500/5000] eta: 0:02:15 model_time: 0.0685 (0.0782) evaluator_time: 0.0030 (0.0082) time: 0.0822 data: 0.0017 max mem: 903
Test: [3600/5000] eta: 0:02:06 model_time: 0.0764 (0.0783) evaluator_time: 0.0029 (0.0082) time: 0.0878 data: 0.0016 max mem: 903
Test: [3700/5000] eta: 0:01:57 model_time: 0.0739 (0.0783) evaluator_time: 0.0043 (0.0082) time: 0.0979 data: 0.0020 max mem: 903
Test: [3800/5000] eta: 0:01:48 model_time: 0.0790 (0.0783) evaluator_time: 0.0047 (0.0083) time: 0.1088 data: 0.0021 max mem: 906
Test: [3900/5000] eta: 0:01:39 model_time: 0.0701 (0.0782) evaluator_time: 0.0029 (0.0082) time: 0.0775 data: 0.0016 max mem: 906
Test: [4000/5000] eta: 0:01:30 model_time: 0.0720 (0.0782) evaluator_time: 0.0035 (0.0081) time: 0.0886 data: 0.0016 max mem: 906
Test: [4100/5000] eta: 0:01:21 model_time: 0.0739 (0.0782) evaluator_time: 0.0037 (0.0082) time: 0.0856 data: 0.0019 max mem: 906
Test: [4200/5000] eta: 0:01:12 model_time: 0.0745 (0.0781) evaluator_time: 0.0032 (0.0081) time: 0.0894 data: 0.0018 max mem: 906
Test: [4300/5000] eta: 0:01:03 model_time: 0.0754 (0.0781) evaluator_time: 0.0039 (0.0081) time: 0.0880 data: 0.0018 max mem: 906
Test: [4400/5000] eta: 0:00:54 model_time: 0.0709 (0.0780) evaluator_time: 0.0032 (0.0081) time: 0.0966 data: 0.0017 max mem: 906
Test: [4500/5000] eta: 0:00:45 model_time: 0.0742 (0.0780) evaluator_time: 0.0033 (0.0081) time: 0.0984 data: 0.0017 max mem: 906
Test: [4600/5000] eta: 0:00:36 model_time: 0.0746 (0.0779) evaluator_time: 0.0034 (0.0080) time: 0.0879 data: 0.0018 max mem: 906
Test: [4700/5000] eta: 0:00:27 model_time: 0.0749 (0.0780) evaluator_time: 0.0036 (0.0080) time: 0.0969 data: 0.0017 max mem: 906
Test: [4800/5000] eta: 0:00:18 model_time: 0.0732 (0.0780) evaluator_time: 0.0037 (0.0080) time: 0.1013 data: 0.0017 max mem: 906
Test: [4900/5000] eta: 0:00:09 model_time: 0.0785 (0.0780) evaluator_time: 0.0056 (0.0080) time: 0.0949 data: 0.0019 max mem: 906
Test: [4999/5000] eta: 0:00:00 model_time: 0.0710 (0.0780) evaluator_time: 0.0031 (0.0080) time: 0.0817 data: 0.0017 max mem: 906
Test: Total time: 0:07:30 (0.0901 s / it)
Averaged stats: model_time: 0.0710 (0.0780) evaluator_time: 0.0031 (0.0080)
Accumulating evaluation results...
DONE (t=1.05s).
Accumulating evaluation results...
DONE (t=0.30s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.502
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.796
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.545
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.341
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.176
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.519
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.603
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.460
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.669
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.738
IoU metric: keypoints
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.599
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.834
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.650
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.553
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.675
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.672
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.889
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.721
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.741
b) train Keypoint R-CNN by myself with train.py
command: pipenv run python -m torch.distributed.launch --nproc_per_node=3 --use_env train.py --data-path ./coco2017/ --dataset coco_kp --model keypointrcnn_resnet50_fpn --world-size 3 --lr 0.0075
The learning rate lr is set following this suggestion in train.py:
If you use different number of gpus, the learning rate should be changed to 0.02/8*$NGPU.
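The scaling rule quoted above can be sketched as a tiny helper. This is a minimal sketch of the arithmetic only; the function name `scaled_lr` is my own, not part of train.py.

```python
# Linear LR scaling rule quoted from train.py: the reference schedule
# uses lr=0.02 on 8 GPUs, so a different GPU count scales the rate
# proportionally (0.02/8*$NGPU).
def scaled_lr(num_gpus, base_lr=0.02, base_gpus=8):
    return base_lr / base_gpus * num_gpus

print(round(scaled_lr(3), 6))  # 0.0075, the value used in the command above
```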
box AP = 50.6 (Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ])
keypoint AP = 61.1 (Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ])
Thank you!
Thanks a lot for opening this issue!
I just tried running the pre-trained model myself, and I obtained the same performance as you.
This indicates a regression in either torchvision or in PyTorch (or both).
I'm having a look at it.
Some follow-up information: both fasterrcnn_resnet50_fpn
and maskrcnn_resnet50_fpn
reproduce the expected results, so the problem is only in the keypointrcnn_resnet50_fpn
codepath.
Ok, I think I found the issue.
I mistakenly took the wrong model checkpoint when uploading the model for Keypoint R-CNN: I took the checkpoint for epoch 29 instead of the one for epoch 45...
Because the person-keypoint subset of COCO has fewer images, the number of epochs should be adapted so that we end up with the same number of iterations.
I'll upload the correct weights soon and let you know, thanks a lot for opening the issue!
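The epoch adaptation described above can be sketched numerically. Note the dataset sizes below are approximations I'm assuming (roughly 118k images in COCO train2017, roughly 64k in the person-keypoint subset), not figures stated in this thread.

```python
# Rough sketch (dataset sizes are approximate assumptions): keep the
# total number of training iterations comparable when moving from full
# COCO detection to the smaller person-keypoint subset.
full_coco_images = 118_287   # approx. COCO train2017 size
keypoint_images = 64_115     # approx. person-keypoint subset size
detection_epochs = 26        # the 2x schedule used for Faster/Mask R-CNN

iterations = detection_epochs * full_coco_images
equivalent_epochs = iterations / keypoint_images
print(round(equivalent_epochs))  # ~48, in the same ballpark as the 46 epochs used
```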
@yoshitomo-matsubara should be fixed in #1609
Hi @fmassa
Thank you so much for the quick responses and updates!
I just tried to use your updated weights, and it gave me box AP = 0.546 and keypoint AP = 0.650 :)
One more quick question about my case b) above before you close this issue:
To achieve this performance, did you set --epochs 29 for Keypoint R-CNN and --epochs 45 for Faster and Mask R-CNN when you trained the detectors?
I'm asking since the provided train.py uses epochs=13 by default.
It would be much appreciated if you could provide the hyperparameters (as a document or comments like this) used to train each of the models, so that we have a better idea when training them ourselves, e.g., with different datasets, models, etc.
@yoshitomo-matsubara I'll open a PR with the training hyperparameters for training.
In the meantime, here are the ones I used, which correspond roughly to the 2x schedule:
Faster R-CNN
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model fasterrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
Mask R-CNN
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
Keypoint R-CNN
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco_kp --model keypointrcnn_resnet50_fpn --epochs 46 --lr-steps 36 43 --aspect-ratio-group-factor 3
Also, if you could send a PR with those schedules it would be great!