ViTAE-Transformer/ViTPose

Training on test images when using CrowdPose?

1281167 opened this issue · 5 comments

Dear authors, thanks for the exciting work and I'd like to apologize in advance if I misunderstood.

As you may already know, the CrowdPose dataset is made up of cherry-picked crowd samples selected from MSCOCO, MPII and AIC, but the CrowdPose authors did not specify whether they treated train/val/test images from MSCOCO/MPII/AIC differently. They also re-annotated (presumably more accurately) these samples.

What we have noticed is that many of the test images in the "MS COCO val set" are also present in the "CrowdPose train" and "CrowdPose train/val" splits. Although CrowdPose has renamed all its images, we have identified at least 181 images in "CrowdPose train/val" that have the same md5 hash as images in the "MS COCO val set".

For example, "108951.jpg" in "CrowdPose train" and "000000147740.jpg" in "MS COCO val set" are the same image with md5: f9fc120dc085166b30c08da3de333b69
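For anyone who wants to reproduce this kind of check, here is a minimal sketch of an md5-based overlap search between two image folders. The function names, the folder layout, and the `*.jpg` pattern are assumptions made for illustration, not the exact script we used. It only finds byte-identical files, so any re-encoded or resized copy will be missed.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_md5(image_dir: Path) -> Dict[str, List[Path]]:
    """Map each md5 digest to the .jpg files in image_dir that have it."""
    index: Dict[str, List[Path]] = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        index.setdefault(md5_of_file(path), []).append(path)
    return index


def find_overlap(dir_a: Path, dir_b: Path) -> List[Tuple[Path, Path]]:
    """Return (file_in_a, file_in_b) pairs whose bytes are identical."""
    index_a = index_by_md5(dir_a)
    pairs: List[Tuple[Path, Path]] = []
    for path_b in sorted(Path(dir_b).glob("*.jpg")):
        for path_a in index_a.get(md5_of_file(path_b), []):
            pairs.append((path_a, path_b))
    return pairs
```

Calling `find_overlap(crowdpose_dir, coco_val_dir)` with hypothetical paths to the two image folders would list the matching pairs regardless of how the files were renamed.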

We did not identify any image overlap between CrowdPose and MPII/AIC at the md5 level for either train or test images, possibly because CrowdPose applied some preprocessing to the selected MPII/AIC images. Based on the finding on COCO, however, the possibility of such train-test overlap with MPII/AIC is notable. We have not yet checked whether "CrowdPose test" images are also present in the "COCO train set".

So if I did not miss anything, the model jointly trained on COCO+AIC+MPII+CrowdPose would have seen many of the test images (with labels, at least for COCO) during the training process, making the results untrustworthy.

Thanks for your comments. As demonstrated in Table 6, the performance gains brought by CrowdPose are relatively small compared with the performance gains brought by AIC. We suspect that the use of the AIC dataset is more important in the multiple-dataset setting and that the overlap between COCO and CrowdPose is rather small. The results without the CrowdPose dataset are already SOTA, so this issue does not affect the conclusion. Besides, the annotations for the COCO and CrowdPose datasets are different, and the images from the CrowdPose dataset are not processed with the MS COCO head. We will remove these replicated images and re-examine the sources of the performance gains. What's more, according to Table 8, the single-task results of the ViTPose variants with MS COCO are already SOTA. The multiple-dataset results are only used to demonstrate the flexibility of the proposed ViTPose. The biggest model, i.e., ViTPose-G, is trained with MS COCO and AIC only and obtains 81.0 on the MS COCO test set.

Thanks again. We will exclude the influence of CrowdPose and retrain the models as soon as possible. Please stay tuned.

Thanks for the prompt reply! And sorry for being vague in my previous comment; I believe the results in the paper, except those with CrowdPose, are solid and we have also verified some of them ourselves. Looking forward to the updated version of the results.

Regarding your comment "We will remove these replicated images and re examine the sources of the performance gains.": we only checked the overlap issue with md5, and this doesn't work on AIC/MPII (no duplicated md5 hashes, even for training set images, so it seems all image files were altered). Given the situation on COCO, it's likely that there are also test images of AIC/MPII in the CrowdPose train set, but since all image files are renamed and some are altered at the bit level, it's hard to identify all duplicates. We have contacted the authors of CrowdPose about this issue, but they haven't replied yet. So I personally think CrowdPose is not suited for joint training with COCO/AIC/MPII at this point.

It is indeed a challenging problem to reduce the overlap between two datasets. Thus we will remove the CrowdPose-related joint training results for now. If the authors of CrowdPose supply the source information of the images, we can easily investigate the source of the performance gains brought by CrowdPose. Alternatively, we may consider other techniques, such as local descriptors, to match the images in the two datasets. However, this still needs human effort to guarantee match quality. You're very welcome to share any better ideas for solving this issue. Please feel free to contact us!
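As one possible starting point for matching images that were altered at the bit level, here is a rough sketch of a difference hash (dHash). Note that this is a global perceptual hash, a simpler stand-in for the local descriptors mentioned above, and not something either side of this thread has run. It assumes the images are already decoded into rows of grayscale values by some image library, and the names `shrink`, `dhash`, and `hamming` are hypothetical. A small Hamming distance only flags a candidate pair, which would still need human verification.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale pixel values


def shrink(img: Gray, out_h: int, out_w: int) -> List[List[float]]:
    """Downsample by averaging the pixels that fall into each output cell."""
    in_h, in_w = len(img), len(img[0])
    out: List[List[float]] = []
    for r in range(out_h):
        r0 = r * in_h // out_h
        r1 = max((r + 1) * in_h // out_h, r0 + 1)  # at least one source row
        row: List[float] = []
        for c in range(out_w):
            c0 = c * in_w // out_w
            c1 = max((c + 1) * in_w // out_w, c0 + 1)  # at least one source column
            cells = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def dhash(img: Gray, size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pair of cells in a size x (size + 1) grid."""
    small = shrink(img, size, size + 1)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Because each bit depends only on whether one cell is brighter than its right-hand neighbor, a uniform brightness shift leaves the hash unchanged, while a different image usually flips many bits. The distance threshold for "probably the same image" would have to be tuned on known duplicates, such as the COCO pairs already found via md5.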

Hi,

We updated the training results for ViTPose-B, ViTPose-L, and ViTPose-H without images from the CrowdPose dataset. Removing the CrowdPose data eliminates the duplicate images and also reduces the number of training images, and the AP on the MS COCO validation set decreased by about 0.4. Since there are also unique images in the CrowdPose training set, it is difficult to determine for now how much of this is due to the duplicate images. On the OCHuman dataset, the ViTPose variants experienced an AP drop of 0.2 to 0.6. This suggests that occlusion scenes in the training data can help ViTPose generalize better, although the performance is SOTA even without this training data. Besides, the performance is almost the same on both the MPII and AIC datasets, regardless of whether CrowdPose is used for training. We suspect that heavy occlusion scenarios are not common in these two datasets and that ViTPose can generalize well in these cases.

Thanks again for your interest. We are trying to figure out the actual impact of the duplicate images. It's a rather challenging problem, so if you have any better ideas for solving this issue, please feel free to contact us.

Best,

It seems there are no further questions. I will close this issue temporarily. If you have any more questions, please feel free to re-open it.