Parskatt/DeDoDe

Unable to Reproduce MegaDepth-1500 Evaluation Results

AbyssGaze opened this issue · 35 comments

I have run the evaluation on MegaDepth-1500, but I can't reproduce the reported results. Some experiments are listed in the table below:
[screenshot: evaluation results table]

If you could provide some details about the experimental parameters, that would be great. Thank you very much!

Hi! Yes, this does look significantly worse. We ran the SotA experiments with 30k keypoints at a resized resolution of 784×784 for both DeDoDe-B and -G.

The time in ms looks very low if it includes RANSAC. What softmax temperature did you use?

I'll have a look tonight and compare with our internal code to see where this issue is coming from.

Thanks for your reply.
I used the dual_softmax_matcher as in your demo script. As for the time metric: it only accounts for the cost of your pipeline (detect, describe, and DualSoftMaxMatcher), not for RANSAC pose estimation. I only used 10k keypoints for matching; I will test the performance with 30k keypoints later.

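# Snippet from my evaluation code (a LoFTR-style Lightning `test_step`); it assumes
# `import torch` and that the DeDoDe detector / descriptor / DualSoftMaxMatcher
# objects from the repo are already constructed on `self`.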
def test_step(self, batch, batch_idx):
    with self.profiler.profile("dedode detect"):
        detections_A = self.detector.detect(
            {"image": batch["image0"]}, num_keypoints=30_000
        )
        keypoints_A, P_A = detections_A["keypoints"], detections_A["confidence"]
        detections_B = self.detector.detect(
            {"image": batch["image1"]}, num_keypoints=30_000
        )
        keypoints_B, P_B = detections_B["keypoints"], detections_B["confidence"]
    with self.profiler.profile("dedode describe"):
        description_A = self.descriptor.describe_keypoints(
            {"image": batch["image0"]}, keypoints_A
        )["descriptions"]
        description_B = self.descriptor.describe_keypoints(
            {"image": batch["image1"]}, keypoints_B
        )["descriptions"]
    with self.profiler.profile("dedode match"):
        matches_A, matches_B, batch_ids = self.matcher.match(
            keypoints_A,
            description_A,
            keypoints_B,
            description_B,
            P_A=P_A,
            P_B=P_B,
            normalize=True,
            inv_temp=20,
            threshold=0.01,
        )  # Increasing threshold -> fewer matches, fewer outliers
        # Image sizes at the (possibly resized) evaluation resolution.
        W_A, H_A = batch["image0"].shape[3], batch["image0"].shape[2]
        W_B, H_B = batch["image1"].shape[3], batch["image1"].shape[2]
        # Convert the matcher's normalized coordinates to pixel coordinates.
        matches_A, matches_B = self.matcher.to_pixel_coords(
            matches_A, matches_B, H_A, W_A, H_B, W_B
        )
        # Rescale to the original image resolution for LoFTR-style pose evaluation.
        batch["mkpts0_f"] = matches_A * batch["scale0"]
        batch["mkpts1_f"] = matches_B * batch["scale1"]
        batch["m_bids"] = torch.zeros(matches_A.shape[0]).to(matches_A.device)
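
For reference, here is a minimal sketch of what a dual-softmax matcher with an inverse temperature computes; this is illustrative and not the exact DualSoftMaxMatcher implementation from the repo:

import torch
import torch.nn.functional as F

def dual_softmax_match(desc_A, desc_B, inv_temp=20.0, threshold=0.01):
    # Cosine similarity between L2-normalized descriptors, scaled by the inverse
    # temperature, turned into a joint match probability by multiplying the
    # row-wise and column-wise softmaxes ("dual softmax").
    desc_A = F.normalize(desc_A, dim=-1)
    desc_B = F.normalize(desc_B, dim=-1)
    sim = inv_temp * desc_A @ desc_B.transpose(-2, -1)
    P = sim.softmax(dim=-2) * sim.softmax(dim=-1)
    # Keep mutual nearest neighbours whose joint probability clears the threshold.
    mutual = (P == P.max(dim=-1, keepdim=True).values) & (P == P.max(dim=-2, keepdim=True).values)
    return torch.nonzero(mutual & (P > threshold), as_tuple=False)

A higher inv_temp sharpens the softmax, so the temperature directly affects how many matches survive a given threshold, which is why it matters when trying to reproduce the numbers.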

Are you using the provided benchmark in DeDoDe/benchmarks, or something external?

I'm asking since you're using some kind of different eval script.

We use fixed intrinsics corresponding to a longer side of 1200 during eval, with a pixel threshold of 0.5. It might be that the version you are using does something else?

I'm saying this because it seems that the main difference between our results and the ones you report is in AUC@5, while AUC@20 is more similar. This indicates to me that there is a difference in the estimator used.
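
For concreteness, the "fixed intrinsics of longer side 1200" above can be sketched as rescaling K exactly as the image would be resized; this is a hypothetical helper, not code from the DeDoDe benchmark:

import numpy as np

def rescale_intrinsics(K, h, w, target_longer_side=1200.0):
    # Resizing an image by a factor s scales fx, fy, cx, cy (and skew) by s,
    # so the same factor is applied to the first two rows of K.
    s = target_longer_side / max(h, w)
    K = np.asarray(K, dtype=np.float64).copy()
    K[:2] *= s
    return K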

I evaluated DeDoDe in the LoFTR benchmark (https://github.com/zju3dv/LoFTR), ensuring that all algorithms were evaluated under the same conditions for comparison.

Can you try using our provided benchmark? There may be subtle differences in how things are handled that can mess with the pixel-level accuracy.

Ok, thanks for your suggestion.

Also, how did you load the images? Are they grayscale in LoFTR, and are they even normalized? From what I remember, the LoFTR image loading is quite weird, which might mess with DeDoDe.

We use color images standardized with the ImageNet mean and std.
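
In code, that preprocessing looks roughly like the following sketch using torchvision (the actual image loading in the repo may differ, and the file path is a placeholder):

from PIL import Image
import torchvision.transforms as T

# Load as RGB and standardize with the ImageNet statistics mentioned above.
preprocess = T.Compose([
    T.ToTensor(),  # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = preprocess(Image.open("example.jpg").convert("RGB"))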

I have converted the grayscale images to three-channel color and used the same normalization parameters: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. I have reviewed your evaluation code, and it appears that you scale the intrinsic parameters (essentially scaling the images) before evaluating. In addition, the RANSAC threshold normalization also differs, as shown in the code below:

# LoFTR
ransac_thr = thresh / np.mean([K0[0, 0], K1[1, 1], K0[0, 0], K1[1, 1]])
# Dedode
norm_threshold = threshold / (np.mean(np.abs(K1[:2, :2])) + np.mean(np.abs(K2[:2, :2])))
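
As a quick sanity check of my own (assuming square-pixel intrinsics), the two expressions are nearly equivalent: LoFTR divides by the mean focal length f, while the DeDoDe expression sums two means of 2×2 blocks that are each ≈ f/2, so it also divides by ≈ f. The intrinsics below are made up for illustration:

import numpy as np

# Hypothetical intrinsics with focal length 1200 and square pixels.
K0 = K1 = K2 = np.array([[1200.0, 0.0, 640.0],
                         [0.0, 1200.0, 360.0],
                         [0.0, 0.0, 1.0]])
thresh = threshold = 0.5
ransac_thr = thresh / np.mean([K0[0, 0], K1[1, 1], K0[0, 0], K1[1, 1]])                   # LoFTR
norm_threshold = threshold / (np.mean(np.abs(K1[:2, :2])) + np.mean(np.abs(K2[:2, :2])))  # DeDoDe
print(ransac_thr, norm_threshold)  # both ≈ 0.5 / 1200

So the larger source of discrepancy is more likely the intrinsic rescaling and the estimator than this formula.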

I think this may not be entirely fair when comparing with other algorithms. When I used 30k keypoints for matching, the metrics increased:
[screenshot: evaluation results with 30k keypoints]
I will check again to make sure this is correct.

It depends on what you mean by fair :D Other works resize the images and the intrinsics correspondingly, which is quite similar to what we do. We used the same eval for all the other detector-descriptor methods, so I think it's fair.

Yeah, and the LoFTR normalization is bugged, but it doesn't cause major changes.

You may be right, so I am currently testing to ensure that these methods are evaluated under the same conditions. Some experiments are as follows:
[screenshot: comparison experiments]
Your paper offers insightful thoughts, and I greatly appreciate your work.

I'm guessing the RANSAC threshold you use is for a longer side of 784 instead of 1200? This might matter for sharp thresholds. I agree that it might be somewhat unfair to LoFTR to use 1200; on the other hand, ASpanFormer used 1152, and LightGlue uses the best-performing threshold. Basically, it's not easy to make it fair for everyone.

I think the fairest approach is that of LightGlue: use the optimal threshold for each matcher. Then LoFTR would probably have a higher score (pretty sure they should beat us).
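
A hypothetical sketch of that per-matcher tuning protocol (the eval_fn callback and the candidate thresholds are assumptions, not part of any existing benchmark):

def best_ransac_threshold(eval_fn, thresholds=(0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    # eval_fn(thr) is assumed to run the benchmark for one matcher at RANSAC
    # pixel threshold `thr` and return its AUC@5.
    scores = {thr: eval_fn(thr) for thr in thresholds}
    best = max(scores, key=scores.get)
    return best, scores[best]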

Yes, compared to detection and segmentation tasks, our field is more significantly influenced by pre- and post-processing. It seems essential to build a unified codebase to ensure fairness in evaluating various algorithms. In summary, I greatly appreciate your response and hope that we can all contribute to this field together.

I have converted the grayscale images to three-channel color and used the same normalization parameters: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].

You cannot convert grayscale to RGB; information will be lost. Although I guess the impact of this is minor.

I read the images as RGB directly, not by converting from grayscale.

I will try running our own benchmark with 784×784 intrinsics and compare later tonight.

Although it's probably better to compare to LoFTR with its optimal threshold.

I was actually planning to do that experiment as well. Since you have already done it, I am looking forward to seeing your results. I am also getting off work now, so I wish you all the best!

I looked at the LightGlue paper again; I think Sarlin said he would look into the LoFTR numbers, but it seems they didn't change them. @AbyssGaze perhaps you can check whether you can improve the LoFTR scores by using 1152 or 1200? If so, I will update our paper with the better LoFTR scores.

Also, you have to be careful with what is considered "same conditions"; for example, if you force RoMa or DKM to use images of size 1600, they will break.

Bye bye for now :D

I'm not able to reproduce the poor results. I set the intrinsics to be equivalent to a longer side of 840 and got the following results with DeDoDe-B:

AUC@5: 58.6, AUC@10: 71.6, AUC@20: 81.0

I'll upload the eval script I used

See here:

https://github.com/Parskatt/DeDoDe/blob/main/experiments/eval/eval_dedode_descriptor-B.py

Not exactly sure why it works better than previously. @AbyssGaze please let me know if you get similar results (±1) running this script.

I noticed a significant difference in the results when comparing the RANSAC method used in your script, cv2.USAC_ACCURATE, with cv2.RANSAC. When I used cv2.RANSAC without making any other changes, the metrics decreased to:
None auc: [0.4340217424129329, 0.6105990766754759, 0.7489782173192135]
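
For anyone comparing, the difference comes down to the `method` flag passed to OpenCV's essential-matrix estimation. A minimal sketch in the style of typical evaluation scripts (not the repo's estimate_pose; keypoints and intrinsics are placeholders):

import cv2
import numpy as np

def estimate_relative_pose(kpts0, kpts1, K0, K1, thresh=0.5, method=cv2.USAC_ACCURATE):
    # Normalize keypoints by the intrinsics and scale the pixel threshold by the
    # mean focal length, then estimate the essential matrix and recover the pose.
    norm_thresh = thresh / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])
    kpts0_n = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    kpts1_n = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    E, mask = cv2.findEssentialMat(
        kpts0_n, kpts1_n, np.eye(3), method=method, prob=0.99999, threshold=norm_thresh
    )
    _, R, t, _ = cv2.recoverPose(E, kpts0_n, kpts1_n, np.eye(3), mask=mask)
    return R, t

Swapping method between cv2.RANSAC and cv2.USAC_ACCURATE is enough to move AUC@5 by several points, consistent with the gap observed above.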

Oh wow, there's a double definition of estimate_pose in utils.py... That's very unfortunate. Let me rerun the experiments too.

@AbyssGaze Yes, unfortunately, I'm able to reproduce your results. I'll rerun the eval on mega-1500 for all keypoint methods (we used the same incorrect estimator for all), and update the paper as soon as possible. Thanks a lot for finding this out, it would have been horrible to find out later!

Without adjusting the intrinsics (I guess this corresponds to the setting in LoFTR), we get (AUC@5 / AUC@10 / AUC@20):

DeDoDe-B: [49.4, 65.5, 77.7]
DeDoDe-G: [52.8, 69.7, 82.0]
ALIKED: [41.9, 58.4, 71.7]
DISK: [35.0, 51.4, 64.9]
SiLK: [39.9, 55.1, 66.9] # Note: Reran with threshold 0.05 instead of 0.01 as it seems to work better for mega-1500, should also run 0.01
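
For completeness, the AUC@5/10/20 numbers in this thread are the standard pose-error AUC used in SuperGlue/LoFTR-style evaluations (area under the recall-vs-error curve up to each threshold, with the pose error typically taken as the maximum of the angular rotation and translation errors). A self-contained reference version:

import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    # `errors` is one pose error (in degrees) per image pair; the AUC at each
    # threshold is the normalized area under the recall curve up to that threshold.
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        aucs.append(np.trapz(r, x=e) / t)
    return aucs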

Thank you for your reply. It is necessary to maintain the same RANSAC method throughout the entire evaluation. There is also a lot of work being done on subsequent outlier filtering, which plays a crucial role in improving overall performance.

It is necessary to maintain the same RANSAC method throughout the entire evaluation

Yes, definitely!

There is also a lot of work being done on subsequent outlier filtering, which plays a crucial role in improving overall performance.

Yes, in this work we wanted to see how good the performance is without any such filtering. Of course, we expect the performance of DeDoDe to increase further with filtering.

I'm closing this and opening a new issue #13 to track the updated results.