Parskatt/DeDoDe

Unable to Reproduce MegaDepth-1500 Evaluation Results

AbyssGaze opened this issue · 35 comments

I have run the evaluation on MegaDepth-1500, but I can't reproduce the reported results. Some experiments are listed in the table below:
[screenshot: evaluation results table]

If you could provide some details about the experimental parameters, that would be great. Thank you very much!

Hi! Yes, this does look significantly worse. We ran the SotA experiments with 30k keypoints at a resized resolution of 784×784 for both DeDoDe-B and -G.

The time in ms looks very low if it includes RANSAC. What softmax temperature did you use?

I'll have a look tonight and compare with our internal code to see where this issue is coming from.

Thanks for your reply.
I used the dual_softmax_matcher as in your demo script. As for the time metric: it only accounts for the cost of your pipeline (detect, describe, and DualSoftMaxMatcher), not for RANSAC pose estimation. I only used 10k keypoints for matching; I will test the performance with 30k keypoints later.

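# Snippet from my evaluation code (a LoFTR-style Lightning `test_step`); it assumes
# `import torch` and that the DeDoDe detector / descriptor / DualSoftMaxMatcher
# objects from the repo are already constructed on `self`.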
def test_step(self, batch, batch_idx):
    with self.profiler.profile("dedode detect"):
        detections_A = self.detector.detect(
            {"image": batch["image0"]}, num_keypoints=30_000
        )
        keypoints_A, P_A = detections_A["keypoints"], detections_A["confidence"]
        detections_B = self.detector.detect(
            {"image": batch["image1"]}, num_keypoints=30_000
        )
        keypoints_B, P_B = detections_B["keypoints"], detections_B["confidence"]
    with self.profiler.profile("dedode describe"):
        description_A = self.descriptor.describe_keypoints(
            {"image": batch["image0"]}, keypoints_A
        )["descriptions"]
        description_B = self.descriptor.describe_keypoints(
            {"image": batch["image1"]}, keypoints_B
        )["descriptions"]
    with self.profiler.profile("dedode match"):
        matches_A, matches_B, batch_ids = self.matcher.match(
            keypoints_A,
            description_A,
            keypoints_B,
            description_B,
            P_A=P_A,
            P_B=P_B,
            normalize=True,
            inv_temp=20,
            threshold=0.01,
        )  # Increasing threshold -> fewer matches, fewer outliers
        # Image sizes at the (possibly resized) evaluation resolution.
        W_A, H_A = batch["image0"].shape[3], batch["image0"].shape[2]
        W_B, H_B = batch["image1"].shape[3], batch["image1"].shape[2]
        # Convert the matcher's normalized coordinates to pixel coordinates.
        matches_A, matches_B = self.matcher.to_pixel_coords(
            matches_A, matches_B, H_A, W_A, H_B, W_B
        )
        # Rescale to the original image resolution for LoFTR-style pose evaluation.
        batch["mkpts0_f"] = matches_A * batch["scale0"]
        batch["mkpts1_f"] = matches_B * batch["scale1"]
        batch["m_bids"] = torch.zeros(matches_A.shape[0]).to(matches_A.device)
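
For reference, here is a minimal sketch of what a dual-softmax matcher with an inverse temperature computes; this is illustrative and not the exact DualSoftMaxMatcher implementation from the repo:

import torch
import torch.nn.functional as F

def dual_softmax_match(desc_A, desc_B, inv_temp=20.0, threshold=0.01):
    # Cosine similarity between L2-normalized descriptors, scaled by the inverse
    # temperature, turned into a joint match probability by multiplying the
    # row-wise and column-wise softmaxes ("dual softmax").
    desc_A = F.normalize(desc_A, dim=-1)
    desc_B = F.normalize(desc_B, dim=-1)
    sim = inv_temp * desc_A @ desc_B.transpose(-2, -1)
    P = sim.softmax(dim=-2) * sim.softmax(dim=-1)
    # Keep mutual nearest neighbours whose joint probability clears the threshold.
    mutual = (P == P.max(dim=-1, keepdim=True).values) & (P == P.max(dim=-2, keepdim=True).values)
    return torch.nonzero(mutual & (P > threshold), as_tuple=False)

A higher inv_temp sharpens the softmax, so the temperature directly affects how many matches survive a given threshold, which is why it matters when trying to reproduce the numbers.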

Are you using the provided benchmark in DeDoDe/benchmarks, or something external?

I'm asking since you're using some kind of different eval script.

We use fixed intrinsics corresponding to a longer side of 1200 during eval, with a pixel threshold of 0.5. It might be that the version you are using does something else?

I'm saying this because it seems that the main difference between our results and the ones you report is in AUC@5, while AUC@20 is more similar. This indicates to me that there is a difference in the estimator used.
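
For concreteness, the "fixed intrinsics of longer side 1200" above can be sketched as rescaling K exactly as the image would be resized; this is a hypothetical helper, not code from the DeDoDe benchmark:

import numpy as np

def rescale_intrinsics(K, h, w, target_longer_side=1200.0):
    # Resizing an image by a factor s scales fx, fy, cx, cy (and skew) by s,
    # so the same factor is applied to the first two rows of K.
    s = target_longer_side / max(h, w)
    K = np.asarray(K, dtype=np.float64).copy()
    K[:2] *= s
    return K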

I evaluated DeDoDe in the LoFTR benchmark (https://github.com/zju3dv/LoFTR), ensuring that all algorithms were evaluated under the same conditions for comparison.

Can you try using our provided benchmark? There may be subtle differences in how things are handled that can mess with the pixel-level accuracy.

Ok, thanks for your suggestion.

Also, how did you load the images? Are they grayscale in LoFTR, and are they even normalized? From what I remember, the LoFTR image loading is quite weird, which might mess with DeDoDe.

We use color images standardized with the ImageNet mean and std.
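
In code, that preprocessing looks roughly like the following sketch using torchvision (the actual image loading in the repo may differ, and the file path is a placeholder):

from PIL import Image
import torchvision.transforms as T

# Load as RGB and standardize with the ImageNet statistics mentioned above.
preprocess = T.Compose([
    T.ToTensor(),  # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = preprocess(Image.open("example.jpg").convert("RGB"))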

I have converted the grayscale images to three-channel color and used the same normalization parameters: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. I have reviewed your evaluation code, and it appears that you scale the intrinsic parameters (essentially scaling the images) before evaluating. In addition, the RANSAC threshold normalization also differs, as shown in the code below:

# LoFTR
ransac_thr = thresh / np.mean([K0[0, 0], K1[1, 1], K0[0, 0], K1[1, 1]])
# Dedode
norm_threshold = threshold / (np.mean(np.abs(K1[:2, :2])) + np.mean(np.abs(K2[:2, :2])))
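
As a quick sanity check of my own (assuming square-pixel intrinsics), the two expressions are nearly equivalent: LoFTR divides by the mean focal length f, while the DeDoDe expression sums two means of 2×2 blocks that are each ≈ f/2, so it also divides by ≈ f. The intrinsics below are made up for illustration:

import numpy as np

# Hypothetical intrinsics with focal length 1200 and square pixels.
K0 = K1 = K2 = np.array([[1200.0, 0.0, 640.0],
                         [0.0, 1200.0, 360.0],
                         [0.0, 0.0, 1.0]])
thresh = threshold = 0.5
ransac_thr = thresh / np.mean([K0[0, 0], K1[1, 1], K0[0, 0], K1[1, 1]])                   # LoFTR
norm_threshold = threshold / (np.mean(np.abs(K1[:2, :2])) + np.mean(np.abs(K2[:2, :2])))  # DeDoDe
print(ransac_thr, norm_threshold)  # both ≈ 0.5 / 1200

So the larger source of discrepancy is more likely the intrinsic rescaling and the estimator than this formula.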

I think this may not be entirely fair when comparing with other algorithms. When I used 30k keypoints for matching, the metrics increased:
[screenshot: evaluation results with 30k keypoints]
I will check again to make sure this is correct.

It depends on what you mean by fair :D Other works resize the images and the intrinsics correspondingly, which is quite similar to what we do. We used the same eval for all the other detector-descriptor methods, so I think it's fair.

Yeah, and the LoFTR normalization is bugged, but it doesn't cause major changes.

You may be right, so I am currently testing to ensure that these methods are evaluated under the same conditions. Some experiments are as follows:
[screenshot: comparison experiments]
Your paper offers insightful thoughts, and I greatly appreciate your work.

I'm guessing the RANSAC threshold you use is for a longer side of 784 instead of 1200? This might matter for sharp thresholds. I agree that it might be somewhat unfair to LoFTR to use 1200; on the other hand, ASpanFormer used 1152, and LightGlue uses the best-performing threshold. Basically, it's not easy to make it fair for everyone.

I think the fairest approach is that of LightGlue: use the optimal threshold for each matcher. Then LoFTR would probably have a higher score (pretty sure they should beat us).
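
A hypothetical sketch of that per-matcher tuning protocol (the eval_fn callback and the candidate thresholds are assumptions, not part of any existing benchmark):

def best_ransac_threshold(eval_fn, thresholds=(0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    # eval_fn(thr) is assumed to run the benchmark for one matcher at RANSAC
    # pixel threshold `thr` and return its AUC@5.
    scores = {thr: eval_fn(thr) for thr in thresholds}
    best = max(scores, key=scores.get)
    return best, scores[best]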

Yes, compared to detection and segmentation tasks, our field is more significantly influenced by pre- and post-processing. It seems essential to build a unified codebase to ensure fairness in evaluating various algorithms. In summary, I greatly appreciate your response and hope that we can all contribute to this field together.

I have converted the grayscale images to three-channel color and used the same normalization parameters: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].

You cannot convert grayscale to RGB; information will be lost. Although I guess the impact of this is minor.

I read the images as RGB directly, not by converting from grayscale.

I will try running our own benchmark with 784×784 intrinsics and compare later tonight.

Although it's probably better to compare to LoFTR with its optimal threshold.

I was actually planning to do that experiment as well. Since you have already done it, I am looking forward to seeing your results. I am also getting off work now, so I wish you all the best!

I looked at the LightGlue paper again; I think Sarlin said he would look into the LoFTR numbers, but it seems they didn't change them. @AbyssGaze perhaps you can check whether you can improve the LoFTR scores by using 1152 or 1200? If so, I will update our paper with the better LoFTR scores.

Also, you have to be careful with what is considered "same conditions"; for example, if you force RoMa or DKM to use images of size 1600, they will break.

Bye bye for now :D

I'm not able to reproduce the poor results. I set the intrinsics to be equivalent to a longer side of 840 and got the following results with DeDoDe-B:

AUC@5: 58.6, AUC@10: 71.6, AUC@20: 81.0

I'll upload the eval script I used

See here:

https://github.com/Parskatt/DeDoDe/blob/main/experiments/eval/eval_dedode_descriptor-B.py

Not exactly sure why it works better than previously. @AbyssGaze please let me know if you get similar results (±1) running this script.

I noticed a significant difference in the results when comparing the RANSAC method used in your script, cv2.USAC_ACCURATE, with cv2.RANSAC. When I used cv2.RANSAC without making any other changes, the metrics decreased to:
None auc: [0.4340217424129329, 0.6105990766754759, 0.7489782173192135]
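
For anyone comparing, the difference comes down to the `method` flag passed to OpenCV's essential-matrix estimation. A minimal sketch in the style of typical evaluation scripts (not the repo's estimate_pose; keypoints and intrinsics are placeholders):

import cv2
import numpy as np

def estimate_relative_pose(kpts0, kpts1, K0, K1, thresh=0.5, method=cv2.USAC_ACCURATE):
    # Normalize keypoints by the intrinsics and scale the pixel threshold by the
    # mean focal length, then estimate the essential matrix and recover the pose.
    norm_thresh = thresh / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])
    kpts0_n = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    kpts1_n = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    E, mask = cv2.findEssentialMat(
        kpts0_n, kpts1_n, np.eye(3), method=method, prob=0.99999, threshold=norm_thresh
    )
    _, R, t, _ = cv2.recoverPose(E, kpts0_n, kpts1_n, np.eye(3), mask=mask)
    return R, t

Swapping method between cv2.RANSAC and cv2.USAC_ACCURATE is enough to move AUC@5 by several points, consistent with the gap observed above.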

Oh wow, there's a double definition of estimate_pose in utils.py... That's very unfortunate. Let me rerun the experiments too.

@AbyssGaze Yes, unfortunately, I'm able to reproduce your results. I'll rerun the eval on mega-1500 for all keypoint methods (we used the same incorrect estimator for all), and update the paper as soon as possible. Thanks a lot for finding this out, it would have been horrible to find out later!

Without adjusting the intrinsics (I guess this corresponds to the setting in LoFTR), we get (AUC@5 / AUC@10 / AUC@20):

DeDoDe-B: [49.4, 65.5, 77.7]
DeDoDe-G: [52.8, 69.7, 82.0]
ALIKED: [41.9, 58.4, 71.7]
DISK: [35.0, 51.4, 64.9]
SiLK: [39.9, 55.1, 66.9] # Note: Reran with threshold 0.05 instead of 0.01 as it seems to work better for mega-1500, should also run 0.01
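
For completeness, the AUC@5/10/20 numbers in this thread are the standard pose-error AUC used in SuperGlue/LoFTR-style evaluations (area under the recall-vs-error curve up to each threshold, with the pose error typically taken as the maximum of the angular rotation and translation errors). A self-contained reference version:

import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    # `errors` is one pose error (in degrees) per image pair; the AUC at each
    # threshold is the normalized area under the recall curve up to that threshold.
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        aucs.append(np.trapz(r, x=e) / t)
    return aucs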

Thank you for your reply. It is necessary to maintain the same RANSAC method throughout the entire evaluation. There is also a lot of work being done on subsequent outlier filtering, which plays a crucial role in improving overall performance.

It is necessary to maintain the same RANSAC method throughout the entire evaluation

Yes, definitely!

There is also a lot of work being done on subsequent outlier filtering, which plays a crucial role in improving overall performance.

Yes, in this work we wanted to see how good the performance is without any such filtering. Of course, we expect the performance of DeDoDe to increase further with filtering.

I'm closing this and opening a new issue #13 to track the updated results.