nianticlabs/scoring-without-correspondences

Question about batch generation for training

efreidun opened this issue · 7 comments

Hi there,

Thank you for the interesting work. I have a couple of questions regarding batch generation for the purpose of training the pipeline.

In the paper it is mentioned that batches of 56 image pairs are used at training, and 500 hypotheses are clustered into bins based on their pose error prior to sampling. I'd like to ask about this binning and sampling process:

  1. What is the pose error quantity that is used for binning, is it max(e^R, e^t)?
  2. How many bins are used?
  3. For each training batch, which I understand has 56 image pairs, how many hypotheses are sampled per image pair?

Thanks in advance!
Fereidoon

Hello!

Thank you for your interest in our work! Here are the answers:

  1. We use the max(e^R, e^t) error for binning the pose hypotheses.
  2. We use a total of 15 bins. We do not create the bins uniformly, since we are mainly interested in poses with low errors. Instead, we use a log function to split the poses so that low-error poses are sampled more often. We use ln(x)/0.35, but we did not do an extensive study on this function, and other ways of splitting the data might be better.
  3. During training, we use 56 image pairs and a single hypothesis per pair. We also tried using fewer image pairs with more hypotheses per pair, but did not observe a significant difference.
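
For readers who want to reproduce this, here is a minimal sketch of one way the binning and sampling described above could look. It is not the authors' code. The reply gives only the error quantity max(e^R, e^t), the bin count (15), and the function ln(x)/0.35. Everything else is an assumption: that errors are in degrees, that the bin index is the floor of ln(x)/0.35 clipped to [0, 14], and that a hypothesis is drawn by picking a non-empty bin uniformly and then a hypothesis uniformly within it. The function names are hypothetical.

```python
import numpy as np

NUM_BINS = 15      # stated in the reply above
LOG_SCALE = 0.35   # divisor in ln(x) / 0.35, stated in the reply above


def pose_error(err_R_deg, err_t_deg):
    """Binning quantity max(e^R, e^t); degrees are an assumption."""
    return np.maximum(err_R_deg, err_t_deg)


def bin_index(err_deg):
    """Map an error to a log-spaced bin (ASSUMED: floor, then clip to [0, 14])."""
    err = np.maximum(np.asarray(err_deg, dtype=np.float64), 1e-6)  # avoid log(0)
    idx = np.floor(np.log(err) / LOG_SCALE).astype(np.int64)
    return np.clip(idx, 0, NUM_BINS - 1)


def sample_hypothesis(err_R_deg, err_t_deg, rng):
    """ASSUMED scheme: uniform over non-empty bins, then uniform inside the bin."""
    bins = bin_index(pose_error(err_R_deg, err_t_deg))
    chosen_bin = rng.choice(np.unique(bins))
    candidates = np.flatnonzero(bins == chosen_bin)
    return int(rng.choice(candidates))
```

Under these assumptions, errors below about 1.4 degrees fall in bin 0 and an error of 180 degrees falls in bin 14, so the low-error range is split into many narrow bins and is therefore drawn more often than under uniform sampling over hypotheses.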

Hope this helps!
Axel

Thank you very much for the swift reply. That clarifies it!

May I also ask for some clarification on a doubt I have regarding the validation splits:

In the paper it is mentioned that the validation splits from LoFTR are followed. If I'm not mistaken, LoFTR uses 1500 test pairs from SuperGlue for ScanNet, and 1500 sampled pairs from “Sacre Coeur” and “St. Peter’s Square” for MegaDepth. However, as discussed in the appendix, SuperGlue (and also LoFTR) is trained on pairs with much higher visual overlap. So my understanding is that you draw your own image pair samples with [0.1, 0.4] visual overlap for both ScanNet and MegaDepth.

My doubt is:

  1. Which scenes do you use for drawing the samples? Is it the "test scans" for ScanNet and only “Sacre Coeur” and “St. Peter’s Square” for MegaDepth?
  2. How many sample pairs do you draw for validation of each dataset?
  3. By any chance do you use a separate test split (separate from training/validation) for reporting the results in the tables?

Thanks again!
Fereidoon

Hi!

You are right, we use a "custom" training, validation, and test split with image pairs with little overlap (10%-40%). Here are some clarifications on how we generate the training, validation and test sets:

  1. For indoor datasets, we use the standard training, validation, and test splits, but we sample our own image pairs. If you are interested in the results on the standard ScanNet test set (the one in SuperGlue), we report them in the supplementary material (Table 5). For the outdoor scenes, we use MegaDepth scenes for training and validation. We remove scenes that are in the IMW and in the test set of the PhotoTourism dataset. As you noted, there are "only" two test scenes in MegaDepth, and hence, we use the PhotoTourism test set as our testing scenes. Further details are in Section 5.2.

  2. For training and validation, we sample 90,000 and 30,000 image pairs, respectively. With our configuration, we didn't see further improvements when increasing the dataset size.

  3. As mentioned in answer 1), the training and validation sets are from different scenes than the test scenes we use to report the results in the paper tables.
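
As a rough illustration of the pair sampling described above, the sketch below filters candidate pairs to the 10%-40% overlap range and draws a fixed number of them. It is only a sketch under assumptions: the thread does not say how the overlap score is computed or how pairs are drawn, so the `(image_a, image_b, overlap)` tuple layout, the uniform random draw, and the function name `sample_pairs` are all hypothetical.

```python
import random


def sample_pairs(pairs, num_samples, min_overlap=0.1, max_overlap=0.4, seed=0):
    """Keep pairs whose overlap lies in [min_overlap, max_overlap], then draw some.

    pairs: iterable of (image_a, image_b, overlap) tuples, overlap in [0, 1].
    The overlap score itself is assumed to be precomputed elsewhere.
    """
    eligible = [p for p in pairs if min_overlap <= p[2] <= max_overlap]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(eligible)
    return eligible[:num_samples]
```

With this helper, a call such as `sample_pairs(train_pairs, 90_000)` would produce a training set of the size mentioned above, provided `train_pairs` contains only pairs from the training scenes.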

Hope this helps, thanks!
Axel

Ah I think I understand now. Thanks a ton for the clarification! I'll close the issue.

Cheers,
Fereidoon

Hi there again,

If I may reopen the issue with another question about the data generation (please let me know if you prefer a channel other than GitHub, e.g. email, for such questions):

I have tried running the provided pretrained model on a set of test samples that I generated from the ScanNet test split, with pairs having a visual overlap score of 0.1 to 0.4. However, the metrics I compute do not match the ones reported in the paper. Specifically, I see noticeably worse performance in the translation component of the final solution after optimization (worse by ~5 degrees in median error and by 0.05 in mAA).

As these metrics depend a lot on the underlying pool of hypotheses, I'm wondering if you have some additional filtering/preprocessing steps when you produce the training/evaluation samples, for example to remove planar degenerate scenarios? I'm curious because in Figure 1 of the appendix I see the error distributions only up to 90 degrees, whereas hypothesis errors can in theory reach 180 degrees.

Thanks in advance!

Best,
Fereidoon

Hey there,

Thanks for the follow-up question. Happy to have the conversation here, hopefully, it is also useful for others.

Regarding the drop in performance, I can think of a few reasons why that might be. To sample hypotheses, we use the USAC framework (from OpenCV). We follow the very same pipeline as in MAGSAC++, and hence, we also use all the additional checks implemented within it. Besides that, all the hypotheses are refined, which improves the accuracy of the computed poses. As a side note, to refine the poses, we rely on MAGSAC++ inliers.

Regarding the error distributions only reaching 90 degrees: that is only true for the translation error, and it is due to the sign ambiguity of the translation vector in the Essential/Fundamental matrix. See SuperGlue (angle_error_vec) and (compute_pose_error) for more details on how to handle that.
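
To make the 90-degree cap concrete, here is a small sketch modeled on the SuperGlue evaluation helpers mentioned above. It is not a copy of them: the argument layout of `compute_pose_error` below (separate `R_gt` and `t_gt` rather than a 4x4 transform) is a simplification chosen for this example. The key step is taking the minimum of the angle and 180 minus the angle, which treats `t` and `-t` as the same direction.

```python
import numpy as np


def angle_error_vec(v1, v2):
    """Angle in degrees between two vectors (scale-invariant)."""
    n = np.linalg.norm(v1) * np.linalg.norm(v2)
    cos = np.clip(np.dot(v1, v2) / n, -1.0, 1.0)
    return np.rad2deg(np.arccos(cos))


def angle_error_mat(R1, R2):
    """Geodesic angle in degrees between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.rad2deg(np.arccos(np.clip(cos, -1.0, 1.0)))


def compute_pose_error(R_gt, t_gt, R, t):
    """Return (translation error, rotation error) in degrees."""
    err_t = angle_error_vec(t, t_gt)
    # t and -t are indistinguishable here, so fold the error into [0, 90].
    err_t = np.minimum(err_t, 180.0 - err_t)
    err_R = angle_error_mat(R, R_gt)
    return err_t, err_R
```

With this convention, an estimated translation that points exactly opposite to the ground truth yields a translation error of 0 degrees, while the rotation error can still range up to 180 degrees.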

Please let me know if you have further questions. If you would find it useful, I could also add to the repo our test set, although it might take me a bit of time to clean up/prepare that data.

Thanks!

I'm closing this issue now since it has not had any activity for a while. Please feel free to reopen it if you have any other questions!

Thanks a lot!