amyxlase/relpose-plus-plus

Maximum of the input num_images

fearless-pilgrim opened this issue · 4 comments

Hi, great work! I have been reading RelPose++ recently, and after running your code I have a question: what is the maximum number of input images (num_images) for the transformer encoder? What if the image sequence is large and 8 images are not enough to cover the object?

Although trained with positional encodings with up to 8 images, we found that our method can still work for 20 images or more. The main bottleneck is the GPU memory necessary to run the transformer.

I'm not sure I understand what you mean by 8 images being insufficient to cover the object. Is the object not fully in the frame for each image? If the object is large, then the images should probably be taken from further back to ensure that it is fully visible.

I mean that symmetric objects can sometimes confuse the network, so more detailed views need to be supplied in the input image sequence. Is there a way to handle a large number of input images, for example by dividing the input sequence into several parts?

That is not something we have explored in depth. Perhaps one way to approach the problem is to choose a set of five or so images as "keyframes," and then use those keyframes with a sliding window of new frames.
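For illustration, the keyframe idea could be sketched as below. This is only a rough sketch of the batching logic, not something from the RelPose++ codebase: the function name `keyframe_batches` and its parameters are hypothetical, the default `max_batch=8` is just the training-time image count mentioned above, and the step of running the model on each batch and merging the predicted poses through the shared keyframes is not shown.

```python
from typing import List, Optional, Sequence


def keyframe_batches(
    num_images: int,
    keyframes: Sequence[int],
    max_batch: int = 8,
    stride: Optional[int] = None,
) -> List[List[int]]:
    """Group image indices into batches that all share the same keyframes.

    Hypothetical helper: each batch is the fixed keyframe indices followed by
    a window of the remaining frames, so every batch has at most `max_batch`
    images.
    """
    keyframes = list(keyframes)
    if len(set(keyframes)) != len(keyframes):
        raise ValueError("keyframes must be unique")
    if any(k < 0 or k >= num_images for k in keyframes):
        raise ValueError("keyframe index out of range")

    # Number of non-keyframe images that fit next to the keyframes.
    window = max_batch - len(keyframes)
    if window <= 0:
        raise ValueError("max_batch must exceed the number of keyframes")

    # By default the windows do not overlap; a smaller stride overlaps them.
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")

    key_set = set(keyframes)
    rest = [i for i in range(num_images) if i not in key_set]

    batches = []
    for start in range(0, len(rest), stride):
        batches.append(keyframes + rest[start:start + window])
        if start + window >= len(rest):
            break  # the last window already reaches the final frame
    return batches or [keyframes]


# Example: 20 images, 5 keyframes, batches of at most 8 images.
batches = keyframe_batches(20, keyframes=[0, 5, 10, 15, 19])
for batch in batches:
    print(batch)
```

Because every batch contains the same keyframes, their predicted poses could in principle serve as a common reference for aligning the batches with each other, though that alignment is left open here.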

OK, I think I understand your idea. Thank you for your patient answer!