Intuition on "4.1. Applying Transformations"
ShengyuH opened this issue · 2 comments
hi Brent,
Thanks for sharing this very interesting work! I have one question about how you apply transformations to the learnable factors, as described in section 4.1. My understanding is as follows: say we have 3 planes with 64 channels each and sample 8 transformations; then each transformation is responsible for 8 of the 64 channels. This means we project the queried points into 8 different groups of channels with 8 different transformation matrices, then concatenate these 8 groups of projected features. The hope is that at least one of the 8 groups of features corresponds to the "canonical factors". This is confusing to me, as I thought you would take a RANSAC-like approach that treats the 8 transformation matrices equally and keeps only the one with the best reconstruction loss. What you did actually rely on the assumption that 1/8 sampled canonical factors have sufficient representation capacity to recover the structure of the scene. Correct me if my understanding is wrong; I am looking forward to hearing your thoughts.
best regards,
Shengyu
Hi Shengyu,
Thanks for the excellent question! This is something we've discussed quite a bit.
One advantage of the current approach over trying to find a single best transform is that there's often more than one set of "canonical orientations" associated with a single scene. An example of this is the kitchen nerfstudio scene, which has a lot of stuff in it. Here's a render from a 64-channel, 8-transform K-Planes model:
When we unpack the transforms, one aligns itself to the piano. Norms for features for just that transform:
And another aligns itself more to some of the chairs / furniture. Norms for features for just that transform:
The splitting/concatenation approach we use makes it easy to reflect many structures in a given scene, which can be present even for more object-centric scenes.
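To make the splitting/concatenation idea concrete, here is a minimal plain-Python sketch. It is not the actual implementation: the helper names (`rotate_z`, `lookup`, `make_plane`, `query_features`) are hypothetical, the transforms are restricted to rotations about the z axis, lookups are nearest-neighbor rather than interpolated, and the three planes are combined with an elementwise product in the K-Planes style. All of these are simplifying assumptions for illustration.

```python
import math
import random

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis (stand-in for a general learned rotation)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def lookup(plane, u, v):
    """Nearest-neighbor lookup on a square feature plane; u, v nominally in [-1, 1]."""
    res = len(plane)
    i = min(res - 1, max(0, int((u + 1.0) / 2.0 * res)))
    j = min(res - 1, max(0, int((v + 1.0) / 2.0 * res)))
    return plane[i][j]  # list of channel values for one group

def make_plane(res, channels, rng):
    """Random feature plane of shape (res, res, channels), as nested lists."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(channels)]
             for _ in range(res)] for _ in range(res)]

def query_features(point, angles, planes_per_transform):
    """Concatenate the features produced by every transform's channel group."""
    features = []
    for angle, (xy, xz, yz) in zip(angles, planes_per_transform):
        x, y, z = rotate_z(point, angle)  # each group sees its own rotated point
        f_xy, f_xz, f_yz = lookup(xy, x, y), lookup(xz, x, z), lookup(yz, y, z)
        # Assumption: planes are combined by elementwise product.
        features.extend(a * b * c for a, b, c in zip(f_xy, f_xz, f_yz))
    return features

rng = random.Random(0)
num_channels, num_transforms, res = 64, 8, 4
group = num_channels // num_transforms  # 8 channels per transform
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_transforms)]
planes = [tuple(make_plane(res, group, rng) for _ in range(3))
          for _ in range(num_transforms)]
feat = query_features((0.3, -0.2, 0.5), angles, planes)
```

With 64 channels and 8 transforms, each transform owns 8 channels, and the concatenated feature vector for a query point is again 64-dimensional.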
> What you did actually rely on the assumption that 1/8 sampled canonical factors have sufficient representation capacity to recover the structure of the scene.
There are two reasons for this design decision:
- As our theoretical results try to illustrate, aligned factors can be extremely compact in terms of rank/channel count. Our representation implicitly prioritizes more aligned factors with fewer channels over fewer aligned factors with more channels.
- If 1/8 of the channels are insufficient, other transforms can step in to pick up the slack. You can see this a bit in the animated fox example on our website: one transform converges to the rightward direction quickly, and then another one converges to the upward direction soon after. Due to symmetry, these are actually equivalent.
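As a toy illustration of the first point (not the paper's theoretical result), consider a single line of pixels in a 2D grid. When the line is axis-aligned, the grid is a rank-1 matrix; when the same kind of line runs along the diagonal, the matrix is full rank. The `matrix_rank` helper below is a hypothetical utility written for this sketch, using exact rational arithmetic.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

n = 8
# A horizontal line of pixels: aligned with the grid axes.
aligned = [[1 if i == 3 else 0 for j in range(n)] for i in range(n)]
# A line of pixels along the diagonal: "tilted" relative to the grid axes.
tilted = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

aligned_rank = matrix_rank(aligned)  # 1
tilted_rank = matrix_rank(tilted)    # 8
```

The aligned structure needs one rank-1 factor, while the tilted one needs as many as the grid resolution, which is the intuition behind giving each transform only a few channels.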
That said, a bias for simplicity (and, less concretely, for more "self-arranging" systems) also played a role in decisions like this and I'm sure there's an enormous amount of room for improving results. I fully agree that a RANSAC-type extension makes sense: even basic heuristics for things like culling transforms or picking only the N best transforms from our bottleneck phase would be very interesting to me.
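One way such a heuristic might look, purely as a sketch: score each transform by the norm of its own slice of the concatenated feature vector, then keep the N highest-scoring transforms. Using the feature norm as the score is an assumption made here for illustration (a reconstruction-loss-based score would be another option), and `per_transform_norms` and `top_n_transforms` are hypothetical helper names.

```python
import math

def per_transform_norms(features, num_transforms):
    """L2 norm of each transform's contiguous channel group."""
    size = len(features) // num_transforms
    return [
        math.sqrt(sum(v * v for v in features[t * size:(t + 1) * size]))
        for t in range(num_transforms)
    ]

def top_n_transforms(scores, n):
    """Indices of the n highest-scoring transforms, returned in index order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:n])

# Toy feature vector: 4 transforms with 2 channels each.
features = [3.0, 4.0, 0.0, 0.0, 0.1, 0.1, 6.0, 8.0]
norms = per_transform_norms(features, num_transforms=4)
keep = top_n_transforms(norms, n=2)  # [0, 3]
```

In practice the scores would presumably be averaged over many query points rather than computed from a single feature vector.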
hi Brent,
Thanks for this very informative reply! Using multiple canonical factors to represent a complex scene makes sense to me now. Such an operation is safe as long as we have redundancy in the channels (I guess that's also what the F in the denominator is trying to highlight).
Again, congrats on this very interesting work! Some of my colleagues are working on Tensor Train Decomposition, and they have struggled to model "rotated" scenes. I think this work is quite a step towards more compact scene representations!
Best,
Shengyu