hkchengrex/Cutie

What are the C and P dimensions?


Rex, in your paper you refer to the C (or C^k) dimension, but I can't find where C is defined. Is it the embedding dimension?

Also, the code refers to a value P, as in B x CK x [HW/P] - Query keys. I'm assuming HW is image height and width, but what is P?

I'm working on strategies to reduce Cutie's memory requirements for high resolution images, but the dimensionality of the similarity/affinity matrix is really severe, so I'm looking for any opportunities to reduce this.

Hi.

In code, C in isolation denotes some channel size; the exact meaning is context-dependent. In the paper, C is a shared channel size for most of the operations, except the key tensor (which uses C^k). See the dimensions in the model config:

pixel_dim: 256
key_dim: 64
value_dim: 256
sensory_dim: 256
embed_dim: 256

where C^k is key_dim (64), and the other four values, all set to 256, jointly correspond to C. We experimented with different values before (which is why the config allows them to be set independently) but found it easier to tie them to a single value.
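
As a rough illustration of that mapping (a sketch only, not code from the repository; the helper name `split_paper_dims` is hypothetical), the config values above can be reduced to the paper's two symbols like this:

```python
# Values copied from the config excerpt above.
model_dims = {
    "pixel_dim": 256,
    "key_dim": 64,
    "value_dim": 256,
    "sensory_dim": 256,
    "embed_dim": 256,
}


def split_paper_dims(dims):
    """Return (C, C_k), assuming every non-key dimension is tied to one value.

    Hypothetical helper for illustration; raises if the non-key dimensions
    are set to different values, in which case a single C is not defined.
    """
    c_k = dims["key_dim"]
    shared = {value for name, value in dims.items() if name != "key_dim"}
    if len(shared) != 1:
        raise ValueError(f"non-key dims are not tied to one value: {sorted(shared)}")
    return shared.pop(), c_k


C, C_k = split_paper_dims(model_dims)
print(f"C = {C}, C^k = {C_k}")
```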

P is inherited from XMem, where it denotes the number of prototypes (Section 3.3 of the XMem paper). Semantically, [HW/P] denotes the total number of query elements: during memory reading it is the number of pixels HW, and during memory potentiation it is the number of prototypes P.
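
To make the notation concrete, here is a minimal sketch of the two cases and of how the affinity matrix scales with each. This is not code from Cutie or XMem: the function names, the 30 x 54 feature-map size, the 5 memory frames, and P = 128 are all illustrative assumptions.

```python
def query_key_shape(batch, key_dim, height, width, num_prototypes=None):
    """Shape of the B x CK x [HW/P] query-key tensor.

    The last axis is H*W during memory reading, or P (the number of
    prototypes) when num_prototypes is given, as in memory potentiation.
    """
    n_query = height * width if num_prototypes is None else num_prototypes
    return (batch, key_dim, n_query)


def affinity_elements(batch, n_memory, n_query):
    """Element count of an affinity matrix of shape B x n_memory x n_query."""
    return batch * n_memory * n_query


# Illustrative numbers only (assumptions, not values from the repository).
B, CK = 1, 64
H, W = 30, 54            # assumed feature-map size
T = 5                    # assumed number of memory frames
n_memory = T * H * W     # memory elements contributed by T frames

reading = query_key_shape(B, CK, H, W)
potentiation = query_key_shape(B, CK, H, W, num_prototypes=128)

print("memory reading query keys:     ", reading)
print("memory potentiation query keys:", potentiation)
print("affinity elements (reading):     ", affinity_elements(B, n_memory, reading[2]))
print("affinity elements (potentiation):", affinity_elements(B, n_memory, potentiation[2]))
```

In the reading case both axes of the affinity matrix grow with H*W, so the element count grows quadratically with the number of feature-map pixels; in the prototype case only the memory axis does.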

Ah, so [HW/P] means HW or P, not HW divided by P. I see, thank you.