What are the C and P dimensions?

Rex, in your paper you refer to the C (or C^k) dimension, but I can't find a reference as to what this C is. Is it the embedding dimension?

Also, the code refers to a value P, as in B x CK x [HW/P] - Query keys. I'm assuming HW is image height and width, but what is P?

I'm working on strategies to reduce Cutie's memory requirements for high resolution images, but the dimensionality of the similarity/affinity matrix is really severe, so I'm looking for any opportunities to reduce this.

Hi.

In the code, a bare C denotes a channel size whose exact meaning depends on context. In the paper, C is the channel size shared by most operations; the exception is the key tensor, which uses C^k. See

Cutie/cutie/config/model/base.yaml (lines 4 to 8 in 2ac7ac2):

    pixel_dim: 256
    key_dim: 64
    value_dim: 256
    sensory_dim: 256
    embed_dim: 256

where C^k is 64 (key_dim), and the other four dimensions, all set to 256, jointly correspond to C. We experimented with different values earlier, which is why the config allows setting them separately, but found it easier to tie them to a single value.
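As a rough illustration of that mapping, here is a minimal sketch that copies the five values from the config above into a plain dictionary and recovers C and C^k from them. The names `model_dims`, `C`, and `C_k` are made up for this example and are not identifiers from the Cutie code base.

```python
# Values copied from cutie/config/model/base.yaml as quoted above.
# The dictionary and variable names are illustrative, not Cutie identifiers.
model_dims = {
    "pixel_dim": 256,
    "key_dim": 64,
    "value_dim": 256,
    "sensory_dim": 256,
    "embed_dim": 256,
}

# C^k in the paper corresponds to key_dim.
C_k = model_dims["key_dim"]

# Every other dimension is tied to one shared value, which the paper calls C.
shared = {name: dim for name, dim in model_dims.items() if name != "key_dim"}
assert len(set(shared.values())) == 1, "non-key dimensions are expected to be tied"
C = next(iter(shared.values()))

print(f"C = {C}, C^k = {C_k}")
```

If someone untied the dimensions in their own config, the assertion would fail, which is a cheap way to notice that "C" no longer has a single meaning.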

P is notation inherited from XMem, where it denotes the number of prototypes (Section 3.3 of the XMem paper). Semantically, [HW/P] is the total number of query elements: during memory reading it is the number of pixels HW, and during memory potentiation it is the number of prototypes P.
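To make the two readings of [HW/P] concrete, here is a small sketch in plain Python. It rests on several assumptions that are not stated in this thread: a feature stride of 16, an affinity matrix with one entry per (memory element, query element) pair, float32 storage, and an illustrative prototype count of 128. The function names are hypothetical.

```python
def num_query_elements(height, width, stride=16, num_prototypes=None):
    """Size of the last axis of a B x CK x [HW/P] query-key tensor.

    Memory reading: one query per feature-map pixel (HW).
    Memory potentiation: one query per prototype (P).
    The stride of 16 is an assumption made for this sketch.
    """
    if num_prototypes is not None:
        return num_prototypes
    return (height // stride) * (width // stride)


def affinity_bytes(batch, num_memory, num_query, bytes_per_element=4):
    """Rough size of a dense B x num_memory x num_query affinity matrix."""
    return batch * num_memory * num_query * bytes_per_element


# Memory reading on a 480 x 864 frame: the query axis is HW.
hw = num_query_elements(480, 864)

# Memory potentiation: the query axis is P (128 is an illustrative value).
p = num_query_elements(480, 864, num_prototypes=128)

# Assume five memory frames at the same resolution.
num_memory = 5 * hw
print(hw, p)
print(affinity_bytes(1, num_memory, hw), affinity_bytes(1, num_memory, p))
```

Under these assumptions the reading-time matrix grows with the square of the number of feature-map pixels, which is why it dominates memory at high resolution, while the potentiation-time matrix grows only linearly.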

Ah, so [HW/P] means HW or P, not HW divided by P. I see, thank you.
