wilson1yan/VideoGPT-Paper

embedding size is not a multiple of 3?

Closed this issue · 4 comments

Hi, I'm trying to reproduce the results in the paper, but I'm having some trouble using the positional embedding with the provided transformer embedding sizes. The paper uses embedding sizes of 1024 and 512, but according to this line:

`assert embd_dim % n_dim == 0, f"{embd_dim} % {n_dim} != 0"`

the embedding size should be a multiple of `n_dim`, which is supposed to be 3, right? Am I missing something? Thanks!

I simplified some of the code and removed some parts related to using multiple codebooks, so the input to the transformer originally had 4 dimensions (T, H, W, # codebooks), hence the embedding sizes of 512 / 1024. It should work to just choose some multiple of 3 near 512 / 1024, and it should produce very similar results.
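To illustrate why the divisibility requirement exists, here is a minimal sketch of a broadcast/axial-style positional embedding that splits `embd_dim` evenly across the spatiotemporal axes. The class name and initialization details below are illustrative, not the repo's exact code:

```python
import torch
import torch.nn as nn

class AddAxialPosEmbed(nn.Module):
    """Sketch: embd_dim is split evenly across the n_dim spatiotemporal axes
    (e.g. T, H, W), which is why embd_dim % n_dim == 0 is required."""
    def __init__(self, shape, embd_dim):
        super().__init__()
        n_dim = len(shape)                      # 3 for (T, H, W)
        assert embd_dim % n_dim == 0, f"{embd_dim} % {n_dim} != 0"
        self.shape = shape
        self.chunk = embd_dim // n_dim          # channels allotted to each axis
        # one learned table per axis, each covering its chunk of the channels
        self.embs = nn.ParameterList(
            [nn.Parameter(torch.randn(d, self.chunk) * 0.01) for d in shape]
        )

    def forward(self, x):
        # x: (batch, T, H, W, embd_dim)
        pieces = []
        for i, emb in enumerate(self.embs):
            # reshape each axis table so it broadcasts over the other axes
            view = [1] * len(self.shape) + [self.chunk]
            view[i] = self.shape[i]
            pieces.append(emb.view(*view).expand(*self.shape, self.chunk))
        # concatenating the per-axis chunks recovers a full embd_dim vector
        return x + torch.cat(pieces, dim=-1).unsqueeze(0)

# Picking a multiple of 3 near 512, e.g. 510, satisfies the assert for (T, H, W) inputs.
pos = AddAxialPosEmbed(shape=(4, 16, 16), embd_dim=510)
out = pos(torch.zeros(2, 4, 16, 16, 510))   # -> (2, 4, 16, 16, 510)
```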

Interesting... I think that makes sense to me. A question unrelated to this thread but relevant to your response: what is the point of using more than one codebook, other than increasing the number of codes in the codebook?

In general, using multiple codebooks is easier to optimize and less prone to codebook collapse. It also allows combinatorially more expressivity in the codebook, with only linear scaling in the number of codes, compared to using a single codebook.
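To make the scaling argument concrete, here is a quick back-of-the-envelope calculation (the numbers are hypothetical, not from the paper): with `n_codebooks` codebooks of `codebook_size` codes each, the quantizer only learns `n_codebooks * codebook_size` embedding vectors, but it can emit `codebook_size ** n_codebooks` distinct composite codes per position:

```python
# Hypothetical numbers to illustrate the scaling argument, not values from the paper.
codebook_size = 1024   # codes per codebook
for n_codebooks in (1, 2, 4):
    stored = n_codebooks * codebook_size      # embedding vectors to learn (linear)
    distinct = codebook_size ** n_codebooks   # composite codes per position (combinatorial)
    print(f"{n_codebooks} codebook(s): {stored:>6} stored vectors, "
          f"{distinct:.2e} distinct composite codes")
# 1 codebook(s):   1024 stored vectors, 1.02e+03 distinct composite codes
# 2 codebook(s):   2048 stored vectors, 1.05e+06 distinct composite codes
# 4 codebook(s):   4096 stored vectors, 1.10e+12 distinct composite codes
```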

Oh, these are good points! Thank you for your patient reply! :)