TencentARC/SEED-Voken

Why not use `Adaptive GroupNorm` in the decoder? And are the samplers asymmetrical?

  1. Does the GroupNorm shortcut trick with the LFQ embedding not work very well?
    [Screenshot: screenshot-20240716-150109]
  2. Samplers in class Encoder and class Decoder
....
        # Encoder: every level except the last one ends with a stride-2 downsample.
        for i_level in range(self.num_blocks):
            block = nn.ModuleList()
            block_in = ch*in_ch_mult[i_level] #[1, 1, 2, 2, 4]
            block_out = ch*ch_mult[i_level] #[1, 2, 2, 4]
            for _ in range(self.num_res_blocks):
                block.append(ResBlock(block_in, block_out))
                block_in = block_out

            down = nn.Module()
            down.block = block
            if i_level < self.num_blocks - 1:
                down.downsample = nn.Conv2d(block_out, block_out, kernel_size=(3, 3), stride=(2, 2), padding=1)

            self.down.append(down)
....
        # Decoder: levels are built from last to first; every level except level 0 ends with an upsample.
        for i_level in reversed(range(self.num_blocks)):
            block = nn.ModuleList()
            block_out = ch*ch_mult[i_level]
            for i_block in range(self.num_res_blocks):
                block.append(ResBlock(block_in, block_out))
                block_in = block_out

            up = nn.Module()
            up.block = block
            if i_level > 0:
                up.upsample = Upsampler(block_in)
            self.up.insert(0, up)

This means:

    Down    : Y   Y   Y   Y   N
    BlockNum: 0   1   2   3   4
    Up      : Y   Y   Y   Y   N
    BlockNum: 4   3   2   1   0

Are the encoder's downsampling and the decoder's upsampling asymmetrical at the same level (dimension)?
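To make the question concrete, the two loop conditions quoted above can be traced in plain Python. This is only a sketch of the sampler placement, not code from the repository: the function names are hypothetical, the 5 levels and the 256-pixel input are assumed values, and `Upsampler` is assumed to double the spatial size.

```python
def encoder_resolutions(num_blocks, resolution):
    """Return [(level, resolution its ResBlocks run at)] and the final latent size."""
    trace = []
    for i_level in range(num_blocks):
        trace.append((i_level, resolution))
        if i_level < num_blocks - 1:  # same condition as the quoted Encoder loop
            resolution //= 2          # stride-2 conv halves height and width
    return trace, resolution


def decoder_resolutions(num_blocks, resolution):
    """Return [(level, resolution its ResBlocks run at)] and the final output size."""
    trace = []
    for i_level in reversed(range(num_blocks)):
        trace.append((i_level, resolution))
        if i_level > 0:               # same condition as the quoted Decoder loop
            resolution *= 2           # assumption: Upsampler doubles height and width
    return trace, resolution


enc_trace, latent = encoder_resolutions(num_blocks=5, resolution=256)
dec_trace, recon = decoder_resolutions(num_blocks=5, resolution=latent)
print(enc_trace)  # [(0, 256), (1, 128), (2, 64), (3, 32), (4, 16)]
print(dec_trace)  # [(4, 16), (3, 32), (2, 64), (1, 128), (0, 256)]
```

Under these assumptions both sides contain four samplers (16x overall). What differs is where the sampler sits: after the blocks of levels 0 to 3 in the encoder, and after the blocks of levels 4 to 1 in the decoder.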

Hi, thanks for your interest in our work. For the first question: in the currently released version, we use a relatively small model to validate our training recipe. Moreover, following the ablation studies provided in MAGVIT-2, we have temporarily ablated AdaGN (adaptive GroupNorm). We are now actively training the full-gear generator, which is well aligned with the MAGVIT-2 paper. Stay tuned for the next big update!
I do not quite understand the second question. Could you make it more specific?
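For readers unfamiliar with the AdaGN layer discussed here, below is a minimal PyTorch sketch of the general idea: a GroupNorm without learned affine parameters, whose per-channel scale and shift are instead predicted from the quantized latent. It is not taken from this repository or from the MAGVIT-2 paper; the class name `AdaptiveGroupNormSketch`, the 1x1-convolution projections, and the nearest-neighbour resize are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGroupNormSketch(nn.Module):
    """Hypothetical AdaGN layer: GroupNorm modulated by the quantized latent z_q."""

    def __init__(self, z_channels, channels, num_groups=32, eps=1e-6):
        super().__init__()
        # affine=False: scale and shift come from z_q, not from learned constants.
        self.norm = nn.GroupNorm(num_groups, channels, eps=eps, affine=False)
        # Assumption: 1x1 convs map the latent to per-channel scale and shift maps.
        self.to_scale = nn.Conv2d(z_channels, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(z_channels, channels, kernel_size=1)

    def forward(self, x, z_q):
        # Resize the latent to the feature map's spatial size before projecting.
        z = F.interpolate(z_q, size=x.shape[-2:], mode="nearest")
        return self.norm(x) * (1 + self.to_scale(z)) + self.to_shift(z)


layer = AdaptiveGroupNormSketch(z_channels=18, channels=64)
x = torch.randn(2, 64, 32, 32)    # decoder feature map (illustrative sizes)
z_q = torch.randn(2, 18, 16, 16)  # quantized latent (illustrative sizes)
out = layer(x, z_q)
print(out.shape)
```

The output has the same shape as `x`, so a layer like this could stand in for a plain GroupNorm inside a decoder block, provided the quantized latent is passed along to it.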

First of all, thank you for answering the first question.
Regarding the second question: the entire codec is built from 5 levels of ResNet blocks, each with or without a down/up sampler.
The encoder downsamples by 2x at each of levels 0 to 3, for a total of 16x compression in height and width; level 4, the level with the highest dimension, is not downsampled.
The decoder, in reverse, upsamples from level 4 down to level 1, and the last level, level 0, is not upsampled.
The final result is a codec with an asymmetric compression ratio, which is inconsistent with the 4x8x8 compression structure in the MAGVIT-2 appendix. From the appendix figure above, it can be seen that level 4 of the decoder, the one closest to the discrete part, has no T-Causal upsampling operation.

By the way, how is the conditional discriminator designed?
As far as I can see, the `forward` of NLayerDiscriminator has no implementation for adding a condition (cond).

@RobertLuo1 Is the full-gear generator still limited to image tokenization, or will you also reproduce the video tokenizer?

@shinshiner Hi, currently we still work on image tokenization and the subsequent autoregressive generation. Later we will upgrade the tokenizer to video.