TencentARC/SEED-Voken

Would you consider using web images to train a more robust version?

Closed this issue · 11 comments

From the original MAGVIT2 paper, they got consistently good results with larger web-image training data.

[screenshot from the MAGVIT2 paper]

Thanks for your interest in our work. In the initial version, we use ImageNet for a fair comparison with other VQ methods. Larger training data (e.g., the Open Images and LAION datasets) would be beneficial for better reconstruction performance. We are planning to scale up the data size and update the results in the near future.

Hi @MonolithFoundation ,
We haven't tried it, but we anticipate some potential difficulties. The MAGVIT2 tokenizer captures lower-level semantics than CLIP-style encoders, as it is trained with a reconstruction objective. This might result in inferior performance when used directly for understanding tasks. However, there should be solutions, which are also crucial for properly unifying understanding and generation in the next generation of multimodal foundation models.

@yxgeee thanks for the feedback.

I also agree there should be solutions; that's also what I'm aiming for in unifying understanding and generation.

Currently I'm giving it a brute-force try, but I found that the MAGVIT2 encoder outputs Lookup-Free Quantization (LFQ) features, which are binary (each dimension is -1 or +1). I'd like to ask for some help on:

  1. How should these features be handled? What would be the proper way to feed them into the LLM as conditional embeddings?
  2. What other ways are there to make use of MAGVIT2, currently the best image tokenizer?

Hi @MonolithFoundation ,

  1. As done in their original paper, the binary features produced by the MAGVIT2 tokenizer are only used to provide token indices, and the embedding of each index needs to be further learned together with the LLM (known as the visual vocabulary). Specifically, each 18-dim binary feature can represent an index in the range [0, $2^{18}$). If there are $16 \times 16$ tokens for one image, 256 tokens are used to represent it in the LLM, where their indices are produced by the MAGVIT2 encoder and the corresponding vocabulary embeddings are learned in the LLM (see the sketch after this list).
  2. We are also exploring it. We can discuss it further if we have better solutions.
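
For concreteness, here is a minimal PyTorch sketch of the scheme in point 1 (not code from this repo; the function/class names and `llm_hidden_size=4096` are illustrative assumptions): the 18-dim {-1, +1} codes are packed into integer indices, and an embedding table over the $2^{18}$-entry visual vocabulary is learned together with the LLM.

```python
# Minimal sketch, assuming an LFQ encoder that emits 18-dim {-1, +1} codes per position.
import torch
import torch.nn as nn

CODE_DIM = 18                  # bits per token produced by the LFQ quantizer
VISUAL_VOCAB = 2 ** CODE_DIM   # 262144 possible visual tokens

def lfq_codes_to_indices(codes: torch.Tensor) -> torch.Tensor:
    """codes: (B, H, W, CODE_DIM) with values in {-1, +1} -> (B, H, W) int64 indices."""
    bits = (codes > 0).long()                                     # map -1/+1 to 0/1
    weights = 2 ** torch.arange(CODE_DIM, device=codes.device)    # binary place values
    return (bits * weights).sum(dim=-1)                           # integer in [0, 2**18)

class VisualVocabulary(nn.Module):
    """Learnable embeddings for visual token indices, trained jointly with the LLM."""
    def __init__(self, llm_hidden_size: int = 4096):              # hidden size is assumed
        super().__init__()
        self.embed = nn.Embedding(VISUAL_VOCAB, llm_hidden_size)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # (B, H, W) indices -> (B, H*W, hidden) sequence fed to the LLM
        return self.embed(indices.flatten(1))

# Usage: 16x16 tokens per image -> a 256-token visual sequence for the LLM.
codes = torch.randint(0, 2, (1, 16, 16, CODE_DIM)).float() * 2 - 1  # fake LFQ output
indices = lfq_codes_to_indices(codes)                               # (1, 16, 16)
visual_embeds = VisualVocabulary()(indices)                         # (1, 256, 4096)
```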

@yxgeee thanks for the pointers.

From previous works such as SEED and LaVIT, it looks like they directly map the codebook indices to <img_0> <img_1> ... etc. and embed them in sentences directly, without adding a conditional embedding.

Also, in LLaVA, they just use -200 to represent the image token id, and actually fill the embedding with the feature map.

So what would be a proper way to send codebook indices to the LLM? Like SEED does? If we send the indices into the LLM, is the embedding still necessary?

@MonolithFoundation Check out Chameleon from Meta; the code is open-sourced on GitHub.
I have tried early-fusion training: image patches (without a CLIP encoder) mixed with text tokens, like Fuyu-8B. It's very hard to train and becomes unstable as the model parameters increase. Meta's Chameleon, on the other hand, uses a vision tokenizer and some other tricks, and has shown this method has higher potential.

@MonolithFoundation Yes, you can refer to SEED. The <img_0> <img_1> ... are exactly the indices I mentioned. The difference is that SEED uses a visual vocabulary of 8192 while MAGVIT2 uses $2^{18}$. You can also refer to Chameleon's code (mentioned by @eisneim); I think it should be similar. However, SEED provides full training code while Chameleon does not. A rough sketch of this scheme is below.
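
A hedged sketch of that SEED-style setup with Hugging Face `transformers` (the model name, vocabulary size, and the helper `indices_to_text` are assumptions for illustration, not SEED's actual code): the codebook indices are written out as literal `<img_k>` tokens added to the text tokenizer, and the LLM's resized embedding table supplies the learned embeddings, so no separate conditional embedding has to be fed in.

```python
# Sketch only: model name and vocabulary size are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

VISUAL_VOCAB = 8192  # SEED-sized vocabulary; MAGVIT2 would need 2**18 entries

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register <img_0> ... <img_8191> as new tokens and grow the embedding table;
# the new rows are learned during multimodal training.
img_tokens = [f"<img_{k}>" for k in range(VISUAL_VOCAB)]
tokenizer.add_tokens(img_tokens)
model.resize_token_embeddings(len(tokenizer))

def indices_to_text(indices):
    """Turn codebook indices from the image tokenizer into an in-line token string."""
    return "".join(f"<img_{int(k)}>" for k in indices)

# Interleave image tokens with text, then tokenize as usual.
prompt = "Describe the image: " + indices_to_text([12, 345, 6789])
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```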

@yxgeee From what I can tell from the SEED inference code, it just encodes the raw <img_0> ... tokens with the text tokenizer, without feeding in a conditional embedding.

Is the same approach applicable to MAGVIT2?

Also, I just realized that MAGVIT2 produces too many tokens at the moment: it's about 18x17x26 when my input resolution is around 512.

That's too many for understanding.

Hi @MonolithFoundation , may I ask what image size you input to the tokenizer? We also release the code for transformer training; see https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/models/cond_transformer.py#L99. In the transformer training stage, each input image is tokenized into H' × W' × 1, where the 1 is the index dimension. The 18-bit code is converted into a single integer index, as you can check in https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/modules/vqvae/lookup_free_quantize.py#L260.
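
Roughly, the linked conversion does something like the following sketch (the shapes, helper names, and the 16x-downsampling example are assumptions, not the repo's exact API): each 18-bit {-1, +1} code is packed into one integer, the H' × W' grid of integers is flattened into the token sequence the transformer consumes, and the inverse unpacks indices back into codes for the decoder.

```python
# Sketch under assumed shapes; not the exact code from lookup_free_quantize.py.
import torch

NUM_BITS = 18  # codebook size 2**18

def pack_bits(codes: torch.Tensor) -> torch.Tensor:
    """(B, NUM_BITS, H, W) codes in {-1, +1} -> (B, H*W) integer indices."""
    bits = (codes > 0).long()
    weights = (2 ** torch.arange(NUM_BITS, device=codes.device)).view(1, -1, 1, 1)
    indices = (bits * weights).sum(dim=1)   # (B, H, W), each index in [0, 2**18)
    return indices.flatten(1)               # flatten the spatial grid to a sequence

def unpack_bits(indices: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """(B, H*W) indices -> (B, NUM_BITS, H, W) codes in {-1, +1} for the decoder."""
    weights = 2 ** torch.arange(NUM_BITS, device=indices.device)
    bits = (indices.view(-1, h, w).unsqueeze(1) // weights.view(1, -1, 1, 1)) % 2
    return bits.float() * 2 - 1

# e.g. with 16x downsampling (as in the 16x16-token example above),
# a 512x512 input would give a 32x32 grid -> 1024 indices per image.
codes = torch.randint(0, 2, (2, NUM_BITS, 32, 32)).float() * 2 - 1
seq = pack_bits(codes)                      # (2, 1024)
recovered = unpack_bits(seq, 32, 32)
assert torch.equal(recovered, codes)
```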