Will you consider using web images to train a more robust version?
Thanks for your interest in our work. In the initial version, we use ImageNet for a fair comparison with other VQ methods. Training with larger datasets (e.g., Open-Images and LAION) should yield better reconstruction performance. We are planning to scale up the data size and will update the results in the near future.
Hi @MonolithFoundation ,
We haven't tried it, but we anticipate some potential difficulties. The MAGVIT2 tokenizer captures lower-level semantics than CLIP-series encoders, as it is trained with a reconstruction objective. This might result in inferior performance when used directly for understanding tasks. However, there should be solutions, which are also crucial for properly unifying understanding and generation in the next generation of multimodal foundation models.
@yxgeee thanks for the feedback.
I also agree there should be solutions; that's also what I'm aiming at for unifying understanding and generation.
Currently I'd like to give it a brute-force try, but I found that the MAGVIT2 encoder outputs Lookup-Free Quantization features, so the features are binary (either -1 or 1). I'd like to ask for some help with:
- How to deal with this: what would be the proper way to feed it into the LLM's conditional embedding?
- What other ways are there to make use of MAGVIT2, currently the best image tokenizer?
Hi @MonolithFoundation ,
- As done in their original paper, the binary features produced by the MAGVIT2 tokenizer are only used to provide token indices, and the embedding of each index needs to be learned together with the LLM (known as the visual vocabulary). Specifically, each 18-dim binary feature represents an index in the range $[0, 2^{18})$. If there are $16 \times 16$ tokens for one image, 256 tokens would be used to represent it in the LLM, where the indices are produced by the MAGVIT2 encoder and the corresponding vocabulary embeddings are learned in the LLM.
- We are also exploring this. We can discuss it further if we have better solutions.
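For illustration, here is a minimal sketch of that pipeline. The `binary_to_indices` helper, the shapes, and the hidden size are illustrative assumptions, not the repo's actual code:

```python
# Minimal sketch (illustrative only): turn 18-dim {-1,+1} LFQ codes into integer
# token indices, then look them up in a "visual vocabulary" learned with the LLM.
import torch
import torch.nn as nn

def binary_to_indices(binary_feats: torch.Tensor) -> torch.Tensor:
    """Map {-1, +1} binary codes of shape (B, N, 18) to integer indices in [0, 2^18)."""
    bits = (binary_feats > 0).long()                                  # {-1,+1} -> {0,1}
    weights = 2 ** torch.arange(bits.shape[-1], device=bits.device)   # (18,)
    return (bits * weights).sum(dim=-1)                               # (B, N)

codebook_size = 2 ** 18      # 262144 possible indices from 18-bit codes
llm_hidden_dim = 4096        # hypothetical LLM hidden size

# The visual vocabulary: an embedding table trained together with the LLM.
visual_vocab = nn.Embedding(codebook_size, llm_hidden_dim)

# Example: a 16x16 token grid for one image -> 256 indices -> 256 LLM embeddings.
binary_feats = torch.randint(0, 2, (1, 256, 18)).float() * 2 - 1
indices = binary_to_indices(binary_feats)    # shape (1, 256)
visual_embeds = visual_vocab(indices)        # shape (1, 256, 4096)
# visual_embeds can then be concatenated with text token embeddings as LLM input.
```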
@yxgeee thanks for the pointers.
From previous work such as SEED and LaVIT, it looks like they directly represent the codebook indices as <img_0> <img_1> ... tokens and embed them in the sentence directly, without adding a conditional embedding.
Also, in LLaVA, they just use -200 to represent the image token id and actually fill the embedding with the feature map (roughly as in the sketch below).
So, what would be a proper way to send codebook indices to the LLM, like SEED does? If we send the indices into the LLM, is the embedding still necessary?
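Roughly, my understanding of the LLaVA mechanism is the following (just an illustrative sketch with made-up sizes and helper names, not their actual code):

```python
# Sketch of a LLaVA-style scheme: a placeholder id (-200) marks the image slot in
# input_ids, and that slot is filled with projected vision feature-map tokens.
import torch
import torch.nn as nn

IMAGE_TOKEN_ID = -200
hidden_dim, num_patches, vision_dim = 4096, 576, 1024    # illustrative sizes

text_embed = nn.Embedding(32000, hidden_dim)
projector = nn.Linear(vision_dim, hidden_dim)            # vision -> LLM space

def build_inputs_embeds(input_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    """Replace the IMAGE_TOKEN_ID position with projected image features."""
    pieces = []
    for tid in input_ids.tolist():
        if tid == IMAGE_TOKEN_ID:
            pieces.append(projector(image_feats))           # (num_patches, hidden_dim)
        else:
            pieces.append(text_embed(torch.tensor([tid])))  # (1, hidden_dim)
    return torch.cat(pieces, dim=0)                         # (seq_len + num_patches - 1, hidden_dim)

input_ids = torch.tensor([1, 887, IMAGE_TOKEN_ID, 2])       # text ids with one image slot
image_feats = torch.randn(num_patches, vision_dim)          # e.g. CLIP feature map tokens
inputs_embeds = build_inputs_embeds(input_ids, image_feats)
```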
@MonolithFoundation Check out Chameleon from Meta; the code is open-sourced on GitHub.
I have tried early-fusion training: image patches (without a CLIP encoder) mixed with text tokens, like Fuyu-8B. It is very hard to train and becomes unstable as the model parameter count increases. Meta's Chameleon, on the other hand, using a vision tokenizer and some other tricks, has shown that this approach has higher potential.
@MonolithFoundation Yes, you can refer to SEED. The <img_0> <img_1> ... tokens are exactly the indices I mentioned. The difference is that SEED uses an 8192-entry visual vocabulary while MAGVIT2 uses $2^{18}$ (262144) entries.
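For example, a rough sketch of the SEED-style scheme (the base model name and sizes are placeholders; with MAGVIT2's $2^{18}$ codebook the added vocabulary would be 262144 entries rather than 8192):

```python
# Image indices are rendered as special tokens (<img_0> ... <img_N>) added to the
# text tokenizer, so the LLM learns their embeddings via its input embedding table.
from transformers import AutoTokenizer, AutoModelForCausalLM

codebook_size = 8192                         # SEED-sized vocabulary for illustration
img_tokens = [f"<img_{i}>" for i in range(codebook_size)]

base = "meta-llama/Llama-2-7b-hf"            # hypothetical base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

tokenizer.add_tokens(img_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))   # new rows are learned during training

# Visual indices from the image tokenizer (e.g., [514, 90, 7]) become plain text:
prompt = "Describe this image: " + "".join(f"<img_{i}>" for i in [514, 90, 7])
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# No separate conditional embedding is needed: the embeddings of the <img_k> tokens
# are trained jointly with the LLM and play the role of the visual vocabulary.
```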
@yxgeee From what I can tell from the SEED inference code, it just encodes the raw <img_0> ... tokens with the text tokenizer, without feeding in a conditional embedding.
Is the same approach applicable to MAGVIT2?
Also, I just realized that MAGVIT2 produces too many tokens at the moment: about 18x17x26 for an input resolution around 512.
That is too much for understanding.
Hi @MonolithFoundation, may I ask what size of image you are feeding into the tokenizer? We also release the code for transformer training; see https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/models/cond_transformer.py#L99. In the transformer training stage, each input image is tokenized into $H' \times W' \times 1$, where the 1 specifies the index. The 18-bit code is transformed into a single numerical index, as you can check at https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/modules/vqvae/lookup_free_quantize.py#L260.
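For illustration, a minimal sketch of that bits-to-index conversion (not a copy of lookup_free_quantize.py; the shapes and the 17x26 grid are just illustrative):

```python
# The encoder's 18-bit binary map of shape (B, 18, H', W') collapses to one
# integer index per spatial position, i.e. (B, H', W').
import torch

def bits_to_token_map(binary_map: torch.Tensor) -> torch.Tensor:
    """(B, 18, H', W') with values in {-1, +1} -> (B, H', W') integer indices."""
    bits = (binary_map > 0).long()                                    # {-1,+1} -> {0,1}
    weights = 2 ** torch.arange(bits.shape[1], device=bits.device)    # (18,)
    return (bits * weights.view(1, -1, 1, 1)).sum(dim=1)              # (B, H', W')

# Example: a hypothetical 17x26 latent grid (roughly what a ~512-sized,
# non-square input might produce) yields 17 * 26 = 442 token indices per image.
binary_map = torch.randint(0, 2, (1, 18, 17, 26)).float() * 2 - 1
token_map = bits_to_token_map(binary_map)       # shape (1, 17, 26)
print(token_map.shape, token_map.max().item() < 2 ** 18)
```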