Any recommended code for converting mmc4 into WebDataset format instead of jsonl format?

Question

roboswell opened this issue 2 years ago · 8 comments

I noticed that when downloading mmc4-ff, it downloads jsonl files. However, the Open Flamingo model requires dataset shards for training to be in WebDataset format. Could you please recommend code for converting jsonl database files into WebDataset shard format?

Answer 1 · 2023-04-17T18:05:48.000Z

cc @anas-awadalla

Answer 2 · 2023-04-17T19:02:28.000Z

Yes, I will share a script soon :)

Answer 3 · 2023-04-19T19:48:52.000Z

I have added the script here. Thank you!

Answer 4 · 2023-05-11T18:14:15.000Z

@anas-awadalla Presently the script you wrote only allows for 2 inputs as arguments (image_shards and doc_shards). Will you be modifying the script soon to allow for CLIP feature shards rather than image_shards?
Thanks!

Answer 5 · 2023-05-11T19:27:51.000Z

The CLIP features are not suitable for training Flamingo models, so for now I will be keeping it as is. My suggested workflow would be to download the raw images using this script and then convert those to WebDataset shards.
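
For readers who want to see what the conversion step of this workflow involves, here is a minimal sketch using only the Python standard library. It is not the script referenced in this thread. It assumes each jsonl line is a document with an `image_info` list whose entries carry an `image_name`, and that the raw images were already downloaded into a single directory. The function name `jsonl_to_wds_shard` and the `image_base64` field are hypothetical choices for illustration. A WebDataset shard is an ordinary tar file in which files sharing a basename form one sample, so `tarfile` is enough to write one.

```python
import base64
import io
import json
import tarfile
from pathlib import Path


def jsonl_to_wds_shard(jsonl_path, image_dir, tar_path):
    """Write one WebDataset-style tar shard from one jsonl document shard.

    Hypothetical helper: field names ("image_info", "image_name") are
    assumptions about the jsonl layout. Returns the number of documents written.
    """
    image_dir = Path(image_dir)
    written = 0
    with open(jsonl_path, "r", encoding="utf-8") as src, \
            tarfile.open(tar_path, "w") as tar:
        for line in src:
            line = line.strip()
            if not line:
                continue
            doc = json.loads(line)

            # Keep only images that exist on disk (some downloads may fail).
            kept = []
            for info in doc.get("image_info", []):
                image_path = image_dir / info["image_name"]
                if not image_path.is_file():
                    continue
                info = dict(info)
                # Embed the raw image bytes in the JSON as base64 text.
                info["image_base64"] = base64.b64encode(
                    image_path.read_bytes()
                ).decode("ascii")
                kept.append(info)

            if not kept:
                continue  # skip documents left with no images

            doc["image_info"] = kept
            payload = json.dumps(doc).encode("utf-8")

            # One tar member per document; the basename is the sample key.
            member = tarfile.TarInfo(name=f"{written:09d}.json")
            member.size = len(payload)
            tar.addfile(member, io.BytesIO(payload))
            written += 1
    return written
```

Calling `jsonl_to_wds_shard("docs.jsonl", "images/", "shard.tar")` would produce a single shard. For large datasets, the `webdataset` package's `ShardWriter` can handle splitting samples across multiple tar files; whether the resulting layout matches what a given training loader expects should be checked against that loader.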

Answer 6 · 2023-05-16T21:42:04.000Z

Hi @anas-awadalla, could you help me understand more why the CLIP features for mmc4 (downloadable from https://storage.googleapis.com/ai2-jackh-mmc4-public/images/clip_vitl14_shard_{$SHARD}_features.pkl) are unable to be used for training even though they were (I assume) the same CLIP features you used to train the Open Flamingo 9B vision encoder?

Answer 7 · 2023-05-18T19:45:20.000Z

Yep. First, I apologize for the confusion regarding the CLIP embeddings (I think I mentioned in an OpenFlamingo issue that they could be used to train Flamingo models). This was a misunderstanding on my end. What you need in order to create the image tokens for Flamingo are the patch embeddings from CLIP's vision encoder. The embeddings in mmc4, however, are the projection vector of the image into the multimodal space.

One thing I want to point out is that we do not train any vision encoder and instead use this pre-trained CLIP model.
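
To make the difference between patch embeddings and the projected vector concrete, here is a small shape calculation. The default numbers are assumptions: they are the published configuration of CLIP ViT-L/14 at 224 px input (patch size 14, transformer width 1024, joint embedding size 768), chosen because the feature file names in this thread contain `clip_vitl14`. The function name is hypothetical, and whether a class token is kept alongside the patches depends on the implementation.

```python
def clip_vit_output_shapes(image_size=224, patch_size=14, width=1024, embed_dim=768):
    """Return (patch_embedding_shape, projected_embedding_shape) for one image.

    Defaults are the assumed ViT-L/14 configuration; the class token is
    not counted among the patch embeddings here.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2

    # Patch embeddings: one vector per image patch, taken from the
    # vision transformer before any pooling or projection.
    patch_shape = (num_patches, width)

    # Projected embedding: a single pooled vector mapped into the
    # joint image-text space.
    projected_shape = (embed_dim,)
    return patch_shape, projected_shape


patch_shape, projected_shape = clip_vit_output_shapes()
print(patch_shape)      # (256, 1024)
print(projected_shape)  # (768,)
```

Under these assumptions, the projected vector is a single pooled summary per image, while the patch embeddings form a sequence of 256 vectors, and the per-patch sequence cannot be recovered from the pooled vector.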

Answer 8 · 2023-06-12T21:58:17.000Z

closing this as addressed, feel free to re-open if I'm misreading