sshh12/multi_token

Thank you for posting this!


I'm still learning how these systems work, and I stumbled upon LLaVA recently and learned how it essentially trains by concatenating text embeddings + CLIP embeddings. My first thought was: doesn't this mean that anything that can be described and annotated can be trained with an LLM?
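For anyone skimming, here's roughly what that concatenation looks like (a minimal sketch with made-up shapes and module names, not this repo's actual code):

```python
# LLaVA-style idea: project CLIP image features into the LLM's embedding space,
# then concatenate them with the text token embeddings so the LLM attends over both.
# Shapes below are illustrative assumptions.
import torch
import torch.nn as nn

llm_hidden_size = 4096      # e.g. a Llama-7B hidden size (assumption)
clip_feature_size = 1024    # e.g. CLIP ViT-L/14 patch feature size (assumption)

# A small projector maps image features to "pseudo token" embeddings.
projector = nn.Linear(clip_feature_size, llm_hidden_size)

# Stand-ins for real model outputs.
text_embeds = torch.randn(1, 32, llm_hidden_size)    # embeddings of 32 prompt tokens
clip_feats = torch.randn(1, 256, clip_feature_size)  # 256 CLIP patch features for one image

image_embeds = projector(clip_feats)                 # (1, 256, llm_hidden_size)

# The combined sequence is fed to the LLM exactly like normal token embeddings
# (e.g. via `inputs_embeds=` in Hugging Face models); training then just
# optimizes next-token loss on the caption/label text.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 288, llm_hidden_size)
print(inputs_embeds.shape)
```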

Thank you, this is a huge learning experience you've provided.

Now to find an image -> React + CSS dataset to train it on

doesn't this mean that anything that can be described and annotated can be trained with an LLM?

Pretty much! If you're interested, I wrote a blog post where I talk about some of the other cool things one could do with this: https://blog.sshh.io/p/large-multimodal-models-lmms

Now to find an image -> React + CSS dataset to train it on

This would totally work in theory. The only reason I haven't done it is that "React + CSS" outputs are fairly context-window-intensive, so training would require a GPU with a lot of VRAM. And of course you need a dataset.
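For a sense of scale, here's roughly what a single training record for that could look like (purely hypothetical field names and content, not this repo's actual dataset schema); the long React + CSS target text is what eats the context window:

```python
# Hypothetical image -> React + CSS training record (illustrative only).
import json

record = {
    "image": "screenshots/login_form.png",
    "prompt": "Write the React component and CSS that reproduce this screenshot.",
    "output": (
        "export function LoginForm() {\n"
        "  return (\n"
        "    <form className=\"login-form\">\n"
        "      <input type=\"email\" placeholder=\"Email\" />\n"
        "      <input type=\"password\" placeholder=\"Password\" />\n"
        "      <button type=\"submit\">Sign in</button>\n"
        "    </form>\n"
        "  );\n"
        "}\n"
        "/* login.css */\n"
        ".login-form { display: flex; flex-direction: column; gap: 8px; }\n"
    ),
}
print(json.dumps(record, indent=2))
```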

Do you have the video + caption dataset available anywhere?