Thank you for posting this!
I'm still learning how these systems work. I recently stumbled upon LLaVA and learned that it essentially trains by concatenating text embeddings with CLIP image embeddings. My first thought was: doesn't this mean that anything that can be described and annotated can be trained with an LLM?
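The concatenation idea above can be sketched in a few lines. This is an illustrative toy, not LLaVA's actual code: the dimensions, the single linear projection, and all the arrays here are assumptions, but the shape of the trick is real — project frozen image-encoder features into the LLM's embedding space and prepend them to the text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 768   # assumed CLIP patch-embedding size
LLM_DIM = 4096   # assumed LLM hidden size
N_PATCHES = 16   # visual tokens produced by the image encoder
N_TEXT = 8       # text tokens in the prompt

# Hypothetical frozen CLIP patch embeddings for one image
clip_embeds = rng.normal(size=(N_PATCHES, CLIP_DIM))

# The trainable piece: a projection into the LLM's embedding space
# (LLaVA-style setups use a small learned projector here)
W_proj = rng.normal(size=(CLIP_DIM, LLM_DIM)) * 0.02

# Text token embeddings looked up from the LLM's embedding table
text_embeds = rng.normal(size=(N_TEXT, LLM_DIM))

# Project image features so they look like "tokens" to the LLM...
visual_tokens = clip_embeds @ W_proj  # shape (16, 4096)

# ...and concatenate along the sequence axis: the LLM now consumes
# image tokens followed by text tokens as one sequence
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (24, 4096)
```

Anything you can embed into that shared space and pair with captions can, in principle, be wired into the LLM the same way.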
Thank you, this is a huge learning experience you've provided.
Now to find an image -> React + CSS dataset to train it on.
doesn't this mean that anything that can be described and annotated can be trained with an LLM?
Pretty much! I wrote a blog post, if you're interested, where I talk about some of the other cool things one could do with this: https://blog.sshh.io/p/large-multimodal-models-lmms
Now to find an image -> React + CSS dataset to train it on
This would totally work in theory. The only reason I haven't done it is that "React + CSS" output is fairly context-window-intensive, so training would require a large-VRAM GPU. And of course you'd need a dataset.
Do you have the video + caption dataset available anywhere?