sshh12/multi_token

Thank you for posting this!


I'm still learning how these systems work, and I stumbled upon LLaVA recently and learned how it essentially trains by concatenating text embeddings + CLIP embeddings. My first thought was: doesn't this mean that anything that can be described and annotated can be trained with an LLM?
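For anyone skimming, here's roughly what that concatenation looks like (a minimal sketch with made-up shapes and module names, not this repo's actual code):

```python
# LLaVA-style idea: project CLIP image features into the LLM's embedding space,
# then concatenate them with the text token embeddings so the LLM attends over both.
# Shapes below are illustrative assumptions.
import torch
import torch.nn as nn

llm_hidden_size = 4096      # e.g. a Llama-7B hidden size (assumption)
clip_feature_size = 1024    # e.g. CLIP ViT-L/14 patch feature size (assumption)

# A small projector maps image features to "pseudo token" embeddings.
projector = nn.Linear(clip_feature_size, llm_hidden_size)

# Stand-ins for real model outputs.
text_embeds = torch.randn(1, 32, llm_hidden_size)    # embeddings of 32 prompt tokens
clip_feats = torch.randn(1, 256, clip_feature_size)  # 256 CLIP patch features for one image

image_embeds = projector(clip_feats)                 # (1, 256, llm_hidden_size)

# The combined sequence is fed to the LLM exactly like normal token embeddings
# (e.g. via `inputs_embeds=` in Hugging Face models); training then just
# optimizes next-token loss on the caption/label text.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 288, llm_hidden_size)
print(inputs_embeds.shape)
```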

Thank you, this is a huge learning experience you've provided.

Now to find an image -> React + CSS dataset to train it on

doesn't this mean that anything that can be described and annotated can be trained with an LLM?

Pretty much! If you're interested, I wrote a blog post where I talk about some of the other cool things one could do with this: https://blog.sshh.io/p/large-multimodal-models-lmms

Now to find an image -> React + CSS dataset to train it on

This would totally work in theory. The only reason I haven't done it is that "React + CSS" outputs are fairly context-window-intensive, so training would require a GPU with a lot of VRAM. And of course you need a dataset.
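For a sense of scale, here's roughly what a single training record for that could look like (purely hypothetical field names and content, not this repo's actual dataset schema); the long React + CSS target text is what eats the context window:

```python
# Hypothetical image -> React + CSS training record (illustrative only).
import json

record = {
    "image": "screenshots/login_form.png",
    "prompt": "Write the React component and CSS that reproduce this screenshot.",
    "output": (
        "export function LoginForm() {\n"
        "  return (\n"
        "    <form className=\"login-form\">\n"
        "      <input type=\"email\" placeholder=\"Email\" />\n"
        "      <input type=\"password\" placeholder=\"Password\" />\n"
        "      <button type=\"submit\">Sign in</button>\n"
        "    </form>\n"
        "  );\n"
        "}\n"
        "/* login.css */\n"
        ".login-form { display: flex; flex-direction: column; gap: 8px; }\n"
    ),
}
print(json.dumps(record, indent=2))
```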

Do you have the video + caption dataset available anywhere?