NielsRogge/Transformers-Tutorials

Img2Txt model fine-tuning with huge captions


I want to create a model that takes a screenshot of a web front page and outputs the corresponding HTML and JS code. As you can tell, the "input_ids" for the target text will be very long (more than 4096 tokens).
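To give a sense of the scale, here is a rough sketch of how quickly page markup passes that length. It uses a crude regex split as a stand-in for a real subword tokenizer, so the counts are only indicative; `rough_token_count`, `exceeds_budget`, and the sample page are hypothetical names and data made up for illustration.

```python
import re

MAX_TOKENS = 4096  # the sequence length mentioned above


def rough_token_count(text: str) -> int:
    """Crude estimate: each run of word characters and each single
    punctuation character counts as one token. A real subword
    tokenizer will produce a different number."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def exceeds_budget(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the rough token estimate is above max_tokens."""
    return rough_token_count(text) > max_tokens


# Hypothetical page: a list with 500 near-identical items.
sample_html = (
    "<html><body><ul>"
    + '<li class="item">Entry</li>' * 500
    + "</ul></body></html>"
)

if __name__ == "__main__":
    print(rough_token_count(sample_html), exceeds_budget(sample_html))
```

Even this simple page goes past the limit under the crude estimate, and a real front page with inline JS would be much longer.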

I was thinking of fine-tuning a Blip2 model, but how can I train it efficiently with target sequences this long?
Thanks