eisneim/nanoVLM

A simple multi-modal vision-language model that describes an image using only keywords.

Python · Apache-2.0

nanoVLM

A simple multi-modal vision-language model that describes an image with only keywords.

!! Currently a WORK IN PROGRESS

Roadmap

image dataset preparation ☑
text dataset preparation ◻︎
nano language model ◻︎
OpenCLIP ViT-B/32 projection layer ◻︎
supervised vs. instruction fine-tuning ◻︎
usage examples ◻︎
export to ONNX ◻︎
add WASM for JavaScript support ◻︎