a simple multi-modal vision language model that describes an image with only keywords
- image dataset preparation ☑
- text dataset preparation ◻︎
- nano language model ◻︎
- OpenCLIP ViT-B/32 projection layer ◻︎
- supervised vs instruction fine tuning ◻︎
- usage examples ◻︎
- export to ONNX ◻︎
- add WASM for JavaScript support ◻︎
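
The projection layer item in the list above could look roughly like the sketch below: a single linear map that turns one 512-dimensional ViT-B/32 image embedding into a few prefix vectors sized for the language model. This is not the project's implementation. `MODEL_DIM`, `NUM_PREFIX`, `init_projection`, and `project` are hypothetical names and values chosen for illustration, the weights are random rather than trained, and plain Python lists stand in for tensors so the example runs without extra dependencies.

```python
import random

CLIP_DIM = 512    # image embedding size of CLIP ViT-B/32
MODEL_DIM = 64    # hypothetical hidden size of the nano language model
NUM_PREFIX = 4    # hypothetical number of prefix tokens fed to the language model


def init_projection(in_dim, out_dim, seed=0):
    """Create random weights (out_dim x in_dim) and a zero bias for a linear layer."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keeps the output variance roughly independent of in_dim
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(image_embedding, weight, bias, num_prefix, model_dim):
    """Apply the linear layer, then split the result into num_prefix vectors."""
    flat = [
        sum(w * x for w, x in zip(row, image_embedding)) + b
        for row, b in zip(weight, bias)
    ]
    return [flat[i * model_dim:(i + 1) * model_dim] for i in range(num_prefix)]


weight, bias = init_projection(CLIP_DIM, NUM_PREFIX * MODEL_DIM)

# Stand-in for a real image embedding; an actual one would come from the image encoder.
rng = random.Random(1)
fake_embedding = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]

prefix_tokens = project(fake_embedding, weight, bias, NUM_PREFIX, MODEL_DIM)
print(len(prefix_tokens), len(prefix_tokens[0]))
```

In a setup like this the prefix vectors would be placed in front of the text token embeddings, so the language model can condition its keywords on the image. Whether one linear layer is enough, or a small MLP works better, is a design choice the sketch does not settle.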