This is a multimodal pipeline providing InstructBLIP support for Vicuna-family models running on oobabooga/text-generation-webui.
Clone this repo into your `extensions/multimodal/pipelines` folder, then run the server with the multimodal extension enabled and your preferred pipeline selected via `--multimodal-pipeline`. Load the language model with AutoGPTQ.
```sh
> cd text-generation-webui
> cd extensions/multimodal/pipelines
> git clone https://github.com/kjerk/instructblip-pipeline
> cd ../../../
> python server.py --auto-devices --chat --listen --loader autogptq --multimodal-pipeline instructblip-7b
```
Requirements:
- AutoGPTQ loader (ExLlama is not supported for multimodal)
- No additional dependencies beyond what textgen-webui already provides
| Pipeline + model | VRAM |
| --- | --- |
| instructblip-7b + vicuna-7b | ~6GB |
| instructblip-13b + vicuna-13b | 11GB |
The vanilla Vicuna-7b + InstructBLIP just barely runs on a 24GB GPU using huggingface transformers directly, and the 13b at fp16 is too much. Thanks to optimization efforts and quantized models via AutoGPTQ, InstructBLIP and Vicuna can run comfortably on textgen-webui in 8GB to 12GB of VRAM. 🙌
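For a sense of what that quantized load looks like, here is a minimal sketch using AutoGPTQ's `from_quantized` API directly, outside of textgen-webui. The model directory is a placeholder, not a specific release; inside the webui, the `--loader autogptq` flag performs the equivalent load for you.

```python
# Minimal sketch: loading a 4-bit GPTQ Vicuna with AutoGPTQ directly.
# The model directory below is a placeholder, not a specific release.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "path/to/vicuna-13b-v1.1-4bit-128g"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",       # the quantized LLM lives on GPU
    use_safetensors=True,  # most GPTQ releases ship .safetensors weights
)

prompt = "A chat between a user and an assistant.\nUSER: Hello!\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```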
Supported `--multimodal-pipeline` values:
- 'instructblip-7b' for the Vicuna-7b family
- 'instructblip-13b' for the Vicuna-13b family
InstructBLIP models:
- instructblip-7b
- instructblip-13b
Quantized Vicuna models:
- vicuna-13b-v1.1-4bit-128g (Standard)
- vicuna-13b-v0-4bit-128g (Outmoded)
- wizard-vicuna-13b-4bit-128g
Due to the already heavy VRAM requirements of the language models, the vision encoder and projector are kept on CPU, where they are still relatively quick, while the Q-Former is moved to GPU for speed.
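A rough sketch of that split is below. It is written against the Hugging Face `transformers` InstructBLIP classes rather than this pipeline's actual code, so the module names (`vision_model`, `language_projection`, `qformer`) and the manual device shuttling are assumptions borrowed from that implementation, not a description of the pipeline's internals.

```python
# Rough sketch of the CPU/GPU split described above, using the Hugging Face
# transformers InstructBLIP implementation as a stand-in for this pipeline.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

MODEL_ID = "Salesforce/instructblip-vicuna-7b"  # fullsize reference weights
GPU = "cuda:0"

processor = InstructBlipProcessor.from_pretrained(MODEL_ID)
model = InstructBlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Keep the vision encoder and projection on CPU, move the Q-Former to GPU.
model.vision_model.to("cpu")
model.language_projection.to("cpu")
model.qformer.to(GPU)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")

with torch.no_grad():
    # 1) Vision encoder runs on CPU.
    image_embeds = model.vision_model(pixel_values=inputs["pixel_values"]).last_hidden_state

    # 2) Q-Former runs on GPU; move its inputs over explicitly.
    image_embeds_gpu = image_embeds.to(GPU)
    image_mask = torch.ones(image_embeds_gpu.shape[:-1], dtype=torch.long, device=GPU)
    query_tokens = model.query_tokens.expand(image_embeds_gpu.shape[0], -1, -1).to(GPU)
    query_mask = torch.ones(query_tokens.shape[:-1], dtype=torch.long, device=GPU)
    qformer_mask = torch.cat([query_mask, inputs["qformer_attention_mask"].to(GPU)], dim=1)
    query_output = model.qformer(
        input_ids=inputs["qformer_input_ids"].to(GPU),
        attention_mask=qformer_mask,
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds_gpu,
        encoder_attention_mask=image_mask,
    ).last_hidden_state[:, : query_tokens.shape[1], :]

    # 3) Projection back on CPU yields image embeddings in the LLM's input
    #    space, which the multimodal extension splices into the prompt.
    image_features = model.language_projection(query_output.to("cpu"))
```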
- /Salesforce (Fullsize reference Vicuna 1.1 models)
- ❔ Allow for GPU inference of the image encoder and projector?
- ❔ Investigate problems caused by multiple image embeddings, and possible remediations.
This pipeline passes through the LAVIS license and is published under the BSD 3-Clause OSS license.