A Python-based WebSocket server for CLI LLaVA inference. The server receives prompts and images from clients and returns the CLI inference results over the socket connection.
This project is a quick and simple implementation of LLaVA (Website, GitHub) CLI-based inference via a Python WebSocket. It is based on LLaVA's own `cli.py`, adding WebSocket capabilities.
When running `python llava-websocket.py`, the checkpoint shards are loaded once and stay cached while the WebSocket server runs. Clients on your local network can then send new prompts and images for inference without reloading the checkpoint shards and without using Gradio.
In its current form, this project is not designed for multi-turn conversation but for one-off prompt processing: each request supplies its own image and prompt. I might add conversation capabilities in the near future.
This project was tested on Ubuntu.
You should follow the LLaVA tutorial first, so that you have the pretrained model / checkpoint shards ready. Then put my script into your LLaVA directory and start it from within the LLaVA conda environment (`conda activate llava`):
python llava-websocket.py [ARGS]
Since this project is based on LLaVA's `cli.py`, the following base arguments can be specified:
--model-path, default="liuhaotian/llava-v1.5-13b"
--model-base, default=None
--device, default="cuda"
--conv-mode, default=None
--temperature, default=0.2
--max-new-tokens, default=512
--load-8bit, action="store_true"
--load-4bit, action="store_true"
--debug, action="store_true"
Additionally, the following arguments were added:
--port, default=1995
--verbose, action="store_true"
--json, action="store_true"
Using `--port [int]`, you can specify your own WebSocket port.
Using `--verbose`, you will receive verbose output on the server-side console (WebSocket connection info, transmitted inference results).
Using `--json`, the WebSocket responses are formatted as JSON, containing a timestamp and the inference result:
{
"time": "11:48:52.632415",
"result": "The image features a wooden pier (...)"
}
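For example, to serve the default model with 4-bit quantization on a custom port, with verbose logging and JSON responses (the flag values here are just illustrative):

python llava-websocket.py --load-4bit --port 8765 --verbose --json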
Within your local network, clients can access the WebSocket via `ws://[your-ip]:1995`. If you started the script with `--port [int]`, use that port instead.
You can probably also make the WebSocket accessible from outside of your LAN via port forwarding.
The WebSocket server waits for a JSON object:
{
"prompt": "Here goes your prompt (e.g., describe this image.)",
"image": "URL to an image file, or BASE64-encoded string of an image file"
}
and returns the written inference result via the socket connection.
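A minimal client sketch, assuming the server runs on the same machine with the default port and that the `websockets` Python package is installed (`pip install websockets`); the file name `view.jpg` and the prompt are placeholders:

```python
# Minimal example client (a sketch, not part of this repo).
# Assumes a server at ws://localhost:1995 and `pip install websockets`.
import asyncio
import base64
import json

import websockets


async def main():
    # Base64-encode a local image; alternatively, pass an image URL directly.
    with open("view.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    async with websockets.connect("ws://localhost:1995") as ws:
        await ws.send(json.dumps({
            "prompt": "Describe this image.",
            "image": image_b64,
        }))
        # Plain text by default; a JSON object if the server was started with --json.
        print(await ws.recv())


asyncio.run(main())
```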
Demo video: Demo.LLaVA.WebSocket.mp4
The input image is LLaVA's test image (source: https://llava-vl.github.io/static/images/view.jpg).

This project is a prototype and serves as a basic example of using a WebSocket for LLaVA's CLI inference. Feel free to create a PR.