Large Model Proxy

Large Model Proxy is designed to make it easy to run multiple resource-heavy Large Models (LMs) on the same machine with a limited amount of VRAM and other resources. It listens on a dedicated port for each proxied LM, making them always available to the clients connecting to these ports.

How it works

Upon receiving a connection on a specific port, if the corresponding LM is not already started, large-model-proxy will:

  1. Verify that the required resources are available to start the corresponding LM.
  2. If resources are not available, automatically stop the least recently used LM to free up contested resources.
  3. Start the LM.
  4. Wait for the LM to become available on the specified port.
  5. Wait for 2 seconds (this will be replaced with support for a proper healthcheck).

It will then proxy the connection between the client and the LM. To the client this should be fully transparent, the only exception being that receiving data on the connection takes longer if the LM had to be spun up first.
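
For example, consider a hypothetical configuration in the format described under Configuration below (the service names, commands, and numbers here are made up for illustration), where two services cannot fit into VRAM at the same time:

{
  "ResourcesAvailable": {
    "VRAM-GPU-1": 24000
  },
  "Services": [
    {
      "Name": "llm-a",
      "ListenPort": "8081",
      "ProxyTargetHost": "localhost",
      "ProxyTargetPort": "18081",
      "Command": "/opt/llm-a/run.sh",
      "ResourceRequirements": {
        "VRAM-GPU-1": 16000
      }
    },
    {
      "Name": "llm-b",
      "ListenPort": "8082",
      "ProxyTargetHost": "localhost",
      "ProxyTargetPort": "18082",
      "Command": "/opt/llm-b/run.sh",
      "ResourceRequirements": {
        "VRAM-GPU-1": 16000
      }
    }
  ]
}

If llm-a is running and a client connects to port 8082, large-model-proxy sees that 16000 + 16000 exceeds the 24000 units of VRAM-GPU-1 available, stops llm-a (the least recently used service), starts llm-b, waits until it is reachable on port 18082, and then proxies the client's connection to it.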

Installation

Ubuntu and Debian: Download the deb file attached to the latest release.

Arch Linux: Install from AUR.

Other Distros:

  1. Install Go

  2. Clone the repository:

    git clone https://github.com/perk11/large-model-proxy.git
  3. Navigate into the project directory:

    cd large-model-proxy
  4. Build the project:

    go build -o large-model-proxy main.go

    or

    make

Configuration

Below is an example config.json:

{
  "MaxTimeToWaitForServiceToCloseConnectionBeforeGivingUpSeconds": 1200,
  "ShutDownAfterInactivitySeconds": 120,
  "ResourcesAvailable": {
     "VRAM-GPU-1": 24000,
     "RAM": 32000
  }, 
  "Services": [
    {
      "Name": "automatic1111",
      "ListenPort": "7860",
      "ProxyTargetHost": "localhost",
      "ProxyTargetPort": "17860",
      "Command": "/opt/stable-diffusion-webui/webui.sh",
      "Args": "--port 17860",
      "WorkDir": "/opt/stable-diffusion-webui", 
      "ShutDownAfterInactivitySeconds": 600,
      "RestartOnConnectionFailure": true,
      "ResourceRequirements": {
        "VRAM-GPU-1": 6000,
        "RAM": 30000
      }
    },
    {
      "Name": "assistant",
      "ListenPort": "8081",
      "ProxyTargetHost": "localhost",
      "ProxyTargetPort": "18081",
      "Command": "/opt/llama.cpp/llama-server",
      "Args": "-m /opt/Gemma-27B-v1_Q4km.gguf -c 8192 -ngl 100 -t 4 --port 18081",
      "HealthcheckCommand": "curl --fail http://localhost:18081/health", 
      "HealthcheckIntervalMilliseconds": 200,
      "RestartOnConnectionFailure": false,
      "ResourceRequirements": {
        "VRAM-GPU-1": 20000,
        "RAM": 3000
      }
    }
  ]
}

This configuration will run automatic1111's Stable Diffusion web UI on port 7860 and llama.cpp on port 8081. large-model-proxy will expect these services to be available on ports 17860 and 18081 once started. It treats up to 24GB of VRAM and 32GB of RAM as available to them. The Stable Diffusion web UI is expected to use up to 6GB of VRAM and 30GB of RAM, while llama.cpp is expected to use up to 20GB of VRAM and 3GB of RAM.

"ResourcesAvailable" can include any resource metrics, CPU cores, multiple VRAM values for multiple GPUs, etc. these values are not checked against actual usage.

Usage

./large-model-proxy -c path/to/config.json

If the -c argument is omitted, large-model-proxy will look for config.json in the current directory.

Logs

Output from each service is logged to a separate file. By default it is written to logs/{name}.log, but this can be overridden by specifying the LogFilePath parameter for each service.
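
For example, the "assistant" service from the configuration above could be given a custom log location (the path below is only an illustration):

{
  "Name": "assistant",
  "ListenPort": "8081",
  "ProxyTargetHost": "localhost",
  "ProxyTargetPort": "18081",
  "Command": "/opt/llama.cpp/llama-server",
  "Args": "-m /opt/Gemma-27B-v1_Q4km.gguf -c 8192 -ngl 100 -t 4 --port 18081",
  "LogFilePath": "/var/log/large-model-proxy/assistant.log",
  "ResourceRequirements": {
    "VRAM-GPU-1": 20000,
    "RAM": 3000
  }
}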