mostlygeek/llama-swap

Proxy does not set content length.

Closed this issue · 5 comments

When attempting to set up llama-swap with koboldcpp as a backend, I noticed that the proxy does not set the Content-Length header when making HTTP requests to the upstream. It appears that net/http will use chunked encoding when a request is created with a plain io.Reader as the body:

    Host: localhost:8999
    User-Agent: node-fetch
    Transfer-Encoding: chunked
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Authorization: Bearer
    Connection: keep-alive
    Content-Type: application/json

    {
        "frequency_penalty": 0,
        "logit_bias": {},
        "max_tokens": 300,
        "messages": [
            {
                "content": "Hi",
                "role": "user"
            }
        ],
        "model": "koboldcpp/Llama-3.2-1B-Instruct-Q4_K_M",
        "presence_penalty": 0,
        "stream": false,
        "temperature": 1,
        "top_p": 1
    }

 << HTTP/1.0 500 Internal Server Error 136b
    Server: ConcedoLlamaForKoboldServer
    Date: Tue, 19 Nov 2024 00:53:58 GMT
    access-control-allow-origin: *
    access-control-allow-methods: *
    access-control-allow-headers: *, Accept, Content-Type, Content-Length, Cache-Control, Accept-Encoding, X-CSRF-Token, Client-Agent, X-Fields, Content-Type, Authorization, X-Requested-With, X-HTTP-Method-Override, apikey, genkey
    cache-control: no-store
    content-type: application/json

    {
        "detail": {
            "err": "the JSON object must be str, bytes or bytearray, not NoneType",
            "msg": "Error parsing input.",
            "type": "bad_input"
        }
    }

It appears that koboldcpp does not support chunked encoding and requires an explicit Content-Length header:

https://github.com/LostRuins/koboldcpp/blob/fedc3874bd54ad7fd43f55ae52595ffb0144afc4/koboldcpp.py#L1956

I was able to work around this by patching my local copy of llama-swap to copy the request body into an in-memory buffer, so its length is known, before passing it to the new request. I am not very experienced with Go, so I am not sure whether there is a better way to solve this.

What client are you using? If you test it with curl, does it still happen with koboldcpp? The proxy code is rather naive and just forwards the same headers it received.

I don’t think it would be too difficult to de-chunk the request before sending it upstream.

I was testing with SillyTavern as a front end. The problem appears to be the request from the proxy to the backend.

The request when using SillyTavern->kobold directly:

accept: */*
accept-encoding: gzip, deflate, br
authorization: Bearer
content-length: 211
content-type: application/json
user-agent: node-fetch
Host: localhost:8999
Connection: keep-alive

{
	"frequency_penalty": 0,
	"logit_bias": {},
	"max_tokens": 300,
	"messages": [
		{
			"content": "Hi",
			"role": "user"
		}
	],
	"model": "koboldcpp/Llama-3.2-1B-Instruct-Q4_K_M",
	"presence_penalty": 0,
	"stream": false,
	"temperature": 1,
	"top_p": 1
}

The request from the proxy->kobold:

Host: localhost:8999
User-Agent: node-fetch
Transfer-Encoding: chunked
Accept: */*
Accept-Encoding: gzip, deflate, br
Authorization: Bearer
Connection: keep-alive
Content-Type: application/json

{
	"frequency_penalty": 0,
	"logit_bias": {},
	"max_tokens": 300,
	"messages": [
		{
			"content": "Hi",
			"role": "user"
		}
	],
	"model": "koboldcpp/Llama-3.2-1B-Instruct-Q4_K_M",
	"presence_penalty": 0,
	"stream": false,
	"temperature": 1,
	"top_p": 1
}

I do not believe the issue is on the receiving side, but rather on the sending side of the proxy. From what I can tell, when the request is created with an io.Reader, the size is not known ahead of time, and net/http will use chunked encoding.

It appears that r.Body from the incoming request is passed straight through as an io.Reader, so the outgoing request has no known length:

https://github.com/mostlygeek/llama-swap/blob/main/proxy/process.go#L193
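
For illustration, here is a minimal standalone example (not llama-swap code; the URL and body are made up) of the net/http behaviour: http.NewRequest only knows the body length for a few concrete types such as *bytes.Buffer, *bytes.Reader and *strings.Reader. Anything else, including the r.Body of an incoming request, leaves ContentLength at zero, and the transport falls back to Transfer-Encoding: chunked.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "strings"
    )

    func main() {
        body := `{"model":"koboldcpp/Llama-3.2-1B-Instruct-Q4_K_M"}`

        // *strings.Reader (like *bytes.Buffer and *bytes.Reader) has a known
        // length, so ContentLength is set and the request is not chunked.
        req1, _ := http.NewRequest("POST", "http://localhost:8999/v1/chat/completions", strings.NewReader(body))
        fmt.Println(req1.ContentLength) // prints the body length (50)

        // Wrapping the same data in a generic io.Reader hides the length;
        // ContentLength stays 0 and the transport sends the body with
        // Transfer-Encoding: chunked, as in the capture above.
        req2, _ := http.NewRequest("POST", "http://localhost:8999/v1/chat/completions", io.NopCloser(strings.NewReader(body)))
        fmt.Println(req2.ContentLength) // 0: length unknown, chunked encoding
    }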

I was able to work around this with a quick hack like this:

--- a/proxy/process.go
+++ b/proxy/process.go
@@ -12,6 +12,7 @@ import (
        "sync"
        "syscall"
        "time"
+       "bytes"
 )
 
 type Process struct {
@@ -190,12 +191,22 @@ func (p *Process) defaultProxyHandler(w http.ResponseWriter, r *http.Request) {
 
        proxyTo := p.config.Proxy
        client := &http.Client{}
-       req, err := http.NewRequest(r.Method, proxyTo+r.URL.String(), r.Body)
+
+       bbuf := &bytes.Buffer{}
+       nRead, err := io.Copy(bbuf, r.Body)
+       if err != nil {
+               fmt.Println(err)
+       }
+
+       req, err := http.NewRequest(r.Method, proxyTo+r.URL.String(), bbuf)
        if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
        }
+
        req.Header = r.Header
+       req.ContentLength = nRead
+
        resp, err := client.Do(req)
        if err != nil {
                http.Error(w, err.Error(), http.StatusBadGateway)

I am not sure whether forcing the body into a buffer of known size, so the request carries an explicit Content-Length, is the right approach here. It might also make sense to fix koboldcpp to support chunked encoding. I submitted a PR for that here:

LostRuins/koboldcpp#1226

Thanks for doing such deep investigation!

I did some testing with the curl command below, and from what I can see the chunked encoding propagates all the way through to llama.cpp:

curl http://localhost:8080/v1/chat/completions -N \
     -H "Transfer-Encoding: chunked" \
     -H "Content-Type: application/json" \
     --data-binary '{"messages":[{"role":"user","content":"write a short story"}],"stream":false, "model":"llama","max_tokens":100}'

Since llama-swap has to read all of the body bytes to detect the model anyway, it can send the request to the upstream without using Transfer-Encoding: chunked. This is the preferred behaviour.
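
Roughly, the idea is something like this (a simplified sketch, not the exact code in the commit; the handler name and JSON field here are just for illustration):

    package proxy

    import (
        "bytes"
        "encoding/json"
        "io"
        "net/http"
    )

    // proxyRequest is a sketch of the buffering approach, not the actual
    // llama-swap handler.
    func proxyRequest(w http.ResponseWriter, r *http.Request, proxyTo string) {
        // The full body is needed anyway to read the "model" field.
        bodyBytes, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        var payload struct {
            Model string `json:"model"`
        }
        // The model name is what drives the swapping; the error is ignored
        // in this sketch.
        _ = json.Unmarshal(bodyBytes, &payload)

        // *bytes.Reader has a known length, so net/http sets Content-Length
        // on the upstream request instead of using Transfer-Encoding: chunked.
        req, err := http.NewRequest(r.Method, proxyTo+r.URL.String(), bytes.NewReader(bodyBytes))
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        req.Header = r.Header.Clone()

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()

        // Relay the upstream response back to the client.
        for k, v := range resp.Header {
            w.Header()[k] = v
        }
        w.WriteHeader(resp.StatusCode)
        io.Copy(w, resp.Body)
    }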

I pushed the changes to the main branch. Give it a try and see if it improves the situation.

The change appears to be working ok for me. I will test some more. Thanks for attacking this so quickly.

Thank you for testing it. Let me know if you discover any issues.