llama.cpp server / embeddings broken
skadefro opened this issue · 10 comments
Hey,
I had an older git clone of llama.cpp, and your integration with the llama.cpp server was working perfectly. I cloned the latest version onto a new server but kept getting an 'Invalid JSON response' error:
RetryError: Failed after 1 attempt(s) with non-retryable error: 'Invalid JSON response'
at _retryWithExponentialBackoff (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/core/api/retryWithExponentialBackoff.cjs:42:15)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/embed/embed.cjs:44:31)
at async runSafe (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/util/runSafe.cjs:6:35)
at async executeStandardCall (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/executeStandardCall.cjs:45:20) {
errors: [
ApiCallError: Invalid JSON response
at /mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/core/api/postToApi.cjs:8:15
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
... 6 lines matching cause stack trace ...
at async executeStandardCall (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/executeStandardCall.cjs:45:20) {
url: 'http://10.0.0.100:8080/embedding',
requestBodyValues: [Object],
statusCode: 200,
cause: [ZodError],
isRetryable: false
}
],
reason: 'errorNotRetryable'
}
After testing different git commits of llama.cpp, I found that cb33f43a2a9f5a5a5f8d290dd97c625d9ba97a2f was one of the last ones that still worked (so the breaking change is in one of the commits around that one).
I know they have an issue open about implementing a new API, but as far as I can tell it has not been merged yet, so I hope it's a simple fix to get this module to handle whatever they changed within the last two weeks. (Just a "nice to have": they keep tweaking things and improving it, so it would be nice to be able to use the latest version.)
For anyone else having issues with that, you can go back to that version with
git checkout cb33f43a2a9f5a5a5f8d290dd97c625d9ba97a2f
@skadefro thanks for letting me know. I'll take a look. I think llama.cpp has also added grammar and image support in their API, so that should be fun to explore.
Ooh, that would be a dream come true: an easy-to-use framework that supports OpenAI, llama.cpp, and whisper.cpp for chat, embeddings, images, and voice.
@skadefro I looked into whisper.cpp a little bit. Do you know if there are any projects that let you spin up whisper.cpp as a server?
For someone who just wants a web interface, Whisper-WebUI has worked fine for me, but I have not found any good APIs for hosting it (there are a few gists lying around). I have seen at least two attempts on whisper.cpp's issue tracker, and it looks like this one is coming out soon, which is promising. I had it on my to-look-into list but haven't had time yet; if the issue above turns into the "official" API, I would probably bet on that.
Speaking of new things: I just came across Ollama. For a die-hard Docker/Kubernetes fan like me, it looks very promising.
@skadefro I just tried out the latest llama.cpp (commit hash d9b33fe95bd257b36c84ee5769cc048230067d6f) with ModelFusion, and it works for me. Have you started the llama.cpp server, e.g. like this? (You need to have the model available.)

./server -m models/llama-2-7b-chat.GGUF.q4_0.bin -c 4096

If you encounter the error with the llama.cpp server, could you provide more details about your setup?
Re Ollama: it's been on my list for a while; I just need to get around to adding it. It seems simpler to use than llama.cpp and could be a good alternative.
Ah, sorry, I forgot to mention which endpoint: it is the /embedding endpoint. I'm running the server with this:
./server -m models/llama-7b/llama-2-7b-chat.Q4_0.gguf -c 2048 --host 10.0.0.161
and testing with this:
import { LlamaCppApiConfiguration, LlamaCppTextEmbeddingModel, embedMany } from "modelfusion";

const Llamaapi = new LlamaCppApiConfiguration({
  baseUrl: "http://10.0.0.161:8080",
});

const embeddings = await embedMany(
  new LlamaCppTextEmbeddingModel({ api: Llamaapi }),
  [
    "At first, Nox didn't know what to do with the pup.",
    "He keenly observed and absorbed everything around him, from the birds in the sky to the trees in the forest.",
  ]
);

console.log(embeddings);
This fails if the server is built from the latest version of master, but works if I check out a commit that is around two weeks old.
Meanwhile, this still works just fine on the latest version:
const text = await generateText(
  new LlamaCppTextGenerationModel({ api: Llamaapi }),
  "Write a short story about a robot learning to love:\n\n"
);
console.log(text);
@skadefro just shipped v0.55.0 with Ollama text generation & streaming support: https://github.com/lgrammel/modelfusion/releases/tag/v0.55.0
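For reference, a minimal sketch of what the new Ollama support might look like in code, based on the release notes (the model name "llama2" and a locally running Ollama server on its default port are assumptions; check the release docs for the actual API):

import { generateText, OllamaTextGenerationModel } from "modelfusion";

// Assumes a local Ollama server and a model pulled beforehand,
// e.g. with `ollama pull llama2` (hypothetical model choice).
const text = await generateText(
  new OllamaTextGenerationModel({ model: "llama2" }),
  "Write a short story about a robot learning to love:\n\n"
);
console.log(text);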
Wow, that was fast ... Tested and it works 😍 Thank you very much.
This seems to be related to parallelization: several calls are made in parallel, and one is rejected by llama.cpp because there are no free slots. Instead of an embedding, the error {"content":"slot unavailable"} is returned. (A sketch of a possible client-side workaround follows the log below.)
❯ npx ts-node src/model-provider/llamacpp/llamacpp-embed-many-example.ts
{"content":"slot unavailable"}
RetryError: Failed after 1 attempt(s) with non-retryable error: 'Failed to process successful response'
at _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:42:15)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/Users/lgrammel/repositories/modelfusion/dist/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/Users/lgrammel/repositories/modelfusion/dist/model-function/embed/embed.cjs:44:31)
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35)
at async executeStandardCall (/Users/lgrammel/repositories/modelfusion/dist/model-function/executeStandardCall.cjs:45:20)
at async main (/Users/lgrammel/repositories/modelfusion/examples/basic/src/model-provider/llamacpp/llamacpp-embed-many-example.ts:4:22) {
errors: [
ApiCallError: Failed to process successful response
at postToApi (/Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:94:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:18:16)
... 3 lines matching cause stack trace ...
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35)
at async executeStandardCall (/Users/lgrammel/repositories/modelfusion/dist/model-function/executeStandardCall.cjs:45:20)
at async main (/Users/lgrammel/repositories/modelfusion/examples/basic/src/model-provider/llamacpp/llamacpp-embed-many-example.ts:4:22) {
url: 'http://127.0.0.1:8080/embedding',
requestBodyValues: [Object],
statusCode: 200,
cause: TypeError: Body is unusable
at specConsumeBody (node:internal/deps/undici/undici:4712:15)
at _Response.json (node:internal/deps/undici/undici:4614:18)
at /Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:7:66
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async postToApi (/Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:82:20)
at async _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:18:16)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/Users/lgrammel/repositories/modelfusion/dist/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/Users/lgrammel/repositories/modelfusion/dist/model-function/embed/embed.cjs:44:31)
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35),
isRetryable: false
}
],
reason: 'errorNotRetryable'
}
{"embedding":[0.05569692328572273,-0.020548203960061073,0.27377715706825256,0.4976423382759094,0.16579614579677582,0.04679970443248749,0.19974836707115173,0.2295011579990387,-0.15478861331939697,0.3044094145298004,0.024075830355286598,-0.04952937737107277,0.1346544623374939,0.15864624083042145,-0.15292425453662872,-0.04481641948223114,0.07410169392824173,0.16139250993728638,0.013992399908602238,0.0525520034134388,0.17047853767871857,0.14821892976760864,-0.196890190243721,-0.34336787462234497,-0.03041764535009861,0.09776932001113892,0.2469785362482071,0.15258672833442688,-0.14246588945388794,0.03391014412045479,-0.20064757764339447,0.18357722461223602,-0.03650486096739769,-0.09382735937833786,-0.07598888128995895,-0.03402281180024147,-0.047186095267534256,-0.0483274981379509,-0.14382801949977875,0.17244981229305267,0.055998265743255615,-0.0007336181006394327,