llama.cpp server / embeddings broken
skadefro opened this issue · 10 comments
Hey,
I had an older git clone of llama.cpp, and your integration with the llama.cpp server was working perfectly. I cloned the latest version onto a new server but kept getting an 'Invalid JSON response' error:
RetryError: Failed after 1 attempt(s) with non-retryable error: 'Invalid JSON response'
at _retryWithExponentialBackoff (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/core/api/retryWithExponentialBackoff.cjs:42:15)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/embed/embed.cjs:44:31)
at async runSafe (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/util/runSafe.cjs:6:35)
at async executeStandardCall (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/executeStandardCall.cjs:45:20) {
errors: [
ApiCallError: Invalid JSON response
at /mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/core/api/postToApi.cjs:8:15
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
... 6 lines matching cause stack trace ...
at async executeStandardCall (/mnt/data/vscode/config/workspace/ai/jsagent/node_modules/modelfusion/model-function/executeStandardCall.cjs:45:20) {
url: 'http://10.0.0.100:8080/embedding',
requestBodyValues: [Object],
statusCode: 200,
cause: [ZodError],
isRetryable: false
}
],
reason: 'errorNotRetryable'
}
After testing different git commits of llama.cpp, I found that cb33f43a2a9f5a5a5f8d290dd97c625d9ba97a2f was one of the last ones that still worked (so the breaking change is in one of the commits around that one).
I know they have an issue open about implementing a new API, but as far as I can tell it has not been merged yet, so I hope it's a simple fix to get this module to handle whatever they changed within the last two weeks. (Just a "nice to have": they keep tweaking things and improving it, so it would be nice to be able to use the latest version.)
For anyone else having issues with that, you can go back to that version with
git checkout cb33f43a2a9f5a5a5f8d290dd97c625d9ba97a2f
@skadefro thanks for letting me know. I'll take a look. I think llama.cpp has also added grammar and image support in their API, so that should be fun to explore.
Ooh, that would be a dream come true: an easy-to-use framework that supports OpenAI, llama.cpp, and whisper.cpp for chat, embeddings, images, and voice.
@skadefro I looked into whisper.cpp a little bit. Do you know if there are any projects that let you spin up whisper.cpp as a server?
For someone who just wants a web interface, Whisper-WebUI has worked fine for me, but I have not found any good APIs for hosting it (there are a few gists lying around). I have seen at least two attempts on whisper.cpp's issue tracker, and it looks like this one is coming out soon, which is promising. I had it on my to-look-into list but haven't had time yet; if the issue above turns into the "official" API, I would probably bet on that.
Speaking of new things: I just came across Ollama. For a die-hard Docker/Kubernetes fan like me, it looks very promising.
@skadefro I just tried out the latest llama.cpp (commit hash d9b33fe95bd257b36c84ee5769cc048230067d6f) with ModelFusion, and it works for me. Have you started the llama.cpp server, e.g. like this? (You need to have the model available.)

./server -m models/llama-2-7b-chat.GGUF.q4_0.bin -c 4096

If you encounter the error with the llama.cpp server, could you provide more details about your setup?
Re Ollama: it's been on my list for a while; I just need to get around to adding it. It seems simpler to use than llama.cpp and could be a good alternative.
Ah, sorry, I forgot to mention which endpoint: it is the /embedding endpoint. I'm running the server with this:
./server -m models/llama-7b/llama-2-7b-chat.Q4_0.gguf -c 2048 --host 10.0.0.161
and testing with this:
import { LlamaCppApiConfiguration, LlamaCppTextEmbeddingModel, embedMany } from "modelfusion";

const Llamaapi = new LlamaCppApiConfiguration({
  baseUrl: "http://10.0.0.161:8080",
});

const embeddings = await embedMany(
  new LlamaCppTextEmbeddingModel({ api: Llamaapi }),
  [
    "At first, Nox didn't know what to do with the pup.",
    "He keenly observed and absorbed everything around him, from the birds in the sky to the trees in the forest.",
  ]
);

console.log(embeddings);
This fails if the server is built from the latest version of master, but works if I check out a commit that is around two weeks old.
Meanwhile, this still works just fine on the latest version:
const text = await generateText(
  new LlamaCppTextGenerationModel({ api: Llamaapi }),
  "Write a short story about a robot learning to love:\n\n"
);
console.log(text);
@skadefro just shipped v0.55.0 with Ollama text generation & streaming support: https://github.com/lgrammel/modelfusion/releases/tag/v0.55.0
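For reference, a minimal sketch of what the new Ollama support might look like in code, based on the release notes (the model name "llama2" and a locally running Ollama server on its default port are assumptions; check the release docs for the actual API):

import { generateText, OllamaTextGenerationModel } from "modelfusion";

// Assumes a local Ollama server and a model pulled beforehand,
// e.g. with `ollama pull llama2` (hypothetical model choice).
const text = await generateText(
  new OllamaTextGenerationModel({ model: "llama2" }),
  "Write a short story about a robot learning to love:\n\n"
);
console.log(text);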
Wow, that was fast ... Tested and it works 😍 Thank you very much.
This seems to be related to parallelization: several calls are made in parallel, and one is rejected by llama.cpp because there are no free slots. Instead of an embedding, the error {"content":"slot unavailable"} is returned. (A sketch of a possible client-side workaround follows the log below.)
❯ npx ts-node src/model-provider/llamacpp/llamacpp-embed-many-example.ts
{"content":"slot unavailable"}
RetryError: Failed after 1 attempt(s) with non-retryable error: 'Failed to process successful response'
at _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:42:15)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/Users/lgrammel/repositories/modelfusion/dist/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/Users/lgrammel/repositories/modelfusion/dist/model-function/embed/embed.cjs:44:31)
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35)
at async executeStandardCall (/Users/lgrammel/repositories/modelfusion/dist/model-function/executeStandardCall.cjs:45:20)
at async main (/Users/lgrammel/repositories/modelfusion/examples/basic/src/model-provider/llamacpp/llamacpp-embed-many-example.ts:4:22) {
errors: [
ApiCallError: Failed to process successful response
at postToApi (/Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:94:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:18:16)
... 3 lines matching cause stack trace ...
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35)
at async executeStandardCall (/Users/lgrammel/repositories/modelfusion/dist/model-function/executeStandardCall.cjs:45:20)
at async main (/Users/lgrammel/repositories/modelfusion/examples/basic/src/model-provider/llamacpp/llamacpp-embed-many-example.ts:4:22) {
url: 'http://127.0.0.1:8080/embedding',
requestBodyValues: [Object],
statusCode: 200,
cause: TypeError: Body is unusable
at specConsumeBody (node:internal/deps/undici/undici:4712:15)
at _Response.json (node:internal/deps/undici/undici:4614:18)
at /Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:7:66
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async postToApi (/Users/lgrammel/repositories/modelfusion/dist/core/api/postToApi.cjs:82:20)
at async _retryWithExponentialBackoff (/Users/lgrammel/repositories/modelfusion/dist/core/api/retryWithExponentialBackoff.cjs:18:16)
at async LlamaCppTextEmbeddingModel.doEmbedValues (/Users/lgrammel/repositories/modelfusion/dist/model-provider/llamacpp/LlamaCppTextEmbeddingModel.cjs:73:26)
at async Promise.all (index 1)
at async generateResponse (/Users/lgrammel/repositories/modelfusion/dist/model-function/embed/embed.cjs:44:31)
at async runSafe (/Users/lgrammel/repositories/modelfusion/dist/util/runSafe.cjs:6:35),
isRetryable: false
}
],
reason: 'errorNotRetryable'
}
{"embedding":[0.05569692328572273,-0.020548203960061073,0.27377715706825256,0.4976423382759094,0.16579614579677582,0.04679970443248749,0.19974836707115173,0.2295011579990387,-0.15478861331939697,0.3044094145298004,0.024075830355286598,-0.04952937737107277,0.1346544623374939,0.15864624083042145,-0.15292425453662872,-0.04481641948223114,0.07410169392824173,0.16139250993728638,0.013992399908602238,0.0525520034134388,0.17047853767871857,0.14821892976760864,-0.196890190243721,-0.34336787462234497,-0.03041764535009861,0.09776932001113892,0.2469785362482071,0.15258672833442688,-0.14246588945388794,0.03391014412045479,-0.20064757764339447,0.18357722461223602,-0.03650486096739769,-0.09382735937833786,-0.07598888128995895,-0.03402281180024147,-0.047186095267534256,-0.0483274981379509,-0.14382801949977875,0.17244981229305267,0.055998265743255615,-0.0007336181006394327,