mudler/LocalAI

could not load model - all backends returned error

aaron13100 opened this issue · 3 comments

LocalAI version:

According to git, the last commit is from Sun Sep 3 02:38:52 2023 -0700 and says "added Linux Mint".

Environment, CPU architecture, OS, and Version:

Linux instance-7 6.2.0-1013-gcp #13-Ubuntu SMP Tue Aug 29 23:07:20 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
I gave the VM 8 cores and 64 GB of RAM. Ubuntu 23.04.

Describe the bug

To Reproduce

I tried to specify the model at https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main. The model does appear in the output of curl http://localhost:8080/models/available and does start downloading that way, but the download didn't complete, so I downloaded the file separately and placed it in the /models directory.
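One way to sanity-check that the manual download is complete (assuming the models directory lives on the host and is mounted into the container; the expected size and SHA256 are shown on the Hugging Face file page):

# compare the local file against the Hugging Face file page
ls -lh models/llama-2-70b-chat.ggmlv3.q5_K_M.bin      # size should match the listing
sha256sum models/llama-2-70b-chat.ggmlv3.q5_K_M.bin   # hash should match the page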

I then used

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-70b-chat.ggmlv3.q5_K_M.bin",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

but get an error instead of a response. I also tried

  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q5_K_M.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

and

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "huggingface@TheBloke/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q5_K_M.bin"
   }'  
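If I'm reading the docs right, /models/apply is asynchronous and replies with a job uuid that can be polled until the download finishes (the uuid below is just a placeholder taken from the apply response):

# /models/apply responds with something like {"uuid":"...","status":"..."}
# poll that job until it reports the download as processed
curl $LOCALAI/models/jobs/<uuid-from-apply-response>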

Expected behavior

An answer from the model rather than an error message.

Logs

Client side:

{"error":{"code":500,"message":"could not load model - all backends returned error: 24 errors occurred:
	* could not load model: rpc error: code = Unknown desc = failed loading model
	* could not load model: rpc error: code = Unknown desc = failed loading model
	(...repeats 14 times...)
	* could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = unsupported model type /build/models/llama-2-70b-chat

The file does exist. I added symbolic links at build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin and /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin, and the errors at the end changed a bit:

	* could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = stat /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory
	* could not load model: rpc error: code = Unknown desc = unsupported model type /build/models/llama-2-70b-chat.ggmlv3.q5_K_M.bin (should end with .onnx)
	* backend unsupported: /build/extra/grpc/huggingface/huggingface.py
	* backend unsupported: /build/extra/grpc/autogptq/autogptq.py
	* backend unsupported: /build/extra/grpc/bark/ttsbark.py
	* backend unsupported: /build/extra/grpc/diffusers/backend_diffusers.py
	* backend unsupported: /build/extra/grpc/exllama/exllama.py
	* backend unsupported: /build/extra/grpc/vall-e-x/ttsvalle.py

Server side:
The log is quite long and I'm not sure what to include, but it looks like it tries various backends to load the model and they all fail.
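Since several of the client-side errors above point at /build/models, maybe it is worth checking what the container itself sees (a sketch; the container name is a placeholder for whatever the LocalAI container is called here):

# list both candidate model paths inside the running container
docker exec -it <localai-container> ls -l /models /build/models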

Etc
Maybe there's a different file/format I'm supposed to use?
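From the docs it also looks like a per-model YAML next to the file can pin a single backend instead of letting every backend be tried. Something like the sketch below, though I have not verified the exact field names or which backend is right for a 70B GGML file:

# hypothetical models/llama-2-70b-chat.yaml; field names taken from the docs, unverified
cat > models/llama-2-70b-chat.yaml <<'EOF'
name: llama-2-70b-chat
backend: llama
parameters:
  model: llama-2-70b-chat.ggmlv3.q5_K_M.bin
context_size: 4096
EOF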

It does load and run the example from the docs, wizardlm-13b-v1.0-superhot-8k.ggmlv3.q4_K_M.bin.
Thanks.

Mafyuh commented

I'm having the same problem on Ubuntu 23.04: the exact same issue where the gallery download didn't complete, and I'm getting all the same RPC errors as you. I disabled ufw and reloaded the container; the model loaded and is receiving requests, but it doesn't respond to anything. This happens even with the ggml-gpt4all-j model from the getting-started docs. I have tried multiple llama-2-7b-chat.ggmlv3 variants as well, all with the same result.
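For reference, this is roughly what I ran before the model loaded (commands from memory; the container name is a guess and may differ in your setup):

sudo ufw status verbose   # firewall was active
sudo ufw disable
docker restart local-ai   # use whatever your LocalAI container is called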

Here are the logs from when I managed to get gpt4all-j loaded but it didn't respond to any requests, including some of the RPC errors:

[gptneox] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
1:26AM DBG [bert-embeddings] Attempting to load
1:26AM DBG Loading model bert-embeddings from ggml-gpt4all-j
1:26AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
1:26AM DBG Loading GRPC Model bert-embeddings: {backendString:bert-embeddings model:ggml-gpt4all-j threads:2 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000020180 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
1:26AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/bert-embeddings
1:26AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:40785'
1:26AM DBG GRPC Service state dir: /tmp/go-processmanager2855979786
1:26AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40785: connect: connection refused"
1:26AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:40785): stderr 2023/09/13 01:26:42 gRPC Server listening at 127.0.0.1:40785
1:26AM DBG GRPC Service Ready
1:26AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:2 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false AudioPath:}
[127.0.0.1]:33128  200  -  GET      /readyz
[127.0.0.1]:55076  200  -  GET      /readyz
[127.0.0.1]:39038  200  -  GET      /readyz
[127.0.0.1]:41112  200  -  GET      /readyz
1:30AM DBG Request received: 
1:30AM DBG Configuration read: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:2 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
1:30AM DBG Parameters: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:2 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
1:30AM DBG Prompt (before templating): How are you?
[127.0.0.1]:57584  200  -  GET      /readyz

EDIT: It is something with Ubuntu or Linux. I followed the exact same setup on Windows 11 and it runs fine with the same model (llama-2-7b-chat.ggmlv3.q4_K_M.bin), the same GPU, and the same install steps.

Aisuko commented

Hi, guys. Thanks for your feedback. @aaron13100, the issue may be that the model file is incomplete. I can see the service cannot load the model (llama-2-70b-chat.ggmlv3.q5_K_M.bin: no such file or directory); you may have downloaded it to the correct path, but it may not have been loaded into memory correctly. The other log lines also complain about the model format (e.g. the backend that expects a .onnx file).

I suggest you first run a test with an easy example from the gallery. Once that works, you can try custom models.
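For example, something along these lines (the gallery id here assumes the default model gallery from the getting-started docs; check curl http://localhost:8080/v1/models afterwards for the exact installed name):

# install a small known-good model from the gallery, then query it
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@gpt4all-j"
   }'

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt4all-j",
     "messages": [{"role": "user", "content": "How are you?"}]
   }'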

@Mafyuh, from the log it looks like everything goes well. Are you using a GPU on Ubuntu? If it is CPU only, how long did you wait for the request?

Aisuko commented

As for the content of the log: I know it may be a little confusing. Here is a related issue: #1076.