alpaca-lora-7b gives method not found exception
urbien opened this issue · 12 comments
I tried the stableLM-openAssistant example and it works, albeit very slowly on my RTX 3090.
Now trying alpaca and it does not. Perhaps I did something wrong.
My models.toml
[alpaca-lora-7b]
[alpaca-lora-7b.metadata]
owned_by = 'alpaca'
permission = []
description = 'alpaca 7b'
[alpaca-lora-7b.network]
type = 'gRPC'
url = 'localhost:50051'
I rebuilt the docker container as described in examples/alpaca-lora-7b
my query is
curl http://127.0.0.1:30441/chat/completions -H "Content-Type: application/json" -d '{
"model": "alpaca-lora-7b",
"messages": [{"role": "user", "content": "Write us a python program to enumerate from 1 to 10"}],
"temperature": 0.7,
"max_tokens": 256
}'
The error I get:
(python3.9) tadle-325a@ino-client-trdl325a:~/simpleAI$ simple_ai serve --host 127.0.0.1 --port 30441
INFO: Started server process [95638]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30441 (Press CTRL+C to quit)
INFO: 127.0.0.1:41142 - "POST /chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/server.py", line 120, in chat_complete
predictions = llm.chat(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/models.py", line 128, in chat
return chat_client.run(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/api/grpc/chat/client.py", line 51, in run
return get_chatlog(stub, grpc_chatlog)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/api/grpc/chat/client.py", line 15, in get_chatlog
response = stub.Chat(chatlog)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/grpc/_channel.py", line 1030, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "Method not found!"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-08T03:11:30.015391213+00:00", grpc_status:12, grpc_message:"Method not found!"}"
>
INFO: 127.0.0.1:55194 - "GET /models HTTP/1.1" 200 OK
Hey,
First, thanks for the interest in this project and for giving it a try. :)
stableLM-openAssistant and it works, albeit very very slowly on my rtx 3090.
Same feedback here with the same GPU, unfortunately. I wonder if there is a way to optimize the model a bit for inference, as it felt slower than other models of comparable size (including the Alpaca one you've been trying).
Regarding the Alpaca example: your curl query is using the /chat/completions endpoint, but this is an instruction-following model, so the example uses the /edits endpoint instead.
Modifying your query to something like:
curl http://127.0.0.1:30441/edits \
-H "Content-Type: application/json" \
-d '{
"model": "alpaca-lora-7b",
"instruction": "Write us a python program to enumerate from 1 to 10"
}'
should work.
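For reference, here is the same call built from Python with only the standard library. This is a sketch based on my local setup: the host/port and the /edits payload shape ("instruction" instead of "messages") come from this thread, not from any official simple_ai client.

```python
import json
from urllib import request

def build_edit_request(base_url="http://127.0.0.1:30441"):
    """Build a POST to the /edits endpoint; note the "instruction" field, not "messages"."""
    payload = {
        "model": "alpaca-lora-7b",
        "instruction": "Write us a python program to enumerate from 1 to 10",
    }
    return request.Request(
        f"{base_url}/edits",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Actually sending it requires the simple_ai server (and its gRPC model backend) to be up:
#   with request.urlopen(build_edit_request()) as resp:
#       print(resp.read().decode())
```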
and it did! Thank you!
while I have your attention: I had to change line 33 in main.py to add type=int
As for speed, the alpaca-lora-7b model runs faster than the stableLM model due to LoRA optimization. This model looks promising for fun testing:
https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-4bit
But among the open-source models, it seems MPT-7B is the best today. Here is its quantized version:
https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g
My Python is still not good enough to venture into creating an adapter for those models as you describe in your blog (I spent most of my engineering time in C, Java, and JavaScript). I would be very happy to test, though, if you whip something up.
Nice!
But among the open-source models, MPT-7B is the best today it seems. Here is its quantized version
https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g
I had the MPT models on my radar and wanted to try a 4-bit quantized version (Llama.cpp with 4-bit quantization was fast even on CPU), but didn't know one existed yet. I will probably give it a go soon, thanks!
As to the speed, alpaca-lora-7b model works faster than the stableLM model due to LoRA optimization.
From my limited understanding of LoRA and the paper, it makes the fine-tuning process efficient and fast, but gives no speedup at inference. Happy to be corrected here.
while I have your attention, I had to change line 33 in main.py to add type=int
Do you mean changing in src/simple_ai/__main__.py:
serving_parser.add_argument("--port", default=8080)
To:
serving_parser.add_argument("--port", default=8080, type=int)
?
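For context on why that change matters: argparse applies `type` only to values supplied on the command line, so without `type=int` the default stays an `int` but `--port 30441` arrives as the string `"30441"`. A minimal repro (the `--port-fixed` flag is just for side-by-side illustration, it isn't in simple_ai):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--port", default=8080)                # no type: CLI value stays a str
parser.add_argument("--port-fixed", default=8080, type=int)  # CLI value coerced to int

args = parser.parse_args(["--port", "30441", "--port-fixed", "30441"])
print(type(args.port).__name__, type(args.port_fixed).__name__)  # str int
```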
awesome, give me a shout as soon as you have MPT working!
As for LoRAs, you are right that LoRA should actually be slower at inference (in theory), but I am excited about LoRAs for these reasons:
- to have private LoRAs on top of the public model
- to use LoRAs to achieve infinite context (real-time learning), see this: https://twitter.com/karpathy/status/1649127655122550784
- to apply multiple LoRAs onto one model, as LoRA learnings aggregate well; note this: https://adapterhub.ml/
Yes adapters and LoRAs are exciting! Thanks for the links!
while I have your attention, I had to change line 33 in main.py to add type=int
Do you mean changing in src/simple_ai/__main__.py:
serving_parser.add_argument("--port", default=8080)
To:
serving_parser.add_argument("--port", default=8080, type=int)
?
yep
Regarding MPT, note that they innovated beyond just producing great model weights; they used half a dozen optimizations to achieve 1.5x-2x faster inference over LLaMA-7B:
- Handles extremely long inputs thanks to ALiBi (trained on inputs up to 65k tokens and can handle up to 84k, vs. 2k-4k for other open-source models).
- Optimized for fast training and inference (via FlashAttention and FasterTransformer)
- Uses the Lion optimizer instead of AdamW, to cut optimizer state memory in half
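On the ALiBi point: the length extrapolation comes from adding a simple per-head linear penalty to attention scores instead of using positional embeddings. A rough sketch of the bias computation, with the slope schedule simplified to head counts that are powers of two (per my reading of the ALiBi paper, so treat the details as an approximation):

```python
def alibi_biases(n_heads, seq_len):
    # per-head slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    slopes = [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]
    # bias[h][q][k] = -slope_h * (q - k): the further back a key is, the larger
    # the penalty; future positions (k > q) are masked out for causal attention
    return [
        [[-s * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
         for q in range(seq_len)]
        for s in slopes
    ]

bias = alibi_biases(n_heads=8, seq_len=4)
```

Because the penalty is linear in distance rather than a learned table, nothing breaks when the sequence at inference time is longer than anything seen in training.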
I feel like your approach of wrapping models in a layer of code is best at taking advantage of these optimizations, as opposed to other similar projects that just configure which pluggable model to load.
Closing, as the problem with the exception is resolved: I had to use a different JSON property, "instruction", since this is an instruction-following model, not a chat model.
Thanks for the pointers, will go into the details ASAP. What a time to be alive!
I feel like your approach with wrapping models in a layer of code is the best in taking advantage of these optimizations, as opposed to other similar projects that just configure which of the pluggable models to load.
Thanks for the kind words, feel free to share the project :)
I also believe it’s valuable not to have tight coupling between models, API, and UI. Lots of initiatives I see take a “fully packaged” approach instead, which is great for experimenting and setting something up quickly, but if you want to go further it has limitations and lacks flexibility.
Will close the issue soon as it’s solved and becoming off-topic, but happy to continue this elsewhere (I have ways to contact me on my profile, or we can use the “Discussions” tab).