alpaca-lora-7b gives method not found exception
urbien opened this issue · 12 comments
I tried the stableLM-openAssistant example and it works, albeit very slowly on my RTX 3090.
Now trying alpaca and it does not. Perhaps I did something wrong.
My models.toml
[alpaca-lora-7b]
[alpaca-lora-7b.metadata]
owned_by = 'alpaca'
permission = []
description = 'alpaca 7b'
[alpaca-lora-7b.network]
type = 'gRPC'
url = 'localhost:50051'
I rebuilt the docker container as described in examples/alpaca-lora-7b
my query is
curl http://127.0.0.1:30441/chat/completions -H "Content-Type: application/json" -d '{
"model": "alpaca-lora-7b",
"messages": [{"role": "user", "content": "Write us a python program to enumerate from 1 to 10"}],
"temperature": 0.7,
"max_tokens": 256
}'
The error I get:
(python3.9) tadle-325a@ino-client-trdl325a:~/simpleAI$ simple_ai serve --host 127.0.0.1 --port 30441
INFO: Started server process [95638]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30441 (Press CTRL+C to quit)
INFO: 127.0.0.1:41142 - "POST /chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/server.py", line 120, in chat_complete
predictions = llm.chat(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/models.py", line 128, in chat
return chat_client.run(
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/api/grpc/chat/client.py", line 51, in run
return get_chatlog(stub, grpc_chatlog)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/simple_ai/api/grpc/chat/client.py", line 15, in get_chatlog
response = stub.Chat(chatlog)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/grpc/_channel.py", line 1030, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/tadle-325a/miniconda3/envs/python3.9/lib/python3.9/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "Method not found!"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-08T03:11:30.015391213+00:00", grpc_status:12, grpc_message:"Method not found!"}"
>
INFO: 127.0.0.1:55194 - "GET /models HTTP/1.1" 200 OK
Hey,
First, thanks for the interest in this project and for giving it a try. :)
stableLM-openAssistant and it works, albeit very very slowly on my rtx 3090.
Same feedback here with the same GPU, unfortunately. I wonder if there is a way to optimize the model a bit for inference, as it felt slower than other models of comparable size (including the Alpaca one you've been trying).
Regarding the Alpaca example: your curl query is using the /chat/completions endpoint, but this is an instruction-following model, so the example uses the /edits endpoint instead.
Modifying your query to something like:
curl http://127.0.0.1:30441/edits \
-H "Content-Type: application/json" \
-d '{
"model": "alpaca-lora-7b",
"instruction": "Write us a python program to enumerate from 1 to 10"
}'
should work.
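For reference, here is the same call built from Python with only the standard library. This is a sketch based on my local setup: the host/port and the /edits payload shape ("instruction" instead of "messages") come from this thread, not from any official simple_ai client.

```python
import json
from urllib import request

def build_edit_request(base_url="http://127.0.0.1:30441"):
    """Build a POST to the /edits endpoint; note the "instruction" field, not "messages"."""
    payload = {
        "model": "alpaca-lora-7b",
        "instruction": "Write us a python program to enumerate from 1 to 10",
    }
    return request.Request(
        f"{base_url}/edits",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Actually sending it requires the simple_ai server (and its gRPC model backend) to be up:
#   with request.urlopen(build_edit_request()) as resp:
#       print(resp.read().decode())
```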
and it did! Thank you!
while I have your attention: I had to change line 33 in main.py to add type=int
As for speed, the alpaca-lora-7b model runs faster than the stableLM model due to LoRA optimization. This model looks promising for fun testing:
https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-4bit
But among the open-source models, it seems MPT-7B is the best today. Here is its quantized version:
https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g
My Python is still not good enough to venture into creating an adapter for those models as you describe in your blog (I spent most of my engineering time in C, Java, and JavaScript). I would be very happy to test, though, if you whip something up.
Nice!
But among the open-source models, MPT-7B is the best today it seems. Here is its quantized version
https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g
I had the MPT models on my radar and wanted to try a 4-bit quantized version (Llama.cpp with 4-bit quantization was fast even on CPU), but didn't know one existed yet. I will probably give it a go soon, thanks!
As to the speed, alpaca-lora-7b model works faster than the stableLM model due to LoRA optimization.
From my limited understanding of LoRA and the paper, it makes the fine-tuning process efficient and fast, but gives no speedup at inference. Happy to be corrected here.
while I have your attention, I had to change line 33 in main.py to add type=int
Do you mean changing in src/simple_ai/__main__.py:
serving_parser.add_argument("--port", default=8080)
To:
serving_parser.add_argument("--port", default=8080, type=int)
?
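For context on why that change matters: argparse applies `type` only to values supplied on the command line, so without `type=int` the default stays an `int` but `--port 30441` arrives as the string `"30441"`. A minimal repro (the `--port-fixed` flag is just for side-by-side illustration, it isn't in simple_ai):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--port", default=8080)                # no type: CLI value stays a str
parser.add_argument("--port-fixed", default=8080, type=int)  # CLI value coerced to int

args = parser.parse_args(["--port", "30441", "--port-fixed", "30441"])
print(type(args.port).__name__, type(args.port_fixed).__name__)  # str int
```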
awesome, give me a shout as soon as you have MPT working!
As for LoRAs, you are right that LoRA should actually be slower at inference (in theory), but I am excited about LoRAs for these reasons:
- to have private LoRAs on top of the public model
- to use LoRAs to achieve infinite context (real-time learning), see this: https://twitter.com/karpathy/status/1649127655122550784
- to apply multiple LoRAs onto one model, as LoRA learnings aggregate well; note this: https://adapterhub.ml/
Yes adapters and LoRAs are exciting! Thanks for the links!
while I have your attention, I had to change line 33 in main.py to add type=int
Do you mean changing in src/simple_ai/__main__.py:
serving_parser.add_argument("--port", default=8080)
To:
serving_parser.add_argument("--port", default=8080, type=int)
?
yep
Regarding MPT, note that they innovated beyond just producing great model weights; they used half a dozen optimizations to achieve 1.5x-2x faster inference over LLaMA-7B:
- Handles extremely long inputs thanks to ALiBi (trained on inputs up to 65k tokens and can handle up to 84k, vs. 2k-4k for other open-source models).
- Optimized for fast training and inference (via FlashAttention and FasterTransformer)
- Uses the Lion optimizer instead of AdamW, to cut optimizer state memory in half
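On the ALiBi point: the length extrapolation comes from adding a simple per-head linear penalty to attention scores instead of using positional embeddings. A rough sketch of the bias computation, with the slope schedule simplified to head counts that are powers of two (per my reading of the ALiBi paper, so treat the details as an approximation):

```python
def alibi_biases(n_heads, seq_len):
    # per-head slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    slopes = [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]
    # bias[h][q][k] = -slope_h * (q - k): the further back a key is, the larger
    # the penalty; future positions (k > q) are masked out for causal attention
    return [
        [[-s * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
         for q in range(seq_len)]
        for s in slopes
    ]

bias = alibi_biases(n_heads=8, seq_len=4)
```

Because the penalty is linear in distance rather than a learned table, nothing breaks when the sequence at inference time is longer than anything seen in training.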
I feel like your approach of wrapping models in a layer of code is best at taking advantage of these optimizations, as opposed to other similar projects that just configure which pluggable model to load.
Closing, as the problem with the exception is resolved: I had to use a different JSON property, "instruction", since this is an instruction-following model, not a chat model.
Thanks for the pointers, will go into the details ASAP. What a time to be alive!
I feel like your approach with wrapping models in a layer of code is the best in taking advantage of these optimizations, as opposed to other similar projects that just configure which of the pluggable models to load.
Thanks for the kind words, feel free to share the project :)
I also believe it’s valuable not to have tight coupling between models, API, and UI. Lots of initiatives I see take a “fully packaged” approach instead, which is great for experimenting and setting something up quickly, but if you want to go further it has limitations and lacks flexibility.
Will close the issue soon as it’s solved and becoming off-topic, but happy to continue this elsewhere (I have ways to contact me on my profile, or we can use the “Discussions” tab).