Response text missing when using third-party AI frontend with local endpoint
Issue Description:
The endpoint http://localhost:8000/v1/chat/completions fails to return a proper response when integrated with third-party AI chat frontends such as Chatbox and OpenCat. These applications are compatible with the OpenAI API and work correctly with other OpenAI-compatible endpoints, such as Ollama.
Operating System:
- macOS Sonoma 14.6.1
Python Version:
- Python 3.11.4
Affected Applications:
- Chatbox
- OpenCat
Expected Behavior:
When using the endpoint with the third-party AI chat applications, the prompt should be sent to the server, processed correctly, and the response should be returned and displayed within the frontend application.
Observed Behavior:
- The third-party application sends a request to the endpoint, and the request is received and processed correctly by the optillm.py server (e.g., using the re2-gpt4o-mini model).
- The server generates a response, but when the response is returned to the calling app's POST request, the app does not receive the response text.
- The chat output appears empty within the Chatbox and OpenCat applications, even though the server logs indicate that the response was generated.
Steps to Reproduce:
- Set up a local server with the endpoint http://localhost:8000/v1/chat/completions.
- Integrate with a third-party AI chat frontend like Chatbox or OpenCat (configured for OpenAI-compatible endpoints).
- Submit a prompt via the application.
- Observe the server logs, showing the request is processed correctly.
- The response is not displayed within the frontend application (empty output).
Additional Information:
- The issue does not occur when using the Ollama endpoint, which works fine in the same applications.
- The server does return a response, as seen in the terminal, but the response is not visible in the frontend chat UI.
Possible Causes:
- There may be an issue with how the response is being returned to the POST request.
- The format of the response may not exactly match what the third-party applications expect, which in theory should be an OpenAI-compatible response. (A minimal way to check the raw response outside the frontend is sketched below.)
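For reference, here is a minimal sketch of how the raw response can be inspected outside the frontend, assuming the openai Python package is installed and the proxy is running on the default local port (re2-gpt4o-mini is just an example model name):

```python
# Sketch: query the local proxy directly to inspect the raw response,
# bypassing the chat frontend entirely. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

completion = client.chat.completions.create(
    model="re2-gpt4o-mini",  # example 'approach-model' name; adjust to your setup
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)

# If this prints text, the server response is fine and the problem is in how
# the frontend consumes it (e.g., it may expect a streaming response).
print(completion.choices[0].message.content)
```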
Fixed #3 by adding support for streaming responses.
Many of the chat front-ends were expecting a streaming response, which was not supported in the proxy until now. I tested it with Chatbox and it works now.
The only thing to be aware of is that, to choose the optimisation approach in the proxy, we use the slug in front of the model name. But in the Chatbox front-end you cannot choose the model name, so you need to run optillm with the approach you want to test. By default, if nothing is specified, it will use bon (best of n). If you want to try moa, run it as:
python optillm.py --approach moa
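If your front-end does let you set the model name, a rough sketch of picking the approach via the slug prefix and reading the streamed response could look like this (assuming the openai Python package and the proxy on localhost:8000; moa-gpt-4o-mini is only an illustration):

```python
# Sketch: select the approach via a slug prefix on the model name and consume
# the response as a stream, which is what many chat front-ends expect.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

stream = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # "<approach slug>-<model name>"; illustrative only
    messages=[{"role": "user", "content": "Give me three ideas for a blog post."}],
    stream=True,
)

# Print the streamed chunks as they arrive.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```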
Many thanks, this looks fantastic, I will give it a try. On my side, I forked your project and created a little Python chat-AI script that makes use of local small/medium Ollama models such as Gemma2:9b. My script also accepts commands during the chat, such as '/approach' to change the approach on the fly between mcts, moa, etc. It also allows changing the Ollama model on the fly using '/ollama'. The chat is implemented without chat memory to keep it simple and to process only the current prompt. A stripped-down sketch of the idea is below.
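This is not my actual script, just an illustration of the command handling; the defaults are placeholders, and it assumes requests go through the optillm proxy pointed at Ollama:

```python
# Sketch of the on-the-fly command handling described above (illustrative only).
# Assumes `pip install openai` and optillm proxying to Ollama on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

approach = "moa"            # placeholder default approach
ollama_model = "gemma2:9b"  # placeholder default Ollama model

while True:
    prompt = input("> ").strip()
    if prompt.startswith("/approach "):
        approach = prompt.split(maxsplit=1)[1]      # e.g. /approach mcts
        continue
    if prompt.startswith("/ollama "):
        ollama_model = prompt.split(maxsplit=1)[1]  # e.g. /ollama some-other-local-model
        continue
    # No chat memory: only the current prompt is sent.
    reply = client.chat.completions.create(
        model=f"{approach}-{ollama_model}",
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)
```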
In any case, many thanks for putting up this project to make it easy to test and play with different SOTA approaches. Fantastic work!
Many thanks, I just tried it in Chatbox v1.4.2 and it worked flawlessly. One thing: it might be the Chatbox version I am using (their latest), but it actually allowed me to pre-define the list of potential models beforehand.
So I just had to run the Flask server without specifying the approach (in this case I set the base url to point to Ollama):
python optillm.py --base_url http://localhost:11434/v1
Then, when using Chatbox, I can access the list of the different approach-model combinations I defined beforehand in the settings, by directly entering the 'slug-modelname' string (here, modelname is any of the Ollama local models installed on my machine).
I checked on the server side and it is in fact using the approach pre-defined in Chatbox (in this case, 'cot_reflection').
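For scripting the same comparison outside Chatbox, a rough sketch along these lines should work, assuming the proxy was started as above and gemma2:9b is pulled in Ollama (the slugs and model name are just examples from this thread):

```python
# Sketch: compare a few approaches against the same local Ollama model by
# prefixing the approach slug to the model name ('slug-modelname').
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

prompt = "Explain the Monty Hall problem in two sentences."
for slug in ["cot_reflection", "moa", "mcts"]:  # example slugs mentioned in this thread
    response = client.chat.completions.create(
        model=f"{slug}-gemma2:9b",  # any locally available Ollama model works here
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {slug} ---")
    print(response.choices[0].message.content)
```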
Once again, many thanks!