microsoft/vscode-ai-toolkit

Issue with POSTMAN API Calling

shreyan1999 opened this issue · 6 comments

Hello,

I have tried to make a simple POST call using the URL "http://127.0.0.1:5272/v1/chat/completions" , and the body as follows,

{
"model": "Phi-3-mini-128k-cuda-int4-onnx",
"messages": [
{
"role": "user",
"content": "Hi"
}
],
"temperature": 0.7,
"top_p": 1,
"top_k": 10,
"max_tokens": 100,
"stream": true
}

When I make the API calling, the response body is being shown as follows,

data: {
"id": "chat.id.16",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " Hello",
"tool_calls": []
},
"logprobs": null,
"index": 0,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " Hello",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.17",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": "!",
"tool_calls": []
},
"logprobs": null,
"index": 1,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": "!",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.18",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " How",
"tool_calls": []
},
"logprobs": null,
"index": 2,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " How",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.19",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " can",
"tool_calls": []
},
"logprobs": null,
"index": 3,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " can",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.20",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " I",
"tool_calls": []
},
"logprobs": null,
"index": 4,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " I",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.21",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " assist",
"tool_calls": []
},
"logprobs": null,
"index": 5,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " assist",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.22",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " you",
"tool_calls": []
},
"logprobs": null,
"index": 6,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " you",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.23",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": " today",
"tool_calls": []
},
"logprobs": null,
"index": 7,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": " today",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.24",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": "?",
"tool_calls": []
},
"logprobs": null,
"index": 8,
"finish_reason": null,
"delta": {
"role": "assistant",
"content": "?",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: {
"id": "chat.id.25",
"created": 1718722897,
"choices": [
{
"message": {
"role": "assistant",
"content": "",
"tool_calls": []
},
"logprobs": null,
"index": 9,
"finish_reason": "stop",
"delta": {
"role": "assistant",
"content": "",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

data: [DONE
]

The response is split into multiple subclasses for every word. However when I remove the following section from the body " "temperature": 0.7,
"top_p": 1,
"top_k": 10,
"max_tokens": 100,
"stream": true"

The response is coming twice without the words being divided into subclasses.

Tagging @leestott

Hi replicated this issue

Postman post method http://127.0.0.1:5272/v1/chat/completions

JSON
{
"model": "Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
"messages": [
{
"role": "user",
"content": "Hi"
}
],
"temperature": 0.7,
"top_p": 1,
"top_k": 10,
"max_tokens": 100,
"stream": true
}

The issue is the response is being brought back in sub classes not a single line

Additional update

If you remove the temp-stream you get this output which is duplicated.

{
"id": "chat.id.14",
"created": 1718722757,
"choices": [
{
"message": {
"role": "assistant",
"content": " Hello! How can I assist you today?",
"tool_calls": []
},
"logprobs": null,
"index": 0,
"finish_reason": "stop",
"delta": {
"role": "assistant",
"content": " Hello! How can I assist you today?",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

@shreyan1999 @leestott , the local API is uses the OpenAI Chat Completion contract, which supports streaming.

So if using "stream": true, the response is split into chunks and please use the delta field.
If using "stream": false or removing "stream", the response is the whole message and please use the message field.

{
"id": "chat.id.1",
"created": 1719684353,
"choices": [
{
"message": {
"role": "assistant",
"content": " Hello! How can I assist you today?",
"tool_calls": []
},
"logprobs": null,
"index": 0,
"finish_reason": "stop",
"delta": {
"role": "assistant",
"content": " Hello! How can I assist you today?",
"tool_calls": []
}
}
],
"prompt_filter_results": [],
"usage": null
}

Thank you for the resolution @swatDong .Yes the issue was resolve upon setting the parameter to False. The response is still coming out twice, once under the choices->message-> content section and second time in the delta->content section. As we can have the first occurrence for the content message extraction. What is the second occurrence exactly notify?

@shreyan1999 - when using stream: false, please just ignore the delta section. The both occurrences are exactly the same.

I'll check internally for the duplicated content issue.

No activity for 2 weeks, closing this issue.
Feel free to comment or reopen, and we will re-investigate.