I see your guide for running locally; can you give me an example of running via the API?
Hello, the algorithm you provided is quite novel, but I want to run it through the API. Can you give me an example? I tried to modify the code, but most of the time running on the API gives me a connection error.
Hi, I suggest you take a look at the `generate` function in the `IO_System` class to see how to use gpt-3.5-turbo. The author provides this as a guide for using the OpenAI API.
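In case it helps, that gpt-3.5-turbo path boils down to something like the following (a minimal sketch using the official `openai` Python client; the repo's actual wrapper may differ in names and defaults):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, world"}],
    max_tokens=256,
    n=2,            # number of completions, i.e. num_return
    stop=["\n\n"],  # stop tokens
    temperature=0.8,
)
outputs = [choice.message.content for choice in response.choices]
```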
Alternatively, you can refer to my solution below. I have deployed my own vLLM service on the server. I tried it, and it works well. This setup allows me to experiment with different models, and it's faster.
Notice:
1. Be sure to add the required arguments in `arguments.py` (see the argparse sketch after this list).
2. Verification required: I switched from running the `Qwen2.5-7B-Instruct` model locally to invoking the `Qwen2.5-72B-Instruct` API, which showed a 68% speed improvement in my test. However, it resulted in half the number of total calls and seemed to process significantly fewer tokens. I've only run this code briefly, so I may check later to confirm whether these findings are due to the different models or to my modifications.
↑ Regarding the second point: I had previously missed the `num_return` parameter. After adding it back to the API request, the final time cost and token consumption returned to normal.
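For the first point, here is a minimal sketch of the `arguments.py` additions (the parser variable and defaults are assumptions; the argument names match the fields read in the snippet below):

```python
import argparse

parser = argparse.ArgumentParser()
# ... the repo's existing arguments here
# the repo may already define --api; keep whatever it has and pass --api vllm_api
parser.add_argument("--api_url", type=str,
                    default="http://localhost:8000/v1/chat/completions",  # assumed vLLM chat endpoint
                    help="URL of the deployed vLLM OpenAI-compatible chat completions endpoint")
parser.add_argument("--api_model_name", type=str,
                    default="Qwen/Qwen2.5-72B-Instruct",  # assumed model id registered with the server
                    help="Model name as served by vLLM")
args = parser.parse_args()
```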
Fixed version:
```python
import requests


class IO_System:
    """Input/Output system"""

    def __init__(self, args, tokenizer, model) -> None:
        # ... former code here
        # added
        self.api_url = args.api_url
        self.model_name = args.api_model_name

    def generate(self, model_input, max_tokens: int, num_return: int, stop_tokens):
        io_output_list = []
        if self.api == "vllm_api":
            if isinstance(model_input, str):
                model_input = [model_input]
            if isinstance(model_input, list):
                for _input in model_input:
                    params = {
                        "model": self.model_name,
                        "stream": False,
                        "temperature": self.temperature,
                        "top_k": self.top_k,
                        "top_p": self.top_p,
                        "max_tokens": max_tokens,
                        "stop": stop_tokens,
                        "n": num_return,  # request num_return completions per input
                        "messages": [{"role": "user", "content": _input}],
                    }
                    try:
                        vllm_response = requests.post(self.api_url, json=params, timeout=600).json()
                        # read all num_return choices, not just choices[0]
                        outputs = [choice["message"]["content"] for choice in vllm_response["choices"]]
                        token_count = vllm_response["usage"]["completion_tokens"]
                    except Exception as e:
                        raise RuntimeError(f"API Requests Error: {e}")
                    io_output_list.extend(outputs)
                    self.call_counter += 1
                    self.token_counter += token_count
            return io_output_list
        # ... former code here
```
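To try the patch end to end, I start vLLM's OpenAI-compatible server with something like `python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-72B-Instruct --port 8000`, then run a quick smoke test. The `args` fields below are assumptions mirroring the snippet above; the repo's `__init__` may read more fields than these, and `tokenizer`/`model` are unused on the API path:

```python
from types import SimpleNamespace

# Hypothetical args; adjust to whatever arguments.py actually defines.
args = SimpleNamespace(
    api="vllm_api",
    api_url="http://localhost:8000/v1/chat/completions",
    api_model_name="Qwen/Qwen2.5-72B-Instruct",
    temperature=0.8,
    top_k=40,
    top_p=0.95,
)
io_system = IO_System(args, tokenizer=None, model=None)
outputs = io_system.generate("What is 7 * 8?", max_tokens=64, num_return=2, stop_tokens=["\n\n"])
print(len(outputs), outputs)  # expect num_return completions
```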