zhentingqi/rStar

I see your guide for running locally; can you give an example of running via API?


Hello, the algorithm you provide is quite novel, but I want to run it through an API. Can you give me an example? I tried to modify the code, but most of the time running against the API gives me a connect error.

Hi, I suggest you take a look at the generate function in the IO_System class to see how to use gpt-3.5-turbo. The author provides it as a guide for using the OpenAI API.
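For reference, a minimal sketch of such a call with the openai Python client (v1.x); the prompt and sampling values here are placeholders, not the repo's defaults, and OPENAI_API_KEY is assumed to be set in the environment:

# Minimal sketch: calling gpt-3.5-turbo via the openai client (v1.x).
# Prompt and sampling values are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is 2 + 3?"}],
    max_tokens=256,
    n=4,  # number of completions, analogous to num_return below
    stop=["\n\n"],
)
answers = [choice.message.content for choice in response.choices]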

Alternatively, you can refer to my solution below. I have deployed my own vLLM service on the server. I tried it, and it works well. This setup allows me to experiment with different models, and it's faster.
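For context, this is roughly how I verify the endpoint is up before wiring it into the repo; the URL, port, and model name are assumptions for my own deployment of a vLLM OpenAI-compatible server:

# Quick smoke test against a self-hosted vLLM OpenAI-compatible server.
# Assumes it was launched with something like:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-72B-Instruct --port 8000
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed URL/port
    json={
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])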

Notice:

1. Be sure to add the required arguments in arguments.py (see the argparse sketch after this list).

2. Verification required: I switched from running the Qwen2.5-7B-Instruct model locally to invoking the Qwen2.5-72B-Instruct API, which showed a 68% speed improvement in my test. However, it resulted in half the number of total calls and seemed to process significantly fewer tokens. I have only run this code briefly, so I may check later whether these findings are due to the different models or to my modifications.

↑ Update on the second point: I had previously missed the num_return parameter. After adding it back to the API request, the total time cost and token consumption returned to normal.
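The arguments from point 1 can be added roughly like this (a sketch, assuming arguments.py builds a standard argparse.ArgumentParser named parser; the default values are placeholders):

# Sketch for arguments.py; flag names match the fields read in __init__ below.
parser.add_argument("--api_url", type=str,
                    default="http://localhost:8000/v1/chat/completions",  # placeholder
                    help="URL of the OpenAI-compatible chat completions endpoint")
parser.add_argument("--api_model_name", type=str,
                    default="Qwen/Qwen2.5-72B-Instruct",  # placeholder
                    help="Model name to send with each API request")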

Fixed version:

import requests


class IO_System:
    """Input/Output system"""

    def __init__(self, args, tokenizer, model) -> None:
        # ... former code here
        # added: endpoint and model name for the OpenAI-compatible API
        self.api_url = args.api_url
        self.model_name = args.api_model_name

    def generate(self, model_input, max_tokens: int, num_return: int, stop_tokens):
        io_output_list = []

        if self.api == "vllm_api":
            # Normalize a single prompt into a one-element batch.
            if isinstance(model_input, str):
                model_input = [model_input]

            for _input in model_input:
                params = {
                    "model": self.model_name,
                    "stream": False,
                    "temperature": self.temperature,
                    "top_k": self.top_k,
                    "top_p": self.top_p,
                    "max_tokens": max_tokens,
                    "stop": stop_tokens,
                    "n": num_return,  # ask for num_return completions per prompt
                    "messages": [{"role": "user", "content": _input}],
                }
                try:
                    vllm_response = requests.post(self.api_url, json=params, timeout=600).json()
                    # Keep every returned choice, not only the first one;
                    # otherwise num_return - 1 completions are paid for and discarded.
                    outputs = [choice["message"]["content"] for choice in vllm_response["choices"]]
                    token_count = vllm_response["usage"]["completion_tokens"]
                except Exception as e:
                    raise RuntimeError(f"API request error: {e}") from e

                io_output_list.extend(outputs)
                self.call_counter += 1
                self.token_counter += token_count

            return io_output_list
        # ... former code here
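For completeness, a hypothetical call site, assuming args.api == "vllm_api" and an IO_System instance named io built with the new arguments:

# Hypothetical usage of the patched generate(); values are illustrative.
outputs = io.generate(
    "What is 2 + 3?",
    max_tokens=256,
    num_return=4,
    stop_tokens=["\n\n"],
)
# With the fix above, outputs holds 4 completions for the one prompt,
# and io.token_counter accounts for all 4.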