OoriData/OgbujiPT

Implement direct-to-llama.cpp server client class

uogbuji opened this issue · 4 comments

Right now we have 2 flavors of LLM client class: ogbujipt.llm_wrapper.openai_api, which wraps the OpenAI API, and ogbujipt.llm_wrapper.ctransformer, which wraps ctransformers for in-process (local program space) hosting. Add another which wraps the llama.cpp server's direct HTTP API.

Targeted for 0.8.0 release.

Build the llama.cpp server executable using make, cmake or whatever method works. There doesn't seem to be a make install target, so I just did:

mkdir ~/.local/bin/llamacpp
cp server ~/.local/bin/llamacpp

You can then run it against a downloaded GGUF model, e.g.

~/.local/bin/llamacpp/server -m ~/.local/share/models/TheBloke_OpenHermes-2.5-Mistral-7B-16k-GGUF/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf --host 0.0.0.0 --port 8000 -c 4096 --log-format text --path ~/.local/share/llamacpp/

That runs with a 4K context, listening globally (the default host is loopback-only 127.0.0.1) on port 8000. The llama.cpp server README details the command-line options.
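A quick way to confirm the server is up and the model is loaded before poking at it further (this assumes the /health endpoint described in the server README):

import httpx  # pip install httpx

# GET /health returns a small JSON status payload once the model is ready
resp = httpx.get('http://localhost:8000/health', timeout=5.0)
print(resp.status_code, resp.json())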

curl --request POST \
    --url http://localhost:8000/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
Sample response:
{
    "content": "\n\n1. Plan your website: You will need to decide what your website’s purpose is, who its target audience is, and how you want it to look. This can be done by writing down some ideas on paper or by using a website planning template online. It will also help if you have an idea of what kind of content you want to include on the site and any features that you would like to incorporate into your design. You should also consider how you want your visitors to navigate around your site.\n\n2. Choose a domain name: A domain name is the address of your website which will appear in the URL",
    "generation_settings": {
        "dynatemp_exponent": 1.0,
        "dynatemp_range": 0.0,
        "frequency_penalty": 0.0,
        "grammar": "",
        "ignore_eos": false,
        "logit_bias": [],
        "min_keep": 0,
        "min_p": 0.05000000074505806,
        "mirostat": 0,
        "mirostat_eta": 0.10000000149011612,
        "mirostat_tau": 5.0,
        "model": "/Users/uche/.local/share/models/TheBloke_OpenHermes-2.5-Mistral-7B-16k-GGUF/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf",
        "n_ctx": 4096,
        "n_keep": 0,
        "n_predict": -1,
        "n_probs": 0,
        "penalize_nl": true,
        "penalty_prompt_tokens": [],
        "presence_penalty": 0.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.100000023841858,
        "samplers": [
            "top_k",
            "tfs_z",
            "typical_p",
            "top_p",
            "min_p",
            "temperature"
        ],
        "seed": 4294967295,
        "stop": [],
        "stream": false,
        "temperature": 0.800000011920929,
        "tfs_z": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "typical_p": 1.0,
        "use_penalty_prompt_tokens": false
    },
    "model": "/Users/uche/.local/share/models/TheBloke_OpenHermes-2.5-Mistral-7B-16k-GGUF/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf",
    "prompt": "Building a website can be done in 10 simple steps:",
    "slot_id": 0,
    "stop": true,
    "stopped_eos": false,
    "stopped_limit": true,
    "stopped_word": false,
    "stopping_word": "",
    "timings": {
        "predicted_ms": 3589.666,
        "predicted_n": 128,
        "predicted_per_second": 35.65791357747489,
        "predicted_per_token_ms": 28.044265625,
        "prompt_ms": 296.383,
        "prompt_n": 14,
        "prompt_per_second": 47.2361775135551,
        "prompt_per_token_ms": 21.170214285714284
    },
    "tokens_cached": 141,
    "tokens_evaluated": 14,
    "tokens_predicted": 128,
    "truncated": false
}
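For reference, here's roughly the same call from Python, i.e. the sort of thing the new wrapper class will do under the hood. This sketch uses httpx as the async HTTP client purely for illustration; it's not necessarily what the ogbujipt implementation will use.

import asyncio
import httpx  # pip install httpx

async def llama_completion(base_url, prompt, **kwargs):
    # POST to the llama.cpp server's /completion endpoint;
    # extra kwargs pass straight through as generation params (n_predict, min_p, etc.)
    payload = {'prompt': prompt, **kwargs}
    async with httpx.AsyncClient() as client:
        resp = await client.post(f'{base_url}/completion', json=payload, timeout=60.0)
        resp.raise_for_status()
        return resp.json()

resp = asyncio.run(llama_completion(
    'http://localhost:8000',
    'Building a website can be done in 10 simple steps:',
    n_predict=128))
print(resp['content'])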

A chat example. Notice the use of min_p, which I don't think is possible via the OpenAI API.

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [
{"role": "system",
    "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping the user with their requests."},
{"role": "user",
    "content": "Write a limerick about python exceptions"}
], "min_p": 0.05}'
Sample response:
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "There once was a coder named Sue,\nWho wrote code with Python so true,\nBut exceptions she'd meet,\nWith errors so sweet,\nThat left her to debug and review.",
                "role": "assistant"
            }
        }
    ],
    "created": 1709432913,
    "id": "chatcmpl-c4r6VUSbrejd2RfG4tuQWacEYEsGtBnM",
    "model": "unknown",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 42,
        "prompt_tokens": 48,
        "total_tokens": 90
    }
}
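Same idea against the OpenAI-style chat endpoint, with min_p passed straight through. Again, just an illustrative httpx sketch, not the ogbujipt code:

import asyncio
import httpx

async def llama_chat(base_url, messages, **kwargs):
    # /v1/chat/completions is OpenAI-compatible, but the llama.cpp server
    # also accepts its own sampling params (e.g. min_p) in the same payload
    payload = {'messages': messages, **kwargs}
    async with httpx.AsyncClient() as client:
        resp = await client.post(f'{base_url}/v1/chat/completions', json=payload, timeout=60.0)
        resp.raise_for_status()
        return resp.json()

messages = [
    {'role': 'system', 'content': 'You are an AI assistant. Your top priority is achieving user fulfillment via helping the user with their requests.'},
    {'role': 'user', 'content': 'Write a limerick about python exceptions'}]
resp = asyncio.run(llama_chat('http://localhost:8000', messages, min_p=0.05))
print(resp['choices'][0]['message']['content'])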

Informally, it feels a lot faster than running the same curls against locally-hosted llama-cpp-python (OpenAI API).

Examples of API usage. Chat:

import asyncio; from ogbujipt.llm_wrapper import prompt_to_chat, llama_cpp_http_chat
llm_api = llama_cpp_http_chat('http://localhost:8000')
resp = asyncio.run(llm_api(prompt_to_chat('Knock knock!'), min_p=0.05))
llm_api.first_choice_message(resp)

Non-chat:

import asyncio; from ogbujipt.llm_wrapper import llama_cpp_http
llm_api = llama_cpp_http('http://localhost:8000')
resp = asyncio.run(llm_api('Knock knock!', min_p=0.05))
resp['content']

Implemented support for llama.cpp-style API keys. Chat example:

import os, asyncio; from ogbujipt.llm_wrapper import prompt_to_chat, llama_cpp_http_chat
LLAMA_CPP_APIKEY = os.environ.get('LLAMA_CPP_APIKEY')
llm_api = llama_cpp_http_chat('http://localhost:8000', apikey=LLAMA_CPP_APIKEY)
resp = asyncio.run(llm_api(prompt_to_chat('Knock knock!'), min_p=0.05))
llm_api.first_choice_message(resp)

Non-chat:

import os, asyncio; from ogbujipt.llm_wrapper import llama_cpp_http
LLAMA_CPP_APIKEY = os.environ.get('LLAMA_CPP_APIKEY')
llm_api = llama_cpp_http('http://localhost:8000', apikey=LLAMA_CPP_APIKEY)
resp = asyncio.run(llm_api('Knock knock!', min_p=0.05))
resp['content']

This work has made it clear to me that the __call__ methods of all the ogbujipt.llm_wrapper classes should always have been async. Just ripping the bandage off now and making that change 😬
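For the record, the shape is simply that __call__ becomes a coroutine, so every wrapper is awaited the same way. A rough, hypothetical sketch (the class name and details are illustrative only, not the actual ogbujipt code; it also assumes the API key is sent as a Bearer token, per the llama.cpp server's --api-key support):

import asyncio
import httpx

class llama_cpp_http_sketch:
    '''Hypothetical minimal shape of an async-callable wrapper (illustrative only)'''
    def __init__(self, base_url, apikey=None):
        self.base_url = base_url
        self.apikey = apikey

    async def __call__(self, prompt, **kwargs):
        # async __call__ means every wrapper is used uniformly: await llm_api(...)
        headers = {'Authorization': f'Bearer {self.apikey}'} if self.apikey else {}
        async with httpx.AsyncClient() as client:
            resp = await client.post(f'{self.base_url}/completion',
                                     json={'prompt': prompt, **kwargs}, headers=headers)
            resp.raise_for_status()
            return resp.json()

# Usage mirrors the examples above
llm_api = llama_cpp_http_sketch('http://localhost:8000')
resp = asyncio.run(llm_api('Knock knock!', min_p=0.05))
print(resp['content'])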