baby-code

A simple and 100% Local, Open-Source Code 🐍 Interpreter for 🦙 LLMs

Baby Llama is:

  • powered by Llama.cpp
  • extremely SIMPLE & 100% LOCAL
  • CROSS-PLATFORM.


Leveraging [open source GGUF models](https://huggingface.co/models?search=gguf) and powered by llama.cpp, this project is a humble foundation for enabling LLMs to act as Code Interpreters.

🏗️ Architecture (in a nutshell)

  • 🖥️ Backend: Python Flask (CORS for serving both the API and the HTML).
  • 🌐 Frontend: HTML/JS/CSS (I'm not a frontend dev but gave it my best shot-- prolly tons of issues).
  • ⚙️ Engine: Llama.cpp (an inference library for ggml/gguf models).
  • 🧠 Model: [GGUF](https://github.com/ggerganov/llama.cpp#description) format (replacing the retired ggml format).
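To make the request flow concrete, here is a minimal sketch (not the actual baby_code.py implementation) of a CORS-enabled Flask route forwarding a prompt to the llama.cpp server and returning the completion:

# flow_sketch.py -- illustrative only, not part of this repo
import requests
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # lets the static HTML/JS frontend call the API from the browser
LLAMA_SERVER = "http://127.0.0.1:8080"  # default address of llama.cpp's server

@app.route("/completion", methods=["POST"])
def completion():
    # Pass the prompt and any sampling parameters straight through to llama.cpp
    payload = request.get_json()
    resp = requests.post(f"{LLAMA_SERVER}/completion", json=payload)
    return jsonify(resp.json())

if __name__ == "__main__":
    app.run("127.0.0.1", port=8081)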

🦙 Features

  • 🎊 Confetti:3
  • 💬 Contextual Conversations: Models are augmented with the ongoing context of the conversation-- allowing them to remember and refer back to previous parts of it.
  • 🔄 Dynamic Code Interaction: Copy, Diff, Edit, Save and Run the generated Python scripts right from the chat.
  • 🐞 Auto-Debugging & 🏃 Auto-Run: Allow the model to automatically debug and execute its attempts at fixing issues on the fly (it will die trying).
  • 📊 Inference & Performance Metrics: Stay informed about how fast the model is processing your requests and tally the successful vs failed script executions.
  • ❓ Random Prompts: Not sure what to ask? Click the "Rand" button to randomly pick from a pre-defined prompt list!

🚀 Getting Started

  • Clone the repo:
git clone --recurse-submodules https://github.com/itsPreto/baby-code
  • Navigate to the llama.cpp submodule:
cd baby-code/llama.cpp
  • Install the required libraries:
pip install -r requirements.txt
  • Then repeat the same for the root project:
cd .. && pip install -r requirements.txt

💾 Model Download

  • The Llama-2 based TheBloke/WizardCoder-Python-13B-V1.0-GGUF is a 13B model fine-tuned by a kind redditor (it is the model used in the default config below).
  • You may also download any other models supported by llama.cpp, of any parameter size of your choosing.
  • Keep in mind that the parameters might need to be tuned for your specific case.
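For example, the WizardCoder model used in the default config further down can be fetched with the huggingface_hub package (an optional extra, not listed in this repo's requirements) and placed directly in llama.cpp/models:

# download_model.py -- optional helper, assumes `pip install huggingface_hub`
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/WizardCoder-Python-13B-V1.0-GGUF",
    filename="wizardcoder-python-13b-v1.0.Q5_K_M.gguf",
    local_dir="llama.cpp/models",
)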

⚠️ IMPORTANT ⚠️

  • This project is dependent on its submodule llama.cpp and relies on its successful build.

  • Please refer to the original llama.cpp build instructions to set it up on your specific OS.
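On most setups a plain CPU-only build boils down to something like:

cd llama.cpp && make

Make sure your build produces the server binary, since baby_code.py launches ./llama.cpp/server on startup (see the config below); consult the llama.cpp README for CMake, CUDA, or Metal builds.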

🧠 Model Config

Load your chosen GGUF model for local inference on CPU or GPU by placing it in the llama.cpp/models folder and editing the init config in baby_code.py shown below:

if __name__ == '__main__':
    # Launch the llama.cpp server as a child process:
    #   -m     path to the GGUF model under llama.cpp/models
    #   -c     context size in tokens
    #   -ngl   number of layers to offload to the GPU (0 = CPU only)
    #   --path directory of static files served by server.cpp
    server_process = subprocess.Popen(
        ["./llama.cpp/server", "-m", "./llama.cpp/models/wizardcoder-python-13b-v1.0.Q5_K_M.gguf", "-c", "1024",
         "-ngl", "1", "--path", "."])
    # Give the server a few seconds to load the model before starting Flask
    time.sleep(5)
    app.run(args.host, port=args.port)

You may also want to customize & configure the flask server at the top of the file, like so:

parser = argparse.ArgumentParser(description="An example of using server.cpp with a similar API to OAI. It must be used together with server.cpp.")
parser.add_argument("--stop", type=str, help="the end of the response in chat completions (default: '</s>')", default="</s>")
parser.add_argument("--llama-api", type=str, help="Set the address of server.cpp in llama.cpp (default: http://127.0.0.1:8080)", default='http://127.0.0.1:8080')
parser.add_argument("--api-key", type=str, help="Set the api key to allow only a few users (default: NULL)", default="")
parser.add_argument("--host", type=str, help="Set the ip address to listen on (default: 127.0.0.1)", default='127.0.0.1')
parser.add_argument("--port", type=int, help="Set the port to listen on (default: 8081)", default=8081)

🏃‍♀️ Run it

  • From the project root simply run:
python3 baby_code.py

The llama.cpp server (server.cpp) is served at http://127.0.0.1:8080/ by default, while the Flask app (baby_code.py) listens on port 8081.

🌐 Endpoints

  • POST /completion: Given a prompt, it returns the predicted completion (see the client example after this list).

    Options:

    temperature: Adjust the randomness of the generated text (default: 0.8).

    top_k: Limit the next token selection to the K most probable tokens (default: 40).

    top_p: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).

    n_predict: Set the number of tokens to predict when generating text. Note: May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: 128, -1 = infinity).

    n_keep: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the initial prompt.

    stream: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to true.

    prompt: Provide a prompt. Internally, the prompt is compared against the previous one; any part that has already been evaluated is reused and only the remaining part is evaluated. A space is inserted at the front, as main.cpp does.

    stop: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).

    tfs_z: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).

    typical_p: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).

    repeat_penalty: Control the repetition of token sequences in the generated text (default: 1.1).

    repeat_last_n: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).

    penalize_nl: Penalize newline tokens when applying the repeat penalty (default: true).

    presence_penalty: Repeat alpha presence penalty (default: 0.0, 0.0 = disabled).

    frequency_penalty: Repeat alpha frequency penalty (default: 0.0, 0.0 = disabled).

    mirostat: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).

    mirostat_tau: Set the Mirostat target entropy, parameter tau (default: 5.0).

    mirostat_eta: Set the Mirostat learning rate, parameter eta (default: 0.1).

    seed: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

    ignore_eos: Ignore end of stream token and continue generating (default: false).

    logit_bias: Modify the likelihood of a token appearing in the generated text completion. For example, use "logit_bias": [[15043,1.0]] to increase the likelihood of the token 'Hello', or "logit_bias": [[15043,-1.0]] to decrease its likelihood. Setting the value to false, "logit_bias": [[15043,false]] ensures that the token Hello is never produced (default: []).

  • POST /tokenize: [NOT YET EXPOSED THROUGH baby-code.py] Tokenize a given text.

    Options:

    content: Set the text to tokenize.

    Note that the special BOS token is not added in front of the text, and a space character is not inserted automatically as it is for /completion.

  • POST /embedding: [NOT YET EXPOSED THROUGH baby-code.py] Generate embedding of a given text just as the embedding example does.

    Options:

    content: Set the text to process.

  • POST /run_python_code: Attempts to sanitize, format, and execute the provided Python code. Returns the resulting stdout/stderr.

    Options:

    code: Python code (most likely generated by the LLM).
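Here is a minimal client sketch, assuming the default Flask address of http://127.0.0.1:8081 and JSON request bodies (both assumptions; adjust to match your setup):

# client_example.py -- illustrative only; expects baby_code.py to be running
import requests

BASE = "http://127.0.0.1:8081"  # default --host/--port of the Flask server

# Request a completion using a few of the sampling options listed above
completion = requests.post(f"{BASE}/completion", json={
    "prompt": "Write a Python one-liner that reverses a string.",
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.9,
    "n_predict": 128,
    "stop": ["</s>"],
}).json()
print(completion)

# Run a snippet of Python and get its stdout/stderr back
result = requests.post(f"{BASE}/run_python_code", json={
    "code": "print('hello from baby-code')",
}).json()
print(result)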

🤝 Contributing

Contributions to this project are welcome. Please create a fork of the repository, make your changes, and submit a pull request. I'll be creating a few issues for feature tracking soon!!

ALSO~ If anyone would like to start a Discord channel and help me manage it, that would be awesome (I'm not on it that much).

License

This project is licensed under the MIT License.