Functionary

Functionary is a language model that can interpret and execute functions/plugins.

The model decides when to run a function and can interpret the output of the function call. It calls functions only when necessary. The function definitions are provided as JSON Schema objects, just like OpenAI GPT function calls, and we provide a drop-in replacement server.

We don't change the logit probabilities to conform to a certain schema; the model itself knows how to conform. This allows us to use existing tools and caching systems with ease.

Based on Llama 2.

OpenAI compatible server

Setup

Make sure you have PyTorch installed. Then:

pip install -r requirements.txt
python3 server.py --model "musabgultekin/functionary-7b-v1"

Server Usage

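import openai

# Uses the pre-1.0 openai Python package interface (openai.ChatCompletion).
# Point the client at the local server instead of the OpenAI API.
openai.api_key = ""
openai.api_base = "http://localhost:8000/v1"

response = openai.ChatCompletion.create(
    model="musabgultekin/functionary-7b-v1",
    messages=[{"role": "user", "content": "What is the weather for Istanbul?"}],
    # Function definitions are JSON Schema objects, as in OpenAI function calling.
    functions=[{
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
            },
            "required": ["location"],
        },
    }]
)
print(response)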
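
Once the server returns a function call, the client still has to run the function and send the result back. The following is a minimal sketch of that step, assuming the response message follows the OpenAI function-calling shape (a `function_call` field with `name` and a JSON-encoded `arguments` string). The `get_current_weather` implementation, the `AVAILABLE_FUNCTIONS` registry, the `run_function_call` helper, and the example message are all hypothetical and exist only for illustration.

```python
import json


def get_current_weather(location: str) -> dict:
    # Hypothetical local implementation; a real one would query a weather API.
    return {"location": location, "temperature_c": 22, "condition": "sunny"}


# Map function names (as declared in `functions`) to Python callables.
AVAILABLE_FUNCTIONS = {"get_current_weather": get_current_weather}


def run_function_call(message: dict):
    """Execute the function requested by an assistant message, if any.

    Returns a `function`-role message to append to the conversation,
    or None when the assistant answered in plain text.
    """
    call = message.get("function_call")
    if call is None:
        return None
    func = AVAILABLE_FUNCTIONS[call["name"]]
    arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
    result = func(**arguments)
    return {"role": "function", "name": call["name"], "content": json.dumps(result)}


# Made-up assistant message in the OpenAI function-calling shape.
example_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "get_current_weather",
        "arguments": '{"location": "Istanbul"}',
    },
}
function_message = run_function_call(example_message)
print(function_message)
```

The returned message can be appended to `messages` and sent in a second request so the model can interpret the function output.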

Standalone Usage:

See: inference.py

python3 inference.py

Training

We use the standard HuggingFace Trainer. The loss is calculated only on assistant outputs and assistant function calls, not on function responses or function definitions.

We use hyperparameters similar to those in the Llama 2 paper, except for a larger weight decay (0.3 instead of 0.1) and a warmup of 0.03, to reduce overfitting since we sample the function calling example conversations 2x. An ablation study is still required.

We use a transformers version after this commit, as it fixes OOM for FSDP training on Llama 2.
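
Restricting the loss to assistant tokens is commonly done by masking the labels of every other token. The sketch below illustrates the idea on plain token-id lists; it is not the project's training code. The role names, the `build_labels` helper, and the token ids are assumptions made for illustration. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Hypothetical role names: only these segments contribute to the loss.
TRAINABLE_ROLES = {"assistant", "assistant_function_call"}


def build_labels(segments):
    """Build (input_ids, labels) for one conversation.

    `segments` is a list of (role, token_ids) pairs in conversation order.
    Tokens from non-trainable roles keep their place in `input_ids` but get
    IGNORE_INDEX as the label, so they add nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in TRAINABLE_ROLES:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token ids for a short conversation.
conversation = [
    ("function_definitions", [11, 12, 13]),
    ("user", [21, 22]),
    ("assistant_function_call", [31, 32]),
    ("function_response", [41]),
    ("assistant", [51, 52, 53]),
]
input_ids, labels = build_labels(conversation)
print(labels)
```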

Hyperparameters:

Batch size: 64
Learning rate: 2e-5
Epochs: 2
Max length: 4096
Weight decay: 0.3
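
As a rough illustration, the values above can be collected in a small config object from which per-run quantities are derived. This sketch assumes that "Batch size: 64" is the global batch size and that the warmup of 0.03 is a fraction of total training steps; neither is confirmed here. The `TrainConfig` class, the helper functions, and the device counts and dataset size in the usage lines are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    global_batch_size: int = 64   # assumed to be the global (not per-device) batch size
    learning_rate: float = 2e-5
    num_epochs: int = 2
    max_length: int = 4096
    weight_decay: float = 0.3
    warmup_ratio: float = 0.03    # assumed to be a fraction of total training steps


def gradient_accumulation_steps(cfg, per_device_batch_size, num_devices):
    """Accumulation steps needed to reach the global batch size."""
    per_step = per_device_batch_size * num_devices
    if cfg.global_batch_size % per_step != 0:
        raise ValueError("global batch size must be divisible by per-step batch size")
    return cfg.global_batch_size // per_step


def warmup_steps(cfg, num_examples):
    """Number of warmup steps for a dataset of `num_examples` conversations."""
    steps_per_epoch = math.ceil(num_examples / cfg.global_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    return math.ceil(total_steps * cfg.warmup_ratio)


cfg = TrainConfig()
# Hypothetical hardware and dataset size, for illustration only.
print(gradient_accumulation_steps(cfg, per_device_batch_size=2, num_devices=8))
print(warmup_steps(cfg, num_examples=6400))
```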

More info and training code will be shared soon.

Evaluation

--- Work In Progress ---

Due to its unique nature, the model requires a custom evaluation suite. We can probably evaluate with gpt-4-0613, likely with an approach similar to LLM Judge.

Dataset

--- Work In Progress ---

Dataset preparation process consists of several steps:

Function Definitions Conversion: We begin by selecting multiple function definitions and converting them into TypeScript definitions. This approach benefits from the model's prior exposure to TypeScript tokens during the pretraining phase. See how we do it. Also see Microsoft TypeChat.
Human Prompts Generation: We then create human prompts that incorporate the converted TypeScript function definitions.
Function Calls Generation: Following the generation of human prompts, we proceed to generate corresponding function calls.
Function Answers Generation: Once function calls have been generated, we derive the outputs that these function calls would produce.
Function Answers Interpretation: After procuring function answers, we generate language model answers for the function response, so the model knows how to interpret the function response.
Merging and Training: We combine all the generated elements (prompts, function calls, function answers, and their interpretations) using custom formatting. This consolidated dataset is then used for the model's training.
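
To make the first step more concrete, here is a minimal sketch of turning a JSON Schema function definition into a TypeScript-style declaration. It is not the project's actual converter, and the output layout is an assumption; the `function_to_typescript` helper and the `JSON_TO_TS` type map are hypothetical names. Only flat parameters with primitive types are handled.

```python
# Map JSON Schema primitive types to TypeScript types (anything else becomes `any`).
JSON_TO_TS = {
    "string": "string",
    "number": "number",
    "integer": "number",
    "boolean": "boolean",
}


def function_to_typescript(function: dict) -> str:
    """Render one function definition as a TypeScript-style type declaration."""
    params = function.get("parameters", {})
    properties = params.get("properties", {})
    required = set(params.get("required", []))

    lines = []
    if function.get("description"):
        lines.append(f"// {function['description']}")
    lines.append(f"type {function['name']} = (_: {{")
    for name, schema in properties.items():
        if schema.get("description"):
            lines.append(f"  // {schema['description']}")
        optional = "" if name in required else "?"  # mark non-required fields
        ts_type = JSON_TO_TS.get(schema.get("type"), "any")
        lines.append(f"  {name}{optional}: {ts_type},")
    lines.append("}) => any;")
    return "\n".join(lines)


weather_function = {
    "name": "get_current_weather",
    "description": "Get the current weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA",
            },
        },
        "required": ["location"],
    },
}
ts_definition = function_to_typescript(weather_function)
print(ts_definition)
```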

Note: Llama 2 70b is capable of doing all synthetic data generation.

More information about this process will be provided as soon as possible.

v0.1

Data Sources:

ShareGPT 34K
Synthetic function calling dataset (2.7k examples)

Observations: This version showed limitations in handling multi-prompt conversations, likely due to the absence of multiple instructions in the function calling dataset. Hallucinations are also common; we likely need more conversation data.

v0.2

Data Sources:

ShareGPT 53K
Synthetic function calling dataset (3.5k examples). Sampled 2 times.

v1

Data Sources:

Same as v0.2

Observations: Compared to v0.2, because the model supports 4k context sizes, it is much more resilient to longer conversations and longer function definitions. We also switched to Llama 2.

Roadmap

If I can save more money, I'll train a Llama 2 13B model too, with 2x more data.
OpenAPI specification based plugin support.
Fast inference server (vLLM or text-generation-inference)
Python function calling support (Automatic detection of type annotations and calling them automatically)
Real world usage examples, such as creating agents.
Please consider opening a PR for future requests
