A framework for evaluating function calling and JSON output of LLMs using the Hermes tool-call and JSON-mode formats.
This script evaluates the performance of a language model on function-calling and JSON-output tasks. It preprocesses prompts, runs model completions, parses the function calls or JSON objects in the completions, validates them, and calculates the pass rate.
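The parse, validate and score steps for function calling can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: it assumes each call in a completion is a JSON object with `name` and `arguments` wrapped in `<tool_call>...</tool_call>` tags (the Hermes tool-call convention), and the helpers `parse_tool_calls`, `calls_match` and `pass_rate` are hypothetical names. A sample passes only if the predicted calls equal the expected ones, ignoring call order and argument key order.

```python
import json
import re
from collections import Counter

# Assumption: Hermes-style output wraps each call as
# <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(completion: str) -> list:
    """Return every well-formed JSON object found inside <tool_call> tags."""
    calls = []
    for body in TOOL_CALL_RE.findall(completion):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed call: dropped, so it cannot count toward a match
        if isinstance(obj, dict):
            calls.append(obj)
    return calls


def _call_key(call: dict) -> tuple:
    # Canonicalize arguments so key order does not affect the comparison.
    return (call.get("name"), json.dumps(call.get("arguments", {}), sort_keys=True))


def calls_match(predicted: list, expected: list) -> bool:
    """Pass only if predicted and expected calls agree exactly, ignoring order."""
    return Counter(map(_call_key, predicted)) == Counter(map(_call_key, expected))


def pass_rate(results: list) -> float:
    """Fraction of samples that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


completion = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
expected = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
ok = calls_match(parse_tool_calls(completion), expected)
print(ok, pass_rate([ok, False]))  # True 0.5
```

Comparing multisets (`Counter`) rather than lists keeps the check strict about duplicated or missing calls while tolerating a different call order.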
- Clone the repository or copy the script to your local machine.
git clone https://github.com/your-repo/function-calling-eval.git
cd function-calling-eval/tool_eval
- Install the required dependencies:
pip install -r requirements.txt
MAX_JOBS=4 pip install flash-attn --no-build-isolation

Arguments:

- --model_path: Path to the model folder (required).
- --chat_template: Chat template for prompt formatting (default: "chatml").
- --num_fewshot: Option to subset the evaluation dataset (default: None).
- --dataset_path: Path to the Hugging Face dataset (default: "NousResearch/func-calling-eval" for function calling and "NousResearch/json-mode-eval" for JSON mode).
- --load_in_4bit: Option to load the model in 4-bit mode with bitsandbytes (default: "False").
- --dpo: Option to save the dataset for DPO (default: "False").
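The flags above can be modeled with a small `argparse` parser. This is a hypothetical sketch that mirrors the documented names and defaults, not the parser in `evaluator.py`; treating `--num_fewshot` as an integer and the `str2bool` helper are assumptions. The point it illustrates is that booleans are passed as the strings "True"/"False", which need explicit conversion.

```python
import argparse


def str2bool(value: str) -> bool:
    # Booleans are passed on the command line as strings such as "True"/"False".
    return value.strip().lower() in {"true", "1", "yes"}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags; not evaluator.py's actual code.
    parser = argparse.ArgumentParser(description="Function-calling evaluation (sketch)")
    parser.add_argument("--model_path", required=True, help="Path to the model folder")
    parser.add_argument("--chat_template", default="chatml")
    parser.add_argument("--num_fewshot", type=int, default=None)  # int type is an assumption
    parser.add_argument("--dataset_path", default="NousResearch/func-calling-eval")
    parser.add_argument("--load_in_4bit", type=str2bool, default=False)
    parser.add_argument("--dpo", type=str2bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--model_path", "/path/to/model", "--load_in_4bit", "True", "--dpo", "False"]
)
print(args.load_in_4bit, args.dpo, args.chat_template)  # True False chatml
```

A plain `type=bool` would not work here: `bool("False")` is `True` because any non-empty string is truthy, so the string has to be interpreted explicitly.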
To run the function-calling evaluation:

python evaluator.py --model_path /path/to/model --chat_template chatml --dataset_path dataset/path --load_in_4bit True --dpo False

The script generates the following outputs:
- function_calling_eval_results.json: A JSON file containing the function-calling evaluation results, including prompts, completions, model outputs, and pass/fail status.
- function_calling_dpo_pairs.json (if --dpo is set to "True"): A JSON file containing the DPO dataset for function calling, consisting of system messages, questions, chosen completions, and rejected completions.
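One way to picture a DPO pair record is sketched below. The four fields correspond to the components listed above, but the field names, the result-dict keys (`reference`, `completion`, `passed`) and the pairing rule are all hypothetical: the sketch assumes that each failed sample yields one pair with the reference completion as "chosen" and the model's completion as "rejected", which may differ from how the evaluator actually selects pairs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DPOPair:
    # Hypothetical field names for the four parts the README lists.
    system: str
    question: str
    chosen: str
    rejected: str


def build_dpo_pairs(results: list) -> list:
    """Assumption: each failed sample yields one pair, with the reference
    completion as 'chosen' and the model's completion as 'rejected'."""
    return [
        DPOPair(r["system"], r["question"], r["reference"], r["completion"])
        for r in results
        if not r["passed"]
    ]


def pairs_to_json(pairs: list) -> str:
    # Serialize as a list of objects, the kind of layout a *_dpo_pairs.json file could hold.
    return json.dumps([asdict(p) for p in pairs], indent=2)


results = [
    {"system": "You are a function-calling AI.", "question": "Weather in Paris?",
     "reference": '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>',
     "completion": "It is sunny.", "passed": False},
    {"system": "You are a function-calling AI.", "question": "Time in Tokyo?",
     "reference": "ref", "completion": "ref", "passed": True},
]
pairs = build_dpo_pairs(results)
print(len(pairs))  # 1
```

Keeping only failed samples means every pair contrasts a correct answer with an incorrect one; passing samples would produce identical "chosen" and "rejected" texts under this rule.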
To run the JSON-mode evaluation:

python evaluator_json_mode.py --model_path /path/to/model --load_in_4bit True --dpo False

The script generates the following outputs:
- json_mode_eval_results.json: A JSON file containing the JSON-mode evaluation results, including prompts, completions, model outputs, and pass/fail status.
- json_mode_dpo_pairs.json (if --dpo is set to "True"): A JSON file containing the DPO dataset for JSON mode, consisting of system messages, questions, chosen completions, and rejected completions.
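A JSON-mode check amounts to "does the completion parse as JSON, and does it conform to the requested schema?". The sketch below is a simplified stand-in, not the evaluator's validator: it supports only a small subset of JSON Schema (a single-string `type`, `properties`, `required`, `items`), and the function names are hypothetical. A full validator such as the `jsonschema` package covers the rest of the specification.

```python
import json

# Subset of JSON Schema type names mapped to Python types produced by json.loads.
TYPE_MAP = {
    "string": str, "integer": int, "number": (int, float), "boolean": bool,
    "object": dict, "array": list, "null": type(None),
}


def validate(instance, schema: dict) -> bool:
    """Tiny subset of JSON Schema: single-string 'type', 'properties', 'required', 'items'."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python; don't let true/false pass as numbers.
        if isinstance(instance, bool) and expected in ("integer", "number"):
            return False
        if not isinstance(instance, TYPE_MAP[expected]):
            return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], sub_schema):
                return False
    if isinstance(instance, list) and "items" in schema:
        return all(validate(item, schema["items"]) for item in instance)
    return True


def check_json_completion(completion: str, schema: dict) -> bool:
    """A sample passes only if the completion parses as JSON and matches the schema."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return False
    return validate(obj, schema)


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
print(check_json_completion('{"name": "Ada", "age": 36}', schema))  # True
```

The explicit `bool` guard matters because `isinstance(True, int)` is `True` in Python, so without it a JSON `true` would wrongly satisfy an `"integer"` field.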