CodeBotler

English to robot code


CodeBotler Overview


[Figure: CodeBotler web interface]

CodeBotler is a system that converts natural language task descriptions into robot-agnostic programs that can be executed by general-purpose service mobile robots. It includes a benchmark (RoboEval) designed for evaluating Large Language Models (LLMs) in the context of code generation for mobile robot service tasks.

This project consists of two key components:

  • CodeBotler: This system provides a web interface for generating general-purpose service mobile robot programs, along with a ROS (Robot Operating System) Action client for deploying those programs on a robot. You can explore CodeBotler's code generation in two ways: as a standalone system without a robot, as illustrated in the figure above, or deployed on a real robot.

  • RoboEval: This benchmark for code generation features a suite of 16 user task descriptions, each with 5 paraphrases of the prompt. It includes a symbolic simulator and a temporal trace evaluator, specifically designed to assess Large Language Models (LLMs) in their ability to generate code for service mobile robot tasks.
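
To give a concrete sense of the programs CodeBotler generates, the sketch below shows what a completion for a simple task (e.g. "Check whether there is a marker in the lab and come back to tell me") might look like. The robot skill names used here (get_current_location, go_to, is_in_room, say) are assumptions based on the project description and may differ from the actual CodeBotler skill set.

# Hypothetical example of a generated, robot-agnostic program.
# Skill names are illustrative and may not match the real CodeBotler API.
start_loc = get_current_location()
go_to("lab")
marker_found = is_in_room("marker")
go_to(start_loc)
if marker_found:
    say("There is a marker in the lab")
else:
    say("There is no marker in the lab")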

Project website: https://amrl.cs.utexas.edu/codebotler

Requirements

We provide a conda environment to run our code. To create and activate the environment:

conda create -n codebotler python=3.10
conda activate codebotler
pip install -r requirements.txt

After setting up the conda environment, go to PyTorch's official website and install the PyTorch build that matches your CUDA version (note: do not install the CPU-only version).
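
For example, a typical install command for a CUDA 11.8 build looks like the line below; this index URL is an assumption for CUDA 11.8, so check pytorch.org for the exact command matching your CUDA version and platform.

pip install torch --index-url https://download.pytorch.org/whl/cu118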

Language Model Options

  • To use an OpenAI model, you will need an OpenAI key, either saved in a file named .openai_api_key, or in the OPENAI_API_KEY environment variable (see the example after this list).
  • To use a PaLM model, you will need a Google Generative API key, either saved in a file named .palm_api_key, or in the PALM_API_KEY environment variable.
  • You can use any pretrained model compatible with the HuggingFace AutoModel interface, including open-source models from the HuggingFace repository such as StarCoder. Note that some models, including StarCoder, require you to agree to the HuggingFace terms of use, and you must be logged in using huggingface-cli login.
  • You can also use a HuggingFace Inference Endpoint.
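
For example, an OpenAI key can be provided in either of the ways listed above. The snippet assumes the key file should contain just the raw key (replace sk-... with your own key); PaLM keys work analogously with .palm_api_key / PALM_API_KEY.

echo "sk-..." > .openai_api_key      # option 1: key file
export OPENAI_API_KEY="sk-..."       # option 2: environment variable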

CodeBotler Deployment Quick-Start Guide

To run the web interface for CodeBotler-Deploy with the default options (OpenAI's gpt-4 model), run:

python3 codebotler.py

This will start the server on localhost:8080. You can then open the interface by navigating to http://localhost:8080/ in your browser.

List of arguments (an example invocation follows the list):

  • --ip: The IP address to host the server on (default is localhost).
  • --port: The port to host the server on (default is 8080).
  • --ws-port: The port to host the websocket server on (default is 8190).
  • --model-type: The type of model to use: openai-chat (default) or openai for OpenAI, palm for PaLM, or automodel for AutoModel.
  • --model-name: The name of the model to use. Recommended options are gpt-4 for GPT-4 (default), text-davinci-003 for GPT-3.5, models/text-bison-001 for PaLM, and bigcode/starcoder for AutoModel.
  • --robot: Flag to indicate if the robot is available (default is False).
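
For example, an invocation that serves the interface on a different port with the open-source StarCoder model might look like this (illustrative; it simply combines flags from the list above):

python3 codebotler.py --port 8888 --model-type automodel --model-name "bigcode/starcoder"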

Instructions for deploying on real robots are included in robot_interface/README.md.

RoboEval Benchmark Quick-Start Guide

The instructions below demonstrate how to run the benchmark using the open-source StarCoder model.

  1. Run code generation for the benchmark tasks using the following command:

    python3 roboeval.py --generate --generate-output completions/starcoder \
        --model-type automodel --model-name "bigcode/starcoder" 

    This will generate programs for the benchmark tasks and save them as a Python file in the output directory completions/starcoder. It assumes the default values for temperature (0.2), top-p (0.9), and num-completions (20), generating 20 programs for each task, which is sufficient for pass@1 evaluation.

    If you would rather not re-run inference, we have included saved output from every model as a zip file in the completions/ directory. You can simply run:

    cd completions
    unzip -d <MODEL_NAME> <MODEL_NAME>.zip

    For example, you can run:

    cd completions
    unzip -d gpt4 gpt4.zip
  2. Evaluate the generated programs using the following command:

    python3 roboeval.py --evaluate --generate-output <Path-To-Program-Completion-Directory> --evaluate-output <Path-To-Evaluation-Result-File-Name>

    For example:

    python3 roboeval.py --evaluate --generate-output completions/gpt4/ --evaluate-output benchmark/evaluations/gpt4

    This will evaluate the generated programs from the previous step and save all the evaluation results in a Python file.

    If you would rather not re-run evaluation, we have included saved evaluation output from every model in the benchmark/evaluations directory.

  3. Finally, you can compute the pass@1 score for every task:

    python3 evaluate_pass1.py --llm codellama --tasks all

    or

    python3 evaluate_pass1.py --llm codellama --tasks CountSavory WeatherPoll
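
    For reference, pass@1 is commonly computed with the unbiased pass@k estimator from the Codex paper: with n completions per task and c of them passing the trace checks, pass@1 reduces to c / n. The snippet below is a minimal sketch of that estimator, not the project's own implementation:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator of pass@k: 1 - C(n - c, k) / C(n, k).
        # n: completions sampled per task, c: completions that pass, k: budget.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # With the default of 20 completions per task:
    print(pass_at_k(n=20, c=13, k=1))  # 0.65, i.e. c / n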