/benchmarks-v0

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

This repo is used to run various AI benchmarks on Open Interpreter.

GAIA and SWE-bench are currently supported.


Setup

  1. Make sure the software used by the command below (git, Python, and Docker) is installed on your computer.
  2. Start Docker so that it is running.

  3. Copy and paste the following into your terminal:

git clone https://github.com/OpenInterpreter/benchmarks.git \
  && cd benchmarks \
  && python -m venv .venv \
  && source .venv/bin/activate \
  && python -m pip install -r requirements.txt \
  && docker build -t worker . \
  && python setup.py
  4. Enter your Hugging Face token.
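
Before running the setup command, it can help to confirm that the executables it calls are available. The following is a minimal sketch, not part of this repo: the tool list is inferred from the setup command itself, and the names REQUIRED_TOOLS and missing_tools are hypothetical.

```python
import shutil

# Executables invoked by the setup command (an inference from that command,
# not an official prerequisite list).
REQUIRED_TOOLS = ("git", "python", "docker")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that the executables exist on PATH. It does not verify that Docker is actually running, so docker build can still fail if the daemon is stopped.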

Running Benchmarks

This section assumes:

  • benchmarks (cloned via git in the previous section) is set as the current working directory.
  • You've activated the virtualenv with the installed prerequisite packages.
  • If using an OpenAI model, your OPENAI_API_KEY environment variable is set with a valid OpenAI API key.
  • If using a Groq model, your GROQ_API_KEY environment variable is set with a valid Groq API key.

Note: To run GAIA, you must first accept the conditions for accessing its files and content on Hugging Face.
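
The API key assumptions above can be checked before starting a run. This is a rough sketch under stated assumptions: the provider labels "openai" and "groq" and the helper has_api_key are hypothetical names introduced here, and only the two environment variable names come from this README.

```python
import os

# Provider label -> environment variable named in the assumptions above.
# The labels themselves are illustrative, not identifiers from this repo.
API_KEY_VARS = {"openai": "OPENAI_API_KEY", "groq": "GROQ_API_KEY"}


def has_api_key(provider, environ=None):
    """Return True if the provider's key variable is set and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(API_KEY_VARS[provider], "").strip())


if __name__ == "__main__":
    for provider, variable in API_KEY_VARS.items():
        status = "set" if has_api_key(provider) else "missing"
        print(f"{variable}: {status}")
```

The check only confirms that the variable is non-empty. Whether the key is valid can only be confirmed by the provider when a request is made.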

Example: gpt-3.5-turbo, first 16 GAIA tasks, 8 Docker containers

This command writes the benchmark results to a file called output.csv.

python run_benchmarks.py \
  --command gpt35turbo \
  --ntasks 16 \
  --nworkers 8
  • --command gpt35turbo: Replace gpt35turbo with any existing key in the commands Dict in commands.py. Defaults to gpt35turbo.
  • --ntasks 16: Runs the first 16 GAIA tasks. Defaults to all 165 GAIA validation tasks.
  • --nworkers 8: Number of Docker containers to run at once. Defaults to the default value of max_workers used when constructing a ThreadPoolExecutor.
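
To see what omitting --nworkers would likely mean on a given machine, the default can be estimated. The sketch below assumes CPython 3.8 through 3.12, where the documented default for max_workers is min(32, os.cpu_count() + 4); Python 3.13 bases the same formula on os.process_cpu_count() instead. The function name default_max_workers is hypothetical.

```python
import os


def default_max_workers(cpu_count=None):
    """Estimate ThreadPoolExecutor's default max_workers (CPython 3.8-3.12 rule)."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is undetermined.
        cpu_count = os.cpu_count() or 1
    return min(32, cpu_count + 4)


if __name__ == "__main__":
    print("Estimated default --nworkers:", default_max_workers())
```

On an 8-core machine this estimate is 12, so leaving --nworkers unset may start more containers at once than the machine can comfortably run. Passing an explicit value keeps the load predictable.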

Troubleshooting

  • ModuleNotFoundError: No module named '_lzma' when running example.
  • ModuleNotFoundError: No module named 'pkg_resources' when running example.
    • Refer to this Stack Overflow post for now.
    • Open Interpreter should probably include setuptools in its list of dependencies, or switch to another module that's in Python's standard library.
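
Both errors above can be detected before a run by checking whether the interpreter in the active virtualenv can locate the modules. This is a diagnostic sketch only, and missing_modules is a hypothetical helper. For context, _lzma is the compiled backend of the standard lzma module and is absent when Python was built without the liblzma development headers, while pkg_resources is distributed with setuptools.

```python
import importlib.util

# The two modules named in the errors above.
MODULES = ("_lzma", "pkg_resources")


def missing_modules(names=MODULES, find_spec=importlib.util.find_spec):
    """Return the top-level module names that the current interpreter cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Cannot import:", ", ".join(missing))
    else:
        print("Both modules are available.")
```

If pkg_resources is reported, installing setuptools into the virtualenv with python -m pip install setuptools is the usual remedy. If _lzma is reported, the Python build itself typically needs to be rebuilt or reinstalled with liblzma available.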