Isa-Eval: Isabelle Proving Agent Evaluation

Isa-Eval is a framework for evaluating Isabelle proving agents. A server is implemented in Scala to interact with underlying Isabelle process, and a client is implemented in Python to communicate with the server through gRpc and expose the API to the agent. A simple best-first search (BFS) is shipped with the server and client, which is described in GPT-f and Thor.

Some of the code is based on the PISA project.

Requirements

1. Isabelle

Isa-Eval is mainly developed for Isabelle2023, the server is known to be compatible with Isabelle2021, Isabelle2021-1, and Isabelle2022. It is the duty of the user to make sure the Isabelle version is compatible with the evaluated theory files.

Optional: If the evaluation benchmark relies on Archive of Formal Proofs (AFP), make sure first to build heap images for AFP theories, for example:

# build with 4 parallel jobs
path/to/isabelle/binary build -b -D path/to/afp/thys -j 4

Note that the building process may take a long time, and it is very memory-consuming (4 jobs may require 32 GB of RAM). The built heap images are usually stored in ~/.isabelle/Isabelle2023/heaps. For more information, please check the manual of Isabelle.

2. Scala

We recommend building and running Isa-Eval with JAVA 11, Scala 2.13.12 and sbt.

# we recommend using sdkman to install and setup Scala
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 11.0.21-tem
sdk install sbt

3. Python

Isa-Eval is developed using Python 3.10.13, but it should be compatible with most Python 3 versions. Requirements can be installed using pip:

pip install -r requirements.txt

Usage

1. Compile

Before compilation, make sure that the port 8980 is available. If not, you can change the port in src/main/scala/isa_eval/IsaEvalServer.scala (and don't forget to change the port in the Python client as well).

object IsaEvalServer extends ServerMain {
  // you can change the port here
  override def port: Int = 8980
  // ...
}

First compile the server with sbt:

# this may take a while when running for the first time
sbt
# then run 'compile' in the sbt shell, i.e.
# sbt:isa-eval> compile
compile

Then generate the Python client code:

# make sure you are in the root directory of the project
cd prise
python -m grpc_tools.protoc -I src/main/protobuf --python_out=src/main/python/isa_eval \
       --pyi_out=src/main/python/isa_eval --grpc_python_out=src/main/python/isa_eval \
       src/main/protobuf/isa_eval.proto

Alternatively, you can just run ./build.sh for the above two steps.

2. Run the server

Then start the server:

sbt run

3. Evaluate

Finally, wrap your agent by inheriting the EvalAgent class and implement the query method (and also overriding the make_input method for the IsaBestFirstSearch class). Then the evaluation can be done by calling the evaluate_isabelle_agent function.

import random

from agent import EvalAgent, EvalAgentOutput
from search import IsaBestFirstSearch
from evaluate import evaluate_isabelle_agent


class SimpleAgent(EvalAgent):
    def query(self, state: str, gen_length: int):
        return [
            EvalAgentOutput("by simp", random.random()) for _ in range(gen_length)
        ]


eval_records, times_dict = evaluate_isabelle_agent(
    isa_path="/path/to/your/Isabelle2023",
    theories_path="/path/to/evaluation/benchmark",
    agent=SimpleAgent(),
    solver=IsaBestFirstSearch(),
    session_roots=None,
    port=8980,
)

The evaluation results can be pretty printed by using pretty_print_eval_summary:

pretty_print_eval_summary(eval_records, times_dict)

# pretty print results look like:
# Solved 15 out of 248 lemmas
# Generated 248 commands, succeeded 15
# Total query count: 248, timeout count: 3
# Average evaluation time (each file): 1.1693 seconds
# Average generated proof length: 1.0000
# Average search time: 0.1433 seconds (query 0.0000 / ITP 0.1427)