wellecks/llmstep

Benchmark results?

Closed this issue · 1 comment

Hey,

Really like the project!
I am really interested in the performance of this LLM vs. the official ReProver.
Did you consider running a LeanDojo benchmark? If so, where can we find the results?

Hi @josojo, the baseline model achieved 51.2% on validation and 50.1% on test on the LeanDojo random split. The numbers vary based on the search algorithm and whether initialization failures are counted. The numbers above used an attempt with beam search of size 32 for up to 100 iterations, followed by an attempt with 32 temperature-0.5 samples for up to 100 iterations, with a 10-minute timeout for the whole search. Using only the beam-search attempt gave 48.8%/47.6% validation/test, which is quite similar to the non-retrieval model from LeanDojo.
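For anyone who wants to approximate this setup before the evaluation code is released, here is a minimal sketch of the two-attempt loop. It is not the actual llmstep or LeanDojo code: `generate_tactics`, `run_tactic`, and the best-first queue are hypothetical placeholders for the model and proof-environment interfaces, and the real search may differ in its scoring and bookkeeping.

```python
import heapq
import time


def generate_tactics(state, num_candidates, decoding, temperature=None):
    """Placeholder: ask the language model for `num_candidates` next-tactic
    suggestions for the goal in `state`, decoded either with beam search
    ("beam") or temperature sampling ("sample").
    Returns (log_prob, tactic) pairs."""
    raise NotImplementedError


def run_tactic(state, tactic):
    """Placeholder: apply `tactic` to the proof state (e.g. via LeanDojo).
    Returns the next state, the string "done" if the proof is finished,
    or None if the tactic fails."""
    raise NotImplementedError


def search_attempt(init_state, num_candidates, decoding, temperature=None,
                   max_iters=100, deadline=None):
    """Best-first search: repeatedly expand the highest-scoring open state
    with model suggestions, for at most `max_iters` iterations."""
    # Priority queue of (negative cumulative log-prob, tiebreak, state).
    queue = [(0.0, 0, init_state)]
    tiebreak = 0
    for _ in range(max_iters):
        if not queue or (deadline is not None and time.time() > deadline):
            return False
        neg_score, _, state = heapq.heappop(queue)
        for log_prob, tactic in generate_tactics(
            state, num_candidates, decoding, temperature
        ):
            next_state = run_tactic(state, tactic)
            if next_state == "done":
                return True  # proof found
            if next_state is not None:
                tiebreak += 1
                heapq.heappush(
                    queue, (neg_score - log_prob, tiebreak, next_state)
                )
    return False


def evaluate_theorem(init_state, timeout_secs=600):
    """Two attempts sharing a single 10-minute budget: first beam search
    with 32 candidates per state, then 32 temperature-0.5 samples."""
    deadline = time.time() + timeout_secs
    if search_attempt(init_state, 32, "beam", deadline=deadline):
        return True
    return search_attempt(init_state, 32, "sample", temperature=0.5,
                          deadline=deadline)
```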

We should have a more precise writeup later this month, along with proof search and evaluation code so that others can evaluate their models using the same setup.