EleutherAI/math-lm

ProofNet Autoformalization (Eval Task)

haileyschoelkopf opened this issue · 7 comments

I'll take this one 👍

Some examples with pythia-1.4b-deduped:

Informalization:

Formalization:

A couple of notes:

  1. We are using randomly sampled examples in the few-shot prompt (standard in the LM harness), instead of a fixed prompt as in the paper. In terms of prompt content, one difference is that each theorem is named exercise_{X}_{Y}_{Z} in our examples, versus a 'more semantic' name in the paper's prompt (e.g. linear_independent_of_is_Ortho).
  2. The harness computes BLEU on sequences tokenized with the Galactica tokenizer rather than on whitespace-tokenized text. Of course, BLEU still carries no guarantee of correlation with correctness, but it may be less noisy than whitespace-level BLEU. We could also take this out. I'll check whether BLEU at least recovers the ordering we expect with increased model scale.
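As a rough illustration of note 1, here is what randomly sampled few-shot prompt construction could look like. `build_fewshot_prompt` and the prompt template are my own stand-ins, not the harness's actual code:

```python
import random

def build_fewshot_prompt(pool, target_statement, k=3, seed=1234):
    """Assemble a k-shot autoformalization prompt from randomly sampled
    (informal, formal) example pairs, instead of one fixed prompt.
    The field labels below are illustrative, not the harness's."""
    rng = random.Random(seed)  # seeded so the prompt is reproducible
    shots = rng.sample(pool, k)
    parts = []
    for informal, formal in shots:
        parts.append(
            f"Natural language version: {informal}\nFormal version: {formal}\n"
        )
    # The target gets the same template, with the formal half left blank
    # for the model to complete.
    parts.append(f"Natural language version: {target_statement}\nFormal version:")
    return "\n".join(parts)
```

For example, with a pool of ten dummy pairs and `k=3`, the prompt contains three completed examples followed by the unanswered target.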
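For note 2, a minimal pure-Python sentence-BLEU sketch over pre-tokenized sequences (uniform 4-gram weights, brevity penalty, add-one smoothing on zero counts — my own simplification; the harness presumably uses a standard BLEU implementation). The relevant point is only that `candidate` and `reference` are subword tokens from the Galactica tokenizer rather than `str.split()` pieces:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and a brevity
    penalty, computed over already-tokenized sequences. Zero n-gram
    overlaps are add-one smoothed so short sequences don't zero out."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            overlap, total = 1, total + 1  # add-one smoothing
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; fully disjoint token sequences score near zero but stay positive because of the smoothing.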

I am currently working on a Lean 4 version of ProofNet. Since Python bindings for Lean 4 now exist, we will be able to evaluate typechecking automatically from our harness fork.
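As a hedged sketch of what automatic typechecking could look like: the actual Python bindings aren't shown here, so this stand-in just shells out to a `lean` binary on PATH and treats exit code 0 as "the formalization typechecks", returning None when Lean isn't installed:

```python
import shutil
import subprocess
import tempfile

def typechecks(lean_src, timeout=60):
    """Return True/False for whether `lean_src` typechecks under a
    `lean` binary found on PATH, or None when no binary is available.
    This is an illustrative stand-in for proper Lean 4 Python bindings,
    not the harness's actual evaluation code."""
    if shutil.which("lean") is None:
        return None  # Lean not installed; can't evaluate
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_src)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True, timeout=timeout)
    return result.returncode == 0
```

Note that typechecking is only a necessary condition: a formalization can typecheck while stating the wrong theorem, which is why manual correctness checking remains.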

Correctness will still have to be checked manually.

For informalization, I added an experimental GPT-based correctness evaluation (wellecks/lm-evaluation-harness#4).
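The linked PR has the actual implementation; purely as an illustration of the general shape of an LLM-judge correctness check, here is a hypothetical prompt builder and verdict parser (the model API call itself is elided, and every name here is my own invention):

```python
def judge_prompt(formal, informal_candidate):
    """Hypothetical grading prompt for an LLM judge: ask whether the
    model-written informal statement matches the formal theorem."""
    return (
        "Formal theorem statement:\n"
        f"{formal}\n\n"
        "Proposed natural-language statement:\n"
        f"{informal_candidate}\n\n"
        "Does the natural-language statement faithfully express the "
        "formal theorem? Answer with exactly 'yes' or 'no'."
    )

def parse_verdict(reply):
    """Map a judge reply to True/False; None when it answered neither."""
    head = reply.strip().lower()
    if head.startswith("yes"):
        return True
    if head.startswith("no"):
        return False
    return None
```

Constraining the judge to a yes/no answer keeps the evaluation automatic, at the cost of inheriting whatever errors the judging model makes.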