ProofNet Autoformalization (Eval Task)
haileyschoelkopf opened this issue · 7 comments
haileyschoelkopf commented
ProofNet Autoformalization (Eval Task)
wellecks commented
I'll take this one 👍
wellecks commented
A couple notes:
- We are using randomly sampled examples in the few-shot prompt (standard in the LM harness), instead of a fixed prompt like in the paper. In terms of the prompt content, one difference is that each theorem is named `exercise_{X}_{Y}_{Z}` in the examples, versus a 'more semantic' name in the paper's prompt (e.g. `linear_independent_of_is_Ortho`).
- It's computing BLEU score on tokenized sequences, using the Galactica tokenizer. Of course, there is still no guarantee of a correlation with correctness, but perhaps it is less noisy than computing BLEU on whitespace-tokenized sequences. We could also take this out. I'll see if BLEU score at least shows the ordering that we expect with increased scale.
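To make the second note concrete, here is a minimal, self-contained sketch of BLEU over token sequences. This is illustrative only: the harness tokenizes with the Galactica tokenizer and uses a standard BLEU implementation, while this sketch works on whatever token lists you pass in and omits the smoothing a real implementation would have.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference token list."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped n-gram matches: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_prec_sum / max_n)
```

The tokenizer choice only changes what `candidate` and `reference` contain: whitespace splitting gives coarse tokens, while a subword tokenizer like Galactica's gives finer-grained sequences, which is the source of the noise difference discussed above.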
zhangir-azerbayev commented
I am currently working on a Lean 4 version of ProofNet. Because there now exist Python bindings for Lean 4, we will be able to automatically evaluate typechecking from our harness fork.
Correctness will still have to be checked manually.
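As a rough sketch of what automated typechecking could look like, here is a plain subprocess-based check rather than the Python bindings mentioned above (the bare `lean` command is an assumption; a real project would likely run `lake env lean` from the project root so mathlib imports resolve):

```python
import subprocess
import tempfile
from pathlib import Path

def typechecks(lean_source: str, cmd=("lean",)) -> bool:
    """Return True if the Lean 4 source compiles without errors.

    `cmd` is assumed to invoke a Lean 4 compiler on a file path; this is a
    sketch, not the Python bindings referred to in the thread.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = f.name
    try:
        result = subprocess.run([*cmd, path], capture_output=True, text=True)
        return result.returncode == 0
    finally:
        Path(path).unlink()
```

This only tells you the formalization elaborates; as noted above, whether it states the right theorem still needs a manual check.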
wellecks commented
For informalization, I added an experimental GPT-based correctness evaluation (wellecks/lm-evaluation-harness#4).
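The general shape of such an LM-graded check is to prompt a grader model with the formal/informal pair and parse a verdict. The prompt wording and helper names below are hypothetical, not the code in the linked PR:

```python
# Hypothetical sketch of an LM-graded correctness check for informalization.
# The prompt template and verdict parsing are assumptions for illustration.

GRADER_TEMPLATE = """Formal statement:
{formal}

Informal statement:
{informal}

Does the informal statement faithfully express the formal statement? Answer Yes or No."""

def build_prompt(formal: str, informal: str) -> str:
    """Fill the grading template with one formal/informal pair."""
    return GRADER_TEMPLATE.format(formal=formal, informal=informal)

def parse_verdict(completion: str) -> bool:
    """Read a Yes/No verdict from the first word of the grader's reply."""
    first = completion.strip().split()[0].lower().rstrip(".,!")
    return first == "yes"
```

A grader model is then called on `build_prompt(...)` and its reply fed to `parse_verdict`; like BLEU, this is a proxy signal rather than a guarantee of correctness.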
zhangir-azerbayev commented
Solved by wellecks/lm-evaluation-harness#4