EleutherAI/math-lm

ProofNet Autoformalization (Eval Task)

haileyschoelkopf opened this issue · 7 comments

I'll take this one 👍

Some examples with pythia-1.4b-deduped:

Informalization:

Formalization:

A couple of notes:

  1. We are using randomly sampled examples in the few-shot prompt (standard in the LM harness), instead of a fixed prompt as in the paper. In terms of prompt content, one difference is that each theorem is named exercise_{X}_{Y}_{Z} in our examples, versus a 'more semantic' name in the paper's prompt (e.g. linear_independent_of_is_Ortho).
  2. The harness computes BLEU on sequences tokenized with the Galactica tokenizer rather than on whitespace-tokenized text. Of course, BLEU still carries no guarantee of correlation with correctness, but it may be less noisy than whitespace-level BLEU. We could also take this out. I'll check whether BLEU at least recovers the ordering we expect with increased model scale.
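As a rough illustration of note 1, here is what randomly sampled few-shot prompt construction could look like. `build_fewshot_prompt` and the prompt template are my own stand-ins, not the harness's actual code:

```python
import random

def build_fewshot_prompt(pool, target_statement, k=3, seed=1234):
    """Assemble a k-shot autoformalization prompt from randomly sampled
    (informal, formal) example pairs, instead of one fixed prompt.
    The field labels below are illustrative, not the harness's."""
    rng = random.Random(seed)  # seeded so the prompt is reproducible
    shots = rng.sample(pool, k)
    parts = []
    for informal, formal in shots:
        parts.append(
            f"Natural language version: {informal}\nFormal version: {formal}\n"
        )
    # The target gets the same template, with the formal half left blank
    # for the model to complete.
    parts.append(f"Natural language version: {target_statement}\nFormal version:")
    return "\n".join(parts)
```

For example, with a pool of ten dummy pairs and `k=3`, the prompt contains three completed examples followed by the unanswered target.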
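For note 2, a minimal pure-Python sentence-BLEU sketch over pre-tokenized sequences (uniform 4-gram weights, brevity penalty, add-one smoothing on zero counts — my own simplification; the harness presumably uses a standard BLEU implementation). The relevant point is only that `candidate` and `reference` are subword tokens from the Galactica tokenizer rather than `str.split()` pieces:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and a brevity
    penalty, computed over already-tokenized sequences. Zero n-gram
    overlaps are add-one smoothed so short sequences don't zero out."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            overlap, total = 1, total + 1  # add-one smoothing
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; fully disjoint token sequences score near zero but stay positive because of the smoothing.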

I am currently working on a Lean 4 version of ProofNet. Since Python bindings for Lean 4 now exist, we will be able to evaluate typechecking automatically from our harness fork.
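As a hedged sketch of what automatic typechecking could look like: the actual Python bindings aren't shown here, so this stand-in just shells out to a `lean` binary on PATH and treats exit code 0 as "the formalization typechecks", returning None when Lean isn't installed:

```python
import shutil
import subprocess
import tempfile

def typechecks(lean_src, timeout=60):
    """Return True/False for whether `lean_src` typechecks under a
    `lean` binary found on PATH, or None when no binary is available.
    This is an illustrative stand-in for proper Lean 4 Python bindings,
    not the harness's actual evaluation code."""
    if shutil.which("lean") is None:
        return None  # Lean not installed; can't evaluate
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_src)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True, timeout=timeout)
    return result.returncode == 0
```

Note that typechecking is only a necessary condition: a formalization can typecheck while stating the wrong theorem, which is why manual correctness checking remains.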

Correctness will still have to be checked manually.

For informalization, I added an experimental GPT-based correctness evaluation (wellecks/lm-evaluation-harness#4).
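The linked PR has the actual implementation; purely as an illustration of the general shape of an LLM-judge correctness check, here is a hypothetical prompt builder and verdict parser (the model API call itself is elided, and every name here is my own invention):

```python
def judge_prompt(formal, informal_candidate):
    """Hypothetical grading prompt for an LLM judge: ask whether the
    model-written informal statement matches the formal theorem."""
    return (
        "Formal theorem statement:\n"
        f"{formal}\n\n"
        "Proposed natural-language statement:\n"
        f"{informal_candidate}\n\n"
        "Does the natural-language statement faithfully express the "
        "formal theorem? Answer with exactly 'yes' or 'no'."
    )

def parse_verdict(reply):
    """Map a judge reply to True/False; None when it answered neither."""
    head = reply.strip().lower()
    if head.startswith("yes"):
        return True
    if head.startswith("no"):
        return False
    return None
```

Constraining the judge to a yes/no answer keeps the evaluation automatic, at the cost of inheriting whatever errors the judging model makes.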