EleutherAI/math-lm

Isabelle github data

Closed this issue · 1 comments

Add Isabelle to the process_github.py script. Some considerations:

  1. github contains multiple copies of the AFP
  2. the PISA evaluation set is built from the AFP

Working on this in https://github.com/EleutherAI/math-lm/tree/isabelle-github

For 1: we keep only mirror-afp-devel and filter other copies of the AFP out

For 2: we remove a theorem and its proof when the theorem name occurs in the universal_test_theorems set of PISA (https://github.com/albertqjiang/Portal-to-ISAbelle/blob/main/universal_test_theorems.tar.gz)

  • specifically, given a document, we check if a theorem name is present using exact string match. We remove the theorem name and all subsequent tokens up to \n\n as a heuristic for where the proof ends. If \n\n isn't present, we remove the remainder of the document.