EleutherAI/math-lm

Filtering GitHub issues and diffs

Closed · 1 comment

Our GitHub source code dataset is based on the deduplicated Stack, filtered down to include only numerical computing, computer algebra, and formal math.

The pilev2 includes GitHub issues and diffs subsets (available at s3://s-eai-neox/data/pilev2/pilev2_local_deduped/GithubDiff_ver2/ and s3://s-eai-neox/data/pilev2/pilev2_local_deduped/GithubIssue_ver2/). There is no good intrinsic way to determine whether an issue or diff meets our filtering criteria. Therefore, we have to compute a table of the repositories that are included in our source code dataset, and filter issues and diffs based on that list of repositories.
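As a rough illustration of the repository-table approach, here is a minimal sketch. It assumes that both the source code records and the issue/diff records are dicts carrying the repository in a field named `repo_name`; that field name, and the helper names `normalize_repo`, `build_repo_table`, and `filter_by_repo`, are hypothetical and not taken from the actual pilev2 schema or from proof-pile-v2/source_code. The real field names should be checked against the data before adapting this.

```python
from typing import Dict, Iterable, Iterator, Set


def normalize_repo(name: str) -> str:
    # GitHub owner/repo names are case-insensitive, so compare in lowercase.
    return name.strip().lower()


def build_repo_table(source_records: Iterable[Dict],
                     repo_key: str = "repo_name") -> Set[str]:
    """Collect the set of repositories present in the source code dataset.

    `repo_key` is an assumed field name, not the confirmed schema.
    """
    return {
        normalize_repo(rec[repo_key])
        for rec in source_records
        if rec.get(repo_key)
    }


def filter_by_repo(records: Iterable[Dict],
                   allowed: Set[str],
                   repo_key: str = "repo_name") -> Iterator[Dict]:
    """Yield only issues/diffs whose repository is in the allowed table.

    Records with no repository field are dropped, since they cannot be
    matched against the table.
    """
    for rec in records:
        name = rec.get(repo_key)
        if name and normalize_repo(name) in allowed:
            yield rec


if __name__ == "__main__":
    # Toy records standing in for the real datasets.
    source = [{"repo_name": "sympy/sympy"}, {"repo_name": "leanprover-community/mathlib"}]
    issues = [
        {"repo_name": "SymPy/sympy", "text": "issue A"},
        {"repo_name": "torvalds/linux", "text": "issue B"},
    ]
    allowed = build_repo_table(source)
    for kept in filter_by_repo(issues, allowed):
        print(kept["repo_name"])
```

Using a set keeps each membership check constant-time, which matters when streaming a large issues or diffs subset past a fixed repository list.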

To get started, study proof-pile-v2/source_code and write the script in a directory called proof-pile-v2/issues_and_diffs.

Completed in PR #18