EleutherAI/math-lm

Data work

Closed this issue · 2 comments

  • The Julia, R, and jupyter-notebook subsets of the source code dataset have been manually inspected fairly comprehensively. The remaining subsets need the same treatment (see the sampling sketch after this list).
  • Need to manually inspect the GitHub issues and diffs dataset.
  • Python tokenization hangs at the same iteration each time I run it. Figure out what's going on; it's probably caused by a single very long file. The temporary workaround is a short circuit that counts every Python file as containing 0 tokens (see the length-guard sketch after this list).
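
For the manual inspection items, one way to pull a random handful of documents from a subset for eyeballing is a reservoir sample over the shard. This is a minimal sketch, assuming the subsets are stored as newline-delimited JSON with a `text` field; the file name and field name are assumptions, not the repo's actual layout.

```python
import json
import random

def sample_documents(path, k=20, seed=0):
    """Reservoir-sample k documents from a JSONL shard for manual review."""
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            if len(sample) < k:
                sample.append(doc)
            else:
                # Replace an existing pick with probability k/(i+1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = doc
    return sample

if __name__ == "__main__":
    # "julia_subset.jsonl" is a hypothetical shard name for illustration.
    for doc in sample_documents("julia_subset.jsonl", k=5):
        print(doc.get("text", "")[:500])
        print("-" * 80)
```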
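
For the tokenization hang, a less lossy alternative to counting every Python file as 0 tokens would be to skip only files above a character threshold. This is a minimal sketch, assuming a Hugging Face tokenizer; the tokenizer name and cutoff are illustrative, not the values used in the repo.

```python
from transformers import AutoTokenizer

MAX_CHARS = 1_000_000  # assumed cutoff for "pathologically long" files

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Count tokens, skipping files long enough to stall the tokenizer."""
    if len(text) > MAX_CHARS:
        # Skip instead of hanging; log so the file can be inspected later.
        print(f"skipping file of {len(text)} chars")
        return 0
    return len(tokenizer(text)["input_ids"])
```

This keeps token counts for the vast majority of Python files while isolating the suspect long files for follow-up.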

Fixing all of these in the data_cleaning branch.

The aforementioned branch was merged.