EleutherAI/math-lm

Data work

Closed this issue · 2 comments

  • The Julia, R, and jupyter-notebook subsets of the source code dataset have been manually inspected fairly comprehensively. The remaining subsets need the same treatment (see the sampling sketch after this list).
  • Need to manually inspect the GitHub issues and diffs dataset.
  • Python tokenization hangs at the same iteration each time I run it. Figure out what's going on; it's probably caused by a single very long file. The temporary workaround is a short circuit that counts every Python file as containing 0 tokens (see the length-guard sketch after this list).
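
For the manual inspection items, one way to pull a random handful of documents from a subset for eyeballing is a reservoir sample over the shard. This is a minimal sketch, assuming the subsets are stored as newline-delimited JSON with a `text` field; the file name and field name are assumptions, not the repo's actual layout.

```python
import json
import random

def sample_documents(path, k=20, seed=0):
    """Reservoir-sample k documents from a JSONL shard for manual review."""
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            if len(sample) < k:
                sample.append(doc)
            else:
                # Replace an existing pick with probability k/(i+1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = doc
    return sample

if __name__ == "__main__":
    # "julia_subset.jsonl" is a hypothetical shard name for illustration.
    for doc in sample_documents("julia_subset.jsonl", k=5):
        print(doc.get("text", "")[:500])
        print("-" * 80)
```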
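
For the tokenization hang, a less lossy alternative to counting every Python file as 0 tokens would be to skip only files above a character threshold. This is a minimal sketch, assuming a Hugging Face tokenizer; the tokenizer name and cutoff are illustrative, not the values used in the repo.

```python
from transformers import AutoTokenizer

MAX_CHARS = 1_000_000  # assumed cutoff for "pathologically long" files

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Count tokens, skipping files long enough to stall the tokenizer."""
    if len(text) > MAX_CHARS:
        # Skip instead of hanging; log so the file can be inspected later.
        print(f"skipping file of {len(text)} chars")
        return 0
    return len(tokenizer(text)["input_ids"])
```

This keeps token counts for the vast majority of Python files while isolating the suspect long files for follow-up.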

Fixing all of these in the data_cleaning branch.

The aforementioned branch was merged.