Data work
Closed this issue · 2 comments
zhangir-azerbayev commented
- The Julia, R, and Jupyter notebook subsets of the source code dataset have been manually inspected fairly comprehensively. We need to do the same for the rest of the subsets.
- Need to manually inspect the GitHub issues and diffs dataset.
- Python tokenization hangs at the same iteration on every attempt; it is probably caused by one very long file. Figure out what's going on. The temporary workaround is a short circuit that counts every Python file as containing 0 tokens (see the sketch after this list).
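
A minimal sketch of a more targeted workaround, assuming a Hugging Face tokenizer; the `MAX_BYTES` cutoff and the `count_tokens` helper are hypothetical illustrations, not code from this repository. It skips and logs only files above a size threshold (the likely culprits), rather than zeroing out every Python file:

```python
import logging
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

MAX_BYTES = 1_000_000  # assumed cutoff; tune once the actual culprit is known

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer, for illustration

def count_tokens(path: str) -> int:
    """Tokenize one file, skipping pathologically long ones instead of
    short-circuiting every Python file to 0 tokens."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        # Log the suspect so the hanging iteration can be traced
        # back to a specific file.
        logging.warning("skipping oversized file: %s (%d bytes)", path, len(raw))
        return 0
    text = raw.decode("utf-8", errors="ignore")
    return len(tokenizer.encode(text))
```

Logging the skipped paths should also confirm whether a single oversized file is what causes the hang at the same iteration each time.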
zhangir-azerbayev commented
Fixing all of these things on the `data_cleaning` branch.
zhangir-azerbayev commented
The aforementioned branch has been merged.