Dataset filter based on code/docs ratio
lvwerra opened this issue · 0 comments
The ratio of code to comments and docstrings in a file can be used as a proxy for code quality. Files with no comments at all, or with comments but no code, may not be very useful for training, and we want to test that hypothesis with some experiments.
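As a starting point for those experiments, here is a minimal sketch of such a filter for Python files. It is only an illustration, not a settled design: the names `docs_ratio` and `keep`, the character-based definition of the ratio, and the thresholds `min_ratio=0.01` and `max_ratio=0.8` are all assumptions to be tuned.

```python
import ast
import io
import tokenize


def docs_ratio(source: str) -> float:
    """Fraction of characters in `source` that are comments or docstrings.

    Hypothetical helper: files that cannot be tokenized or parsed get 0.0.
    """
    if not source.strip():
        return 0.0

    doc_chars = 0

    # Count characters in `#` comments.
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                doc_chars += len(tok.string)
    except (tokenize.TokenError, IndentationError, SyntaxError):
        return 0.0

    # Count characters in module, class, and function docstrings.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    doc_nodes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, doc_nodes):
            doc = ast.get_docstring(node, clean=False)
            if doc:
                doc_chars += len(doc)

    return min(1.0, doc_chars / len(source))


def keep(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a file only if its docs ratio is within the (assumed) bounds."""
    return min_ratio <= docs_ratio(source) <= max_ratio
```

With these assumed bounds, a file with no comments or docstrings falls below `min_ratio` and a file that is almost entirely comments exceeds `max_ratio`, so both are dropped. Unparseable files also score 0.0 and are dropped, which may or may not be what we want.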