bigcode-project/bigcode-dataset

Dataset filter based on code/docs ratio

lvwerra opened this issue · 0 comments

The ratio of code to docstrings in a document can be used as a proxy for code quality. Code with no comments at all, or comments without any code, may not be very useful for training, and we want to test that hypothesis with some experiments.
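
As a starting point for discussion, here is a minimal sketch of how such a ratio could be computed for Python files. It is not an agreed-upon implementation: the function names `comment_to_code_ratio` and `keep_example`, the character-based definition of the ratio, and the `min_ratio`/`max_ratio` thresholds are all assumptions made for illustration, and the actual cutoffs would have to come out of the experiments.

```python
import ast
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Fraction of characters in `source` that are comments or docstrings.

    Hypothetical metric: a character-based ratio is only one possible
    definition; line- or token-based ratios are equally plausible.
    """
    if not source.strip():
        return 0.0

    doc_chars = 0

    # Count '#' comments using the standard-library tokenizer.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            doc_chars += len(tok.string)

    # Count docstrings of the module, classes, and functions using the AST.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(
            node,
            (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef),
        ):
            docstring = ast.get_docstring(node, clean=False)
            if docstring:
                doc_chars += len(docstring)

    return doc_chars / len(source)


def keep_example(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a file only if its ratio falls inside [min_ratio, max_ratio].

    The default thresholds are placeholders, not tuned values.
    """
    try:
        ratio = comment_to_code_ratio(source)
    except (SyntaxError, ValueError, tokenize.TokenError):
        # Files that cannot be parsed are dropped in this sketch.
        return False
    return min_ratio <= ratio <= max_ratio
```

With these placeholder thresholds, a file containing only code and a file containing only comments would both be filtered out, while a function with a docstring and an inline comment would be kept. Other languages would need their own comment extraction, since this sketch relies on Python's `tokenize` and `ast` modules.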