bigcode-project/bigcode-dataset

Some file extensions excluded from the published dataset (Racket)

flobbit1 opened this issue · 0 comments

programming-languages-to-file-extensions.json correctly lists 'rkt', the most common file extension for Racket. However, the Racket data subset at https://huggingface.co/datasets/bigcode/the-stack/tree/main/data/racket contains zero files with this extension, and the paper at https://arxiv.org/abs/2305.06161 specifically mentions rkt as an excluded extension. This likely excludes the majority of actual Racket files found on GitHub.
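
For anyone who wants to reproduce the check, here is a minimal sketch of how the extensions in a subset could be tallied. It assumes you have already collected the file paths from the Racket subset into a list; the function name `count_extensions` and the sample paths are hypothetical and are not part of the bigcode tooling or the dataset.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(paths):
    """Tally file extensions (lowercased, without the leading dot).

    `paths` is any iterable of repository-relative file paths.
    Files without an extension are counted under the empty string.
    """
    counts = Counter()
    for path in paths:
        ext = PurePosixPath(path).suffix.lstrip(".").lower()
        counts[ext] += 1
    return counts


# Hypothetical sample paths, for illustration only (not real dataset rows).
sample_paths = ["src/main.rkt", "lib/util.scm", "tests/core.RKT", "Makefile"]
counts = count_extensions(sample_paths)
print(counts["rkt"])
```

Running the same tally over the paths in the published subset and looking up `counts["rkt"]` would show whether any .rkt files are present.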