Issues
- 0
During your processing, have you ever encountered the need to extract part of the code? How was it handled?
#68 opened by cistinej - 0
- 0
Baidu Cloud link to the cleaned database?
#62 opened by willshion - 0
- 0
HuggingFace Need Data Access Approval
#54 opened by heoun - 0
- 1
Question: File Counts and Dataset Size
#44 opened by darien-schettler - 3
Deduplication also removes data < ngram_size
#35 opened by cceyda - 0
Build StackOverflow datasets
#13 opened by lvwerra - 0
Create text-code pairs from Jupyter Notebooks
#33 opened by loubnabnl - 1
Define filters for git commits
#32 opened by lvwerra - 1
Define filters for cleaning GitHub issues
#31 opened by lvwerra - 5
Run language detection on GitHub issues
#30 opened by lvwerra - 0
NER models for PII
#28 opened by loubnabnl - 0
Refactor PII Code
#27 opened by loubnabnl - 0
- 0
Build dataset index
#15 opened by lvwerra - 0
Create dataset with GitHub metadata
#12 opened by lvwerra - 7
Suggest datasets for Code Dataset Catalogue
#3 opened by lvwerra - 20
Which languages to include?
#2 opened by lvwerra - 3
Parse code dataset into AST
#6 opened by harm-devries - 0
Create dataset with git commits
#19 opened by lvwerra - 0
Convert Jupyter Notebooks to scripts
#34 opened by loubnabnl - 1
Include code review data.
#43 opened by dynamicwebpaige - 1
Dataset filtering based on content
#14 opened by lvwerra - 0
- 0
Dataset filters based on tokenizer or perplexity
#21 opened by lvwerra - 0
Dataset filter based on code/docs ratio
#20 opened by lvwerra - 5
- 0
Remove opt-out accounts
#11 opened by lvwerra - 10
TF-PII Redaction regexes
#17 opened by loubnabnl - 2
TF-PII Redaction Benchmark
#18 opened by loubnabnl - 12
Redacting PII from code datasets
#1 opened by harm-devries - 0
- 0
Suggest datasets for Code Dataset Catalogue
#4 opened by lvwerra - 1
Include GPL licenses?
#9 opened by harm-devries - 7
- 4
EDA on full dataset
#5 opened by lvwerra