bigcode-project/bigcode-dataset

Jupyter Notebook · Apache-2.0

Issues

During your processing, have you ever encountered the need to extract part of the code? How was it handled?
#68 opened 7 months ago by cistinej
0
Most CMake files missed when categorizing by extension
#65 opened a year ago by markdewing
0
Baidu Cloud link: cloud cleaned database?
#62 opened a year ago by willshion
0
When I do pii_inference, cannot load bigcode/bigcode-encoder-pii-ner-v2
#59 opened a year ago by RuochenLowes
0
Some file extensions excluded from the published dataset (Racket)
#55 opened a year ago by flobbit1
0
HuggingFace Need Data Access Approval
#54 opened a year ago by heoun
0
From GH Archive to bigcode/the-stack-github-issues
#53 opened a year ago by yunzheng-r
0
Question: File Counts and Dataset Size
#44 opened 2 years ago by darien-schettler
1
Deduplication also removes data < ngram_size
#35 opened 2 years ago by cceyda
3
Build StackerFlow datasets
#13 opened 2 years ago by lvwerra
0
Create text-code pairs from Jupyter Notebooks
#33 opened 2 years ago by loubnabnl
0
Define filters for git commits
#32 opened 2 years ago by lvwerra
1
Define filters for cleaning GitHub issues
#31 opened 2 years ago by lvwerra
1
Run language detection GitHub issues
#30 opened 2 years ago by lvwerra
5
NER models for PII
#28 opened 2 years ago by loubnabnl
0
Refactor PII Code
#27 opened 2 years ago by loubnabnl
0
Decontaminate pretraining dataset from evaluation benchmarks
#16 opened 2 years ago by lvwerra
0
Build dataset index
#15 opened 2 years ago by lvwerra
0
Create dataset with GitHub metadata
#12 opened 2 years ago by lvwerra
0
Suggest datasets for Code Dataset Catalogue
#3 opened 2 years ago by lvwerra
7
Which languages to include?
#2 opened 2 years ago by lvwerra
20
Parse code dataset into AST
#6 opened 2 years ago by harm-devries
3
Create dataset with git commits
#19 opened 2 years ago by lvwerra
0
Convert Jupyter Notebooks to scripts
#34 opened 2 years ago by loubnabnl
0
Include code review data.
#43 opened 2 years ago by dynamicwebpaige
1
Dataset filtering based on content
#14 opened 2 years ago by lvwerra
1
Dataset filtering with additional near-deduplication
#22 opened 2 years ago by lvwerra
0
Dataset filters based on tokenizer or perplexity
#21 opened 2 years ago by lvwerra
0
Dataset filter based on code/docs ratio
#20 opened 2 years ago by lvwerra
0
TF-Update The Stack with new languages and licenses
#10 opened 2 years ago by lvwerra
5
Remove opt-out accounts
#11 opened 2 years ago by lvwerra
0
TF-PII Redaction regexes
#17 opened 2 years ago by loubnabnl
10
TF-PII Redaction Benchmark
#18 opened 2 years ago by loubnabnl
2
Redacting PII from code datasets
#1 opened 2 years ago by harm-devries
12
Decontaminate evaluation benchmarks from pretraining dataset
#7 opened 2 years ago by lvwerra
0
Suggest datasets for Code Dataset Catalogue
#4 opened 2 years ago by lvwerra
0
Include GPL licenses?
#9 opened 2 years ago by harm-devries
1
Create near-dedup dataset for all programming languages
#8 opened 2 years ago by harm-devries
7
EDA on full dataset
#5 opened 2 years ago by lvwerra
4