Instructions are for macOS
pip install pydriller
pip install gensim
pip install bytecode
pip install bandit
pip install gitpython
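After installing, it can help to confirm that every package is importable before starting the pipeline. The following is a minimal sketch, not part of this repository: `REQUIRED` and `missing_modules` are hypothetical names, and the mapping assumes the usual import names for these packages (note that gitpython is imported as `git`).

```python
import importlib.util

# pip package name -> module name used in `import` statements.
# gitpython is the one package here whose import name differs.
REQUIRED = {
    "pydriller": "pydriller",
    "gensim": "gensim",
    "bytecode": "bytecode",
    "bandit": "bandit",
    "gitpython": "git",
}


def missing_modules(required):
    """Return the pip names whose module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Anything reported as missing can be installed with the matching `pip install` command above.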
- Get list of popular repositories
python get_popular_repos.py
- Fetch issue labels for all repositories
python get_issue_labels.py
- Fetch closed PRs with specified keywords
python get_matching_prs.py
- Fetch closed PRs with specified issue labels
python get_issue_match_prs.py
- Combine the per-keyword PR information into a single pickle file
python combine_pr_info.py
- Fetch diffs corresponding to each PR
python fetch_diffs.py
- Download the relevant files and build the dataset from the diffs
python generate_data.py
- Save files to disk, to be used with soft-label generation
python retrieve_and_dump.py
- Generate soft-labels for each file
bandit --configfile bandit.yaml -f json -ii -o medium_filedump.json -r filedump
- Process labels from bandit and save to disk
python gen_with_soft_labels.py
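For orientation, the Bandit report written by the command above (`medium_filedump.json`, where `-ii` restricts output to findings of medium confidence or higher) can be grouped per file roughly as follows. This is a hedged sketch, not the logic of `gen_with_soft_labels.py`, which may differ: `issues_per_file` and `load_report` are hypothetical helpers, and the code assumes Bandit's JSON layout with a top-level `results` list whose entries carry `filename`, `test_id`, `issue_severity`, `issue_confidence` and `line_number`.

```python
import json
from collections import defaultdict


def issues_per_file(report):
    """Group Bandit findings by the file they were reported in."""
    grouped = defaultdict(list)
    for result in report.get("results", []):
        grouped[result["filename"]].append(
            {
                "test_id": result["test_id"],
                "severity": result["issue_severity"],
                "confidence": result["issue_confidence"],
                "line": result["line_number"],
            }
        )
    return dict(grouped)


def load_report(path="medium_filedump.json"):
    """Read the JSON report produced by the bandit command above."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `issues_per_file(load_report())` yields a dictionary keyed by file path; files without findings do not appear in it, so a labelling step would need to treat absent files as having no reported issues.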
- Collect repository data
python w2v_pythoncorpus.py
- Clean corpus
python w2v_cleancorpus.py
- Finetune CodeBERT LM
python lm_trainmodel.py
- Doubled the dataset size (number of repos) for the language model, drawing on additional sources: GitHub Trending, GitMostWanted
- Explore Transformer-based language models: training from scratch, as well as the existing CodeBERT (fine-tuned and raw)
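The dataset-generation scripts listed earlier run in a fixed order, so they can be chained from a single driver. The sketch below is one way to do that, under the assumptions that each script takes no arguments (as in the commands above) and is run from the directory that contains it. `DATASET_STEPS`, `build_commands` and `run_pipeline` are hypothetical names; the list covers only the scripts that precede the Bandit step.

```python
import subprocess
import sys

# Dataset-generation scripts, in the order given in the steps above.
DATASET_STEPS = [
    "get_popular_repos.py",
    "get_issue_labels.py",
    "get_matching_prs.py",
    "get_issue_match_prs.py",
    "combine_pr_info.py",
    "fetch_diffs.py",
    "generate_data.py",
    "retrieve_and_dump.py",
]


def build_commands(scripts, python=sys.executable):
    """Turn script names into argv lists for the given interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit code, so later steps never see incomplete input.
    """
    for command in commands:
        print("Running:", " ".join(command))
        runner(command, check=True)
```

Running `run_pipeline(build_commands(DATASET_STEPS))` executes the eight scripts in sequence and stops at the first failure. The `runner` parameter exists only so the order can be checked without launching any process.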