/miots_project

Mobile & IoT Security: Course Project

Setting up environment for feature extraction

These instructions are for macOS.

Install the required Python packages:

pip install pydriller
pip install gensim
pip install bytecode
pip install bandit
pip install gitpython
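
Before running the pipeline it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical, and the only project-specific knowledge it encodes is that `gitpython` is imported under the module name `git`.

```python
import importlib.util

# pip package name -> importable module name.
# Note: gitpython installs a module called "git".
REQUIRED = {
    "pydriller": "pydriller",
    "gensim": "gensim",
    "bytecode": "bytecode",
    "bandit": "bandit",
    "gitpython": "git",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

Calling `missing_packages()` returns an empty list when everything is installed; otherwise it lists the pip names that still need `pip install`.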

Preparing Data

Preparing issue-based dataset

  1. Get the list of popular repositories

python get_popular_repos.py

  2. Fetch issue labels for all repositories

python get_issue_labels.py

  3. Fetch closed PRs with specified keywords

python get_matching_prs.py

  4. Fetch closed PRs with specified issue labels

python get_issue_match_prs.py

  5. Combine the per-keyword PR information into a single pickle file

python combine_pr_info.py

  6. Fetch diffs corresponding to each PR

python fetch_diffs.py

  7. Download relevant files, retrieve dataset using diffs

python generate_data.py

  8. Save files to disk, to be used with soft-label generation

python retrieve_and_dump.py

  9. Generate soft-labels for each file

bandit --configfile bandit.yaml -f json -ii -o medium_filedump.json -r filedump

  10. Process labels from bandit and save to disk

python gen_with_soft_labels.py
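
To illustrate what the last two steps involve: the bandit command writes a JSON report (with `-ii`, limited to findings of medium or higher confidence) whose `results` list holds one entry per finding, including `filename` and `issue_severity`. The sketch below shows one way such a report could be reduced to a per-file soft label. It is an assumption for illustration only and is not the logic of `gen_with_soft_labels.py`; the severity weights, the squashing formula, and the function names are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical weights per bandit severity level (an assumption,
# not taken from the project).
SEVERITY_WEIGHT = {"LOW": 1.0, "MEDIUM": 2.0, "HIGH": 3.0}


def load_report(path):
    """Load a bandit JSON report, e.g. medium_filedump.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def soft_labels_from_report(report):
    """Map each flagged file to a score in (0, 1).

    Findings are summed by severity weight per file, then squashed
    with total / (1 + total) so more findings approach 1.
    """
    totals = defaultdict(float)
    for issue in report.get("results", []):
        severity = str(issue.get("issue_severity", "")).upper()
        totals[issue["filename"]] += SEVERITY_WEIGHT.get(severity, 0.0)
    return {name: total / (1.0 + total) for name, total in totals.items()}
```

Under these assumptions, `soft_labels_from_report(load_report("medium_filedump.json"))` yields a dictionary keyed by file path. Files that bandit did not flag are absent from the report and would need a default label of 0.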

Training LM for language-based analysis

  1. Collect repository data

python w2v_pythoncorpus.py

  2. Clean corpus

python w2v_cleancorpus.py

  3. Fine-tune the CodeBERT LM

python lm_trainmodel.py
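
Because each step depends on the output of the previous one, the scripts can also be chained so that a failure stops the run early. This is a minimal sketch under the assumption that the scripts take no arguments and are run from the repository root; `run_steps` and `LM_STEPS` are hypothetical names, not part of the project.

```python
import subprocess
import sys

# Script names as listed in the steps above, in execution order.
LM_STEPS = ["w2v_pythoncorpus.py", "w2v_cleancorpus.py", "lm_trainmodel.py"]


def run_steps(scripts, runner=subprocess.run):
    """Run each script with the current interpreter, in order.

    Stops at the first script that exits with a non-zero code and
    returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            break
        completed.append(script)
    return completed
```

Calling `run_steps(LM_STEPS)` runs the three scripts in sequence; comparing the returned list with `LM_STEPS` shows where the run stopped. The same helper could be given the script names from the issue-based dataset steps, although the bandit invocation there is a separate command and would still need to be run by hand.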

Updates in the pipeline

  1. Doubled the dataset size (number of repos) for the language model, drawing on several sources: GitHub Trending, GitMostWanted
  2. Explore Transformer-based language models: training from scratch, as well as the existing CodeBERT (fine-tuned and raw)