PathMiner: usage

This is the git repository for https://zenodo.org/record/2595257

An example of using PathMiner's output, written in PyTorch.

Getting started

To train a toy model on your local machine, follow the instructions below.

Set up the Python environment

Conda users:

conda env create -f conda_environment.yml

Pip users:

python3 -m virtualenv env
source env/bin/activate
pip3 install -r requirements.txt

Load data

Two projects are loaded as one of the steps of the run_example.py script. Their total size is about 200 MB. To use custom projects, put them in the data/ folder as project1 and project2.

Run the example

To run the example, execute the run_example.py script. Projects are loaded unless the data/project1 and data/project2 folders are already present. PathMiner's processing takes around 1.5-2 minutes.
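
The skip-if-present behaviour described above can be sketched as follows. This is a minimal illustration, not the code from run_example.py: the helper name `projects_present` and its parameters are hypothetical.

```python
from pathlib import Path


def projects_present(data_dir, names=("project1", "project2")):
    """Return True only if every expected project folder exists under data_dir.

    Hypothetical helper: the real script may perform this check differently.
    """
    root = Path(data_dir)
    return all((root / name).is_dir() for name in names)
```

Calling `projects_present("data")` from the repository root would return True only when both folders exist, which is the case where loading can be skipped.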

When loading and processing are completed, a model is trained for 10 epochs. Training takes approximately a minute.

Successful output should look like this:

Loading generated data
Labeling contexts
Creating datasets
Start training
Epoch #1
After 20 batches: average loss 0.6637925207614899
After 40 batches: average loss 0.6571866393089294
After 60 batches: average loss 0.676651856303215
...
Epoch #10
After 20 batches: average loss 0.023688357206992805
After 40 batches: average loss 0.013494743884075433
...
After 160 batches: average loss 0.008080246101599187
After 180 batches: average loss 0.009434921143110842
accuracy: 0.988, precision: 0.983, recall: 0.985
Training completed
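
The last metrics line reports accuracy, precision and recall. The source does not say how they are computed. The sketch below assumes the standard binary-classification definitions, with one of the two projects treated as the positive class (label 1). The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(predicted, actual):
    """Accuracy, precision and recall for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    correct = sum(1 for p, a in pairs if p == a)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```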

Useful modules

Data processing

The data_processing package contains several classes that load PathMiner's generated output into an easy-to-integrate format.

  • data_processing.PathMinerLoader loads all data generated by PathMiner into pandas.DataFrame and pandas.Series objects. It doesn't depend on any ML framework and can be used with whichever framework you like.
  • data_processing.UtilityEntities contains wrappers around AST paths, nodes and contexts. For now they only store data and support pretty printing, but their functionality can be extended.
  • data_processing.PathMinerDataset is an extension of torch.utils.data.Dataset. It feeds data from PathMinerLoader to PyTorch models. To use PathMiner's output with other ML frameworks, you can write a similar class that transforms the contents of PathMinerLoader into, e.g., TensorFlow tensors or ndarrays.
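
As a rough illustration of the framework-independent route mentioned above, the sketch below pads per-file path-contexts into fixed-size NumPy arrays. It does not use PathMinerLoader's real interface: the input layout (a list of files, each a list of `(start_token_id, path_id, end_token_id)` triples), the function name `contexts_to_arrays`, and the use of id 0 for padding are all assumptions made for the example.

```python
import numpy as np


def contexts_to_arrays(files, max_contexts, pad_id=0):
    """Pad or truncate each file's path-contexts into fixed-size int arrays.

    `files` is a list of files; each file is a list of
    (start_token_id, path_id, end_token_id) triples. This layout is an
    assumption for illustration, not PathMinerLoader's actual output format.
    """
    n = len(files)
    starts = np.full((n, max_contexts), pad_id, dtype=np.int64)
    paths = np.full((n, max_contexts), pad_id, dtype=np.int64)
    ends = np.full((n, max_contexts), pad_id, dtype=np.int64)
    mask = np.zeros((n, max_contexts), dtype=bool)

    for i, contexts in enumerate(files):
        kept = contexts[:max_contexts]  # truncate files with too many contexts
        if not kept:
            continue  # a file without contexts stays fully padded
        triples = np.asarray(kept, dtype=np.int64)  # shape (k, 3)
        k = len(kept)
        starts[i, :k] = triples[:, 0]
        paths[i, :k] = triples[:, 1]
        ends[i, :k] = triples[:, 2]
        mask[i, :k] = True  # marks real (non-padding) positions
    return starts, paths, ends, mask
```

The returned arrays can be handed to any framework that accepts ndarrays. The mask lets a model ignore padded positions, which matters only if real ids never collide with `pad_id`.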

Model

model.CodeVectorizer contains a model that vectorizes snippets of code based on their path-context representation. This model works similarly to the part of code2vec that is responsible for code vectorization. It is implemented as a PyTorch module and can be easily reused.
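
For readers unfamiliar with code2vec, the following is a minimal sketch of that style of vectorization: embed each path-context, combine the three embeddings, and pool them with learned attention. It is not the code of model.CodeVectorizer. The class name `CodeVectorizerSketch`, the constructor parameters, and the convention that index 0 is padding are assumptions.

```python
import torch
from torch import nn


class CodeVectorizerSketch(nn.Module):
    """code2vec-style encoder: embed path-contexts, then attention-pool them."""

    def __init__(self, n_tokens, n_paths, dim):
        super().__init__()
        # Index 0 is reserved for padding in both vocabularies (an assumption).
        self.token_emb = nn.Embedding(n_tokens, dim, padding_idx=0)
        self.path_emb = nn.Embedding(n_paths, dim, padding_idx=0)
        self.combine = nn.Linear(3 * dim, dim, bias=False)
        self.attention = nn.Parameter(torch.randn(dim))

    def forward(self, starts, paths, ends):
        # Each input is a LongTensor of shape (batch, max_contexts).
        ctx = torch.cat(
            [self.token_emb(starts), self.path_emb(paths), self.token_emb(ends)],
            dim=-1,
        )
        ctx = torch.tanh(self.combine(ctx))  # (batch, max_contexts, dim)
        scores = ctx @ self.attention  # (batch, max_contexts)
        # Padded positions get zero attention weight after the softmax.
        scores = scores.masked_fill(paths == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)
        return (weights.unsqueeze(-1) * ctx).sum(dim=1)  # (batch, dim)
```

Each sample needs at least one non-padding context. A row consisting only of padding would make the softmax produce NaN values.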

A usage example can be found in model.ProjectClassifier. It is a linear classifier that decides which project a file comes from, based on the file's vectorization.
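
A linear classifier over file vectors can be sketched as below. This is an illustration under stated assumptions, not the code of model.ProjectClassifier: the class name `ProjectClassifierSketch`, the single-logit output, the binary cross-entropy loss, and the random stand-in vectors are all hypothetical choices.

```python
import torch
from torch import nn


class ProjectClassifierSketch(nn.Module):
    """Binary linear classifier over precomputed file vectors."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, file_vectors):
        # One raw logit per file; sigmoid(logit) is the probability of label 1.
        return self.linear(file_vectors).squeeze(-1)


torch.manual_seed(0)
vectors = torch.randn(4, 8)  # random stand-in for vectorizer output
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])  # 0 = project1, 1 = project2

model = ProjectClassifierSketch(dim=8)
loss = nn.BCEWithLogitsLoss()(model(vectors), labels)
loss.backward()  # computes gradients for the linear layer's parameters
```

In a full pipeline the random vectors would be replaced by the vectorizer's output, so that the loss trains both modules together.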