This repository contains the code for https://zenodo.org/record/2595257
It is an example of using PathMiner's output, written in PyTorch.
To train a toy model on your local machine, follow the instructions below.
Conda users:
conda env create -f conda_environment.yml
Pip users:
python3 -m virtualenv env
source env/bin/activate
pip3 install -r requirements.txt
Two projects are loaded as one of the steps of the run_example.py script. Their total size is about 200 MB.
If you want to use custom projects, put them in the data/ folder as project1 and project2.
To run the example, execute the run_example.py script.
Projects are loaded unless data/project1 and data/project2 folders are already present.
PathMiner's processing takes around 1.5-2 minutes.
When loading and processing are completed, a model is trained for 10 epochs. Training takes approximately a minute.
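The "load only if missing" rule described above can be pictured with a small check like the one below. This is a minimal sketch, not the code used in run_example.py: the function name `missing_projects` is hypothetical, and only the data/project1 and data/project2 folder names come from this README.

```python
from pathlib import Path


def missing_projects(data_dir, names=("project1", "project2")):
    """Return the project folder names that are not yet present under data_dir.

    Hypothetical helper: the real script may perform this check differently.
    """
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    todo = missing_projects("data")
    if todo:
        print("Projects that would still need to be loaded:", ", ".join(todo))
    else:
        print("Both project folders are present, loading can be skipped")
```

If you place your own projects in data/project1 and data/project2 beforehand, a check of this kind returns an empty list, which corresponds to the case where the script does not load anything.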
Successful output should look like this:
Loading generated data
Labeling contexts
Creating datasets
Start training
Epoch #1
After 20 batches: average loss 0.6637925207614899
After 40 batches: average loss 0.6571866393089294
After 60 batches: average loss 0.676651856303215
...
Epoch #10
After 20 batches: average loss 0.023688357206992805
After 40 batches: average loss 0.013494743884075433
...
After 160 batches: average loss 0.008080246101599187
After 180 batches: average loss 0.009434921143110842
accuracy: 0.988, precision: 0.983, recall: 0.985
Training completed
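The last metrics line reports accuracy, precision and recall. As a reminder of how these numbers are defined for a two-class problem, here is a minimal sketch in plain Python. It assumes that one project is treated as the positive class (label 1) and the other as the negative class (label 0); the function name `binary_metrics` is hypothetical and this is not necessarily how the example script computes its metrics.

```python
def binary_metrics(y_true, y_pred):
    """Compute (accuracy, precision, recall) for 0/1 labels.

    Assumption: label 1 is the "positive" project, label 0 the other one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

For example, with true labels [1, 1, 1, 0] and predictions [1, 1, 0, 0] there are two true positives, one true negative, and one false negative, so accuracy is 0.75, precision is 1.0, and recall is 2/3.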
In the data_processing package you can find several classes that load PathMiner's generated output into an easy-to-integrate format.
data_processing.PathMinerLoader loads all data generated by PathMiner into pandas.DataFrame and pandas.Series objects. It doesn't depend on ML frameworks and can be used with whichever framework you like.
data_processing.UtilityEntities contains wrappers around AST paths, nodes and contexts. For now they are only capable of storing data and pretty printing, but the functionality can be extended.
data_processing.PathMinerDataset is an extension of PyTorch's torch.utils.data.Dataset. It feeds data from PathMinerLoader to PyTorch models. To use PathMiner's output with other ML frameworks, you can write a similar class that transforms the contents of PathMinerLoader into, e.g., TensorFlow tensors or ndarrays.
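To illustrate what such a framework-specific adapter might do, here is a minimal sketch that turns per-file path contexts into fixed-size NumPy arrays. It does not use the actual PathMinerLoader API: the input format (a list of files, each a list of `(start_token_id, path_id, end_token_id)` triples), the function name `contexts_to_arrays`, and the choice of 0 as the padding id are all assumptions made for the example.

```python
import numpy as np


def contexts_to_arrays(files, max_contexts, pad_id=0):
    """Pad or truncate each file's path contexts to max_contexts entries.

    files: list of files; each file is a list of
           (start_token_id, path_id, end_token_id) integer triples.
           This layout is hypothetical, not PathMinerLoader's real format.
    Returns four arrays of shape (len(files), max_contexts):
    start token ids, path ids, end token ids, and a boolean mask that is
    True where a real (non-padding) context is stored.
    """
    n = len(files)
    starts = np.full((n, max_contexts), pad_id, dtype=np.int64)
    paths = np.full((n, max_contexts), pad_id, dtype=np.int64)
    ends = np.full((n, max_contexts), pad_id, dtype=np.int64)
    mask = np.zeros((n, max_contexts), dtype=bool)

    for i, contexts in enumerate(files):
        # Contexts beyond max_contexts are dropped.
        for j, (start, path, end) in enumerate(contexts[:max_contexts]):
            starts[i, j] = start
            paths[i, j] = path
            ends[i, j] = end
            mask[i, j] = True
    return starts, paths, ends, mask
```

Arrays of this shape can be passed to most frameworks directly or after a cheap conversion, which is why padding to a fixed number of contexts per file is a common choice for this kind of adapter.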
model.CodeVectorizer contains a model to vectorize snippets of code based on their path-context representation.
This model works similarly to the part of code2vec that is responsible for code vectorization.
It is implemented as a PyTorch module and can be easily reused.
A usage example can be found in model.ProjectClassifier.
It is a linear classifier that decides which project a file comes from, based on the file's vectorization.
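For readers unfamiliar with code2vec, the general idea of vectorizing a file from its path contexts and classifying the result can be sketched as follows. This is not the repository's model.CodeVectorizer or model.ProjectClassifier: the class names `ToyCodeVectorizer` and `ToyProjectClassifier`, the constructor parameters, and the convention that id 0 marks padding are assumptions made for illustration. The sketch follows the code2vec recipe of embedding each (start token, path, end token) triple, combining the three embeddings, and taking an attention-weighted sum over the contexts.

```python
import torch
import torch.nn as nn


class ToyCodeVectorizer(nn.Module):
    """Attention-weighted sum of path-context embeddings (code2vec-style sketch)."""

    def __init__(self, n_tokens, n_paths, dim):
        super().__init__()
        # Assumption: id 0 is reserved for padding in both vocabularies.
        self.token_emb = nn.Embedding(n_tokens, dim, padding_idx=0)
        self.path_emb = nn.Embedding(n_paths, dim, padding_idx=0)
        self.combine = nn.Linear(3 * dim, dim)
        self.attention = nn.Linear(dim, 1, bias=False)

    def forward(self, starts, paths, ends):
        # Inputs: integer tensors of shape (batch, contexts).
        ctx = torch.cat(
            [self.token_emb(starts), self.path_emb(paths), self.token_emb(ends)],
            dim=-1,
        )
        ctx = torch.tanh(self.combine(ctx))           # (batch, contexts, dim)
        scores = self.attention(ctx).squeeze(-1)      # (batch, contexts)
        # Padded contexts get a very low score, so their weight becomes ~0.
        scores = scores.masked_fill(paths == 0, -1e9)
        weights = torch.softmax(scores, dim=1)
        return (weights.unsqueeze(-1) * ctx).sum(dim=1)  # (batch, dim)


class ToyProjectClassifier(nn.Module):
    """Linear layer on top of the file vector; outputs one logit per file."""

    def __init__(self, n_tokens, n_paths, dim):
        super().__init__()
        self.vectorizer = ToyCodeVectorizer(n_tokens, n_paths, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, starts, paths, ends):
        return self.out(self.vectorizer(starts, paths, ends)).squeeze(-1)
```

A logit above zero would correspond to one project and a logit below zero to the other, for example when training with `torch.nn.BCEWithLogitsLoss`. Keeping the vectorizer as a separate module is what makes it reusable in other models.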