Library for analyzing source code with graphs and NLP. What this repository can do:

- Fetch source code for pip packages
- Create indexes of Python packages using Sourcetrail
- Convert Sourcetrail indexes into a connected graph
- Build graphs for source code from the AST
- Train a Graph Neural Network to learn representations of source code
- Predict Python types using NLP and graph embeddings

For more details, consult our wiki.
You need conda. Create a virtual environment named `SourceCodeTools` with Python 3.8:

```bash
conda create -n SourceCodeTools python=3.8
```
If you plan to use graphviz:

```bash
conda install -c conda-forge pygraphviz graphviz
```
Install CUDA 11.1 if needed:

```bash
conda install -c nvidia cudatoolkit=11.1
```
To install the SourceCodeTools library, run:

```bash
git clone https://github.com/VitalyRomanov/method-embedding.git
cd method-embedding
pip install -e .
# or, with GPU support:
# pip install -e .[gpu]
```
Source code should be structured in the following way:

```
source_code_data
│
├───package1
│   ├───source_file_1.py
│   ├───source_file_2.py
│   └───subfolder_if_needed
│       ├───source_file_3.py
│       └───source_file_4.py
│
└───package2
    ├───source_file_1.py
    └───source_file_2.py
```
An example of source code data can be found in this repository at `method-embedding/res/python_testdata/example_code`. A package should contain self-sufficient code together with its dependencies. Unmet dependencies will be labeled as non-indexed symbols.
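The expected layout (package folders at the top level, `.py` files inside, subfolders allowed) can be created or checked with a few lines of Python. This is just an illustrative sketch, not a helper shipped with the library:

```python
from pathlib import Path

# Build a minimal example of the expected layout:
# one folder per package under the data root, .py files inside
root = Path("source_code_data")
(root / "package1" / "subfolder_if_needed").mkdir(parents=True, exist_ok=True)
(root / "package2").mkdir(parents=True, exist_ok=True)
(root / "package1" / "source_file_1.py").write_text("def f():\n    return 1\n")
(root / "package1" / "subfolder_if_needed" / "source_file_3.py").write_text("X = 3\n")
(root / "package2" / "source_file_1.py").write_text("def g():\n    return 2\n")

# Each top-level directory under the root is treated as one package
packages = sorted(p.name for p in root.iterdir() if p.is_dir())
print(packages)  # ['package1', 'package2']
```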
To create a dataset, first perform indexing with Sourcetrail. The easiest way to do this is with a docker container:

```bash
docker run -it -v "/full/path/to/data/folder":/dataset mortiv16/sourcetrail_indexer
```
You need to provide a SentencePiece model for subtokenization. A model trained on CodeSearchNet can be downloaded here.

```bash
SCT=/path/to/SourceCodeTool_repository
SOURCE_CODE=/path/to/source/code/indexed/with/sourcetrail
DATASET_OUTPUT=/path/to/dataset/output
python $SCT/SourceCodeTools/code/data/sourcetrail/DatasetCreator2.py --bpe_tokenizer sentencepiece_bpe.model --track_offsets --do_extraction $SOURCE_CODE $DATASET_OUTPUT
```
The graph dataset format is described in the wiki:

```
graph_dataset
│
├───no_ast
│   ├───common_call_seq.bz2
│   ├───common_edges.bz2
│   ├───common_function_variable_pairs.bz2
│   ├───common_nodes.bz2
│   ├───common_source_graph_bodies.bz2
│   └───node_names.bz2
│
└───with_ast
    ├───common_call_seq.bz2
    ├───common_edges.bz2
    ├───common_function_variable_pairs.bz2
    ├───common_nodes.bz2
    ├───common_source_graph_bodies.bz2
    └───node_names.bz2
```
`no_ast` contains a graph built from global relationships only; `with_ast` contains a graph with AST nodes and edges. The two main files for building the graph are `common_nodes.bz2` and `common_edges.bz2`. The files are stored as pickled pandas tables (read with `pandas.read_pickle`) and are probably not portable between platforms. You can view the content by converting a table into the csv format:

```bash
python $SCT/SourceCodeTools/code/data/sourcetrail/pandas_format_converter.py common_nodes.bz2 csv
```
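Since the files are ordinary pickled pandas tables, a conversion can also be done directly with pandas. In this self-contained sketch a stand-in table is created first; its columns are illustrative, not the actual dataset schema:

```python
import pandas as pd

# Illustrative table standing in for common_nodes.bz2 (columns are made up)
nodes = pd.DataFrame({"id": [0, 1], "serialized_name": ["module.foo", "module.bar"]})
nodes.to_pickle("common_nodes.bz2")  # pandas infers bz2 compression from the extension

# Read the pickled table back and write it out as csv
table = pd.read_pickle("common_nodes.bz2")
table.to_csv("common_nodes.csv", index=False)
```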
The graph data can be loaded as pandas tables using the `load_data` function:

```python
from SourceCodeTools.code.data.dataset.Dataset import load_data

nodes, edges = load_data(
    node_path="path/to/common_nodes.bz2",
    edge_path="path/to/common_edges.bz2"
)
```
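Once loaded, `nodes` and `edges` are plain pandas DataFrames, so they can be inspected and joined with ordinary pandas operations. The sketch below uses small stand-in tables with hypothetical column names (the real schema is documented in the wiki):

```python
import pandas as pd

# Stand-in tables shaped roughly like load_data output (column names are hypothetical)
nodes = pd.DataFrame({"id": [0, 1, 2], "serialized_name": ["pkg", "pkg.f", "pkg.g"]})
edges = pd.DataFrame({
    "source_node_id": [0, 0],
    "target_node_id": [1, 2],
    "type": ["defines", "defines"],
})

# Attach human-readable names to both endpoints of every edge
named = (
    edges
    .merge(nodes.add_prefix("src_"), left_on="source_node_id", right_on="src_id")
    .merge(nodes.add_prefix("dst_"), left_on="target_node_id", right_on="dst_id")
)
print(named[["src_serialized_name", "type", "dst_serialized_name"]])
```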