Snippet ranger

This tool is built on top of ast2vec Machine Learning models.

It provides an API and tools to train and use models for exploratory snippet mining across a library's ecosystem. It can help you learn new libraries faster and write code more quickly. The module lets you train and use a hierarchical topic model on top of Babelfish UASTs for any library you want.

Snippet ranger is currently under active development.

Install

pip3 install git+https://github.com/src-d/snippet-ranger

Usage

The project exposes two interfaces: an API and a command line. The command-line entry point is:

snippet_ranger --help

Pipeline for dataset collection

1. Get list of dependent repositories.

You need the libraries.io dataset (v1.0.0) on your disk. You can download it here: https://libraries.io/data

Example for numpy library:

snippet_ranger dependent_reps --librariesio_data ../libio/ -o . --libraries numpy:https://github.com/numpy/numpy

Example output files are in the data folder. You can use them to try snippet_ranger without downloading the libraries.io dataset.
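To make the idea of "dependent repositories" concrete, here is a minimal sketch that selects every repository listing a given library as a dependency. It is not how dependent_reps is implemented, and the column names repository_url and dependency_name, as well as the toy rows, are hypothetical and do not reflect the real libraries.io schema.

```python
import csv
import io


def dependent_repositories(rows, library):
    """Return the sorted, de-duplicated repository URLs that depend on `library`."""
    return sorted(
        {row["repository_url"] for row in rows if row["dependency_name"] == library}
    )


# Toy table standing in for a dependency dump; the columns are made up for illustration.
TOY_CSV = """repository_url,dependency_name
https://example.com/alice/stats,numpy
https://example.com/bob/web,flask
https://example.com/alice/stats,scipy
https://example.com/carol/plots,numpy
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
print("\n".join(dependent_repositories(rows, "numpy")))
```

Writing one URL per line gives a plain text list that is easy to hand to a cloning tool.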

2. Clone repositories

Use ast2vec clone for this. It requires enry; if you do not have it, install it via ast2vec enry. Example:

ast2vec clone --ignore -o data/repos/numpy -t 16 --languages Python --linguist ./enry numpy.txt

You can skip this step if you do not want to store the repositories, but enry must still be installed.

3. Convert to Source modelforge models

Use ast2vec repos2source for this. You need a running bblfsh server. Please use v0.7.0 of the server and v0.8.2 of the Python driver:

BBLFSH_DRIVER_IMAGES="python=docker://bblfsh/python-driver:v0.8.2" docker run -e BBLFSH_DRIVER_IMAGES --rm --privileged -d -p 9432:9432 --name bblfsh bblfsh/server:v0.7.0 --log-level DEBUG
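Before starting a long conversion it can be useful to confirm that something is listening on the published port. The sketch below is a generic TCP check, not part of ast2vec or bblfsh; the helper name is_port_open is hypothetical, and a successful connection only shows that the port is open, not that the service behind it is a healthy bblfsh server.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 9432 is the port published by the docker command above.
    print("port 9432 reachable:", is_port_open("localhost", 9432))
```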

Example:

ast2vec repos2source -p 2 -t 8 --organize-files 2 -o data/sources $( find data/repos/numpy -maxdepth 1 -mindepth 1 -type d | xargs)

If you skipped the second step, replace the repository directories with data/numpy_dependent_reps.txt:

ast2vec repos2source -p 2 -t 8 --organize-files 2 -o data/sources data/numpy_dependent_reps.txt

See the ast2vec topic modeling instructions to learn more about the parameters.
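The find expression in the first repos2source example passes every immediate subdirectory of data/repos/numpy as a separate input. If you drive the pipeline from a script, the same argument list can be built as in the sketch below. The helper names immediate_subdirs and repos2source_argv are hypothetical; the flags are copied from the example above, and sorting the result is a choice made here for reproducibility (find does not sort).

```python
from pathlib import Path


def immediate_subdirs(root):
    """List the direct child directories of `root`, sorted by name.

    Mirrors `find root -maxdepth 1 -mindepth 1 -type d`: the root itself,
    deeper levels, plain files and symlinks are all excluded.
    """
    return sorted(
        str(p) for p in Path(root).iterdir() if p.is_dir() and not p.is_symlink()
    )


def repos2source_argv(root, output):
    """Build the argument list for the repos2source example shown above."""
    return [
        "ast2vec", "repos2source", "-p", "2", "-t", "8",
        "--organize-files", "2", "-o", output,
        *immediate_subdirs(root),
    ]
```

The resulting list can be passed to subprocess.run without going through a shell, which avoids quoting problems with unusual directory names.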

4. Get UAST for the library

If the library is a Python library, install it first so that autogenerated files are not lost. The UAST is built from the installation directory:

snippet_ranger pylib2uast -p 1 -o ./data/libraries_uasts numpy

Other languages supported by bblfsh also work: download the library sources and run ast2vec repo2uast on them.

5. Extract snippets from Source model

Use snippet_ranger source2func for this.

This command does the following:

  • Filters out files that do not use the library.
  • Splits each file into functions, or takes the whole file if it has no functions (a plain script).
  • Filters out the resulting pieces that contain no library function calls.

More ways of snippet extraction can be added later.
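The three steps above can be illustrated with a small sketch. Note the substitution: snippet_ranger works on Babelfish UASTs and a library UAST, whereas this sketch uses Python's standard ast module on raw source text, so it only approximates the idea for Python files. All function names here are hypothetical.

```python
import ast


def library_aliases(tree, library):
    """Collect the local names that import statements bind to the library."""
    aliases = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == library or alias.name.startswith(library + "."):
                    # `import numpy as np` binds np; `import numpy.linalg` binds numpy.
                    aliases.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            module = node.module or ""
            if module == library or module.startswith(library + "."):
                # `from numpy import zeros as z` binds z.
                aliases.update(alias.asname or alias.name for alias in node.names)
    return aliases


def calls_library(node, aliases):
    """True if `node` contains a call whose callee is rooted at one of the aliases."""
    for sub in ast.walk(node):
        if isinstance(sub, ast.Call):
            callee = sub.func
            while isinstance(callee, ast.Attribute):  # np.linalg.norm -> np
                callee = callee.value
            if isinstance(callee, ast.Name) and callee.id in aliases:
                return True
    return False


def extract_snippets(source, library):
    """Return (name, code) pairs for the pieces of `source` that call the library."""
    tree = ast.parse(source)
    aliases = library_aliases(tree, library)
    if not aliases:
        return []  # Step 1: the file does not use the library.
    functions = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if functions:  # Step 2: split into functions (nested ones are listed separately).
        units = [(f.name, f, ast.get_source_segment(source, f)) for f in functions]
    else:  # Step 2, fallback: a plain script is kept as a single piece.
        units = [("<file>", tree, source)]
    # Step 3: keep only the pieces that call the library.
    return [(name, code) for name, node, code in units if calls_library(node, aliases)]
```

A file whose functions never call the library yields an empty list here, which corresponds to the case where everything is filtered out.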

Example:

snippet_ranger source2func -p 8 --library_name numpy --library_uast ./data/libraries_uasts/numpy.asdf -o ./data/funcs/numpy/ ./data/sources/numpy

It is fine if you see several "All functions are filtered and you get empty model." errors.

6. Create Vowpal Wabbit dataset

There are two ways to do this. The default is to use all simple identifiers as tokens for document modeling, as described in points 3-4 of the ast2vec topic modeling instructions.

The other is to use only specific identifiers that can be found in the library UAST. For now, this covers only function calls (fc). Use snippet2fc_df and snippet2fc_bow for this second approach.
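As a rough illustration of the second approach, the sketch below computes document frequencies and per-snippet bags of words over function-call tokens. It is not the implementation of snippet2fc_df or snippet2fc_bow: the token names, the snippet names, and the interpretation of a vocabulary size limit as "keep the most frequent tokens by document frequency" are all assumptions.

```python
from collections import Counter


def document_frequencies(snippets):
    """Map each token to the number of snippets that contain it at least once."""
    df = Counter()
    for tokens in snippets.values():
        df.update(set(tokens))  # set(): a token counts once per snippet
    return df


def bags_of_words(snippets, df, vocabulary_size):
    """Count tokens per snippet, keeping only the top tokens by document frequency."""
    vocabulary = {token for token, _ in df.most_common(vocabulary_size)}
    return {
        name: Counter(token for token in tokens if token in vocabulary)
        for name, tokens in snippets.items()
    }


# Hypothetical snippets, each already reduced to the library calls it makes.
SNIPPETS = {
    "snippet_1": ["numpy.sum", "numpy.zeros", "numpy.sum"],
    "snippet_2": ["numpy.sum"],
    "snippet_3": ["numpy.sum", "numpy.zeros", "numpy.dot"],
}

df = document_frequencies(SNIPPETS)
bows = bags_of_words(SNIPPETS, df, vocabulary_size=2)
```

With a vocabulary of two tokens, the rare numpy.dot call is dropped from snippet_3 while the counts of the remaining tokens are preserved.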

Example:

mkdir ./data/dfs_fc
snippet_ranger snippet2fc_df  -p 8 --library_name numpy --library_uast ./data/libraries_uasts/numpy.asdf ./data/funcs/numpy/ ./data/dfs_fc/numpy.asdf
snippet_ranger snippet2fc_bow -p 8 --df ./data/dfs_fc/numpy.asdf -v 1000000 ./data/funcs/numpy/ ./data/bows_fc/numpy

Then follow points 5-7 of the ast2vec topic modeling instructions:

python3 -m ast2vec join-bow -p 16 --bow ./data/bows_fc/numpy ./data/bows_fc/numpy.asdf
python3 -m ast2vec bow2vw --bow ./data/bows_fc/numpy.asdf -o ./data/vowpal_wabbit/numpy_fc.txt
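For orientation, a bag-of-words document in a Vowpal Wabbit style text file is one line: a document title followed by token:count pairs. The sketch below shows that general shape only. The exact layout written by bow2vw is not specified here, so treat the formatting details, the helper name to_vw_line, and the sample data as assumptions.

```python
FORBIDDEN = set(" \t\n:|")  # characters that would break the one-line layout


def to_vw_line(title, bag):
    """Format one document as `title token:count token:count ...`."""
    parts = [title]
    for token, count in sorted(bag.items()):  # sorted for a stable, diffable output
        parts.append(f"{token}:{count}")
    for text in [title, *bag]:
        if FORBIDDEN & set(text):
            raise ValueError(f"unsupported character in {text!r}")
    return " ".join(parts)


# Hypothetical bag of function-call tokens for a single snippet.
print(to_vw_line("snippet_1", {"numpy.zeros": 1, "numpy.sum": 2}))
```

Rejecting whitespace, colons and pipes up front is a conservative choice: those characters act as separators in this kind of format, so letting them through would silently corrupt the dataset.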

Fit shallow and hierarchical topic model

Ongoing.

You need to install the BigARTM library. The easy way is the ast2vec bigartm command (not implemented yet).

You can check out a simple draft experiment in the BigARTM Python API notebook.

Contributions

PEP8

We use PEP8 with a line length of 99 and double quotes (") for strings. All the tests must pass:

python3 -m unittest discover /path/to/ast2vec

License

Apache 2.0.