Introduction to ML-on-Code Workshop

These are the materials for the "Introduction to ML-on-Code" workshop: a guided tour of the source{d} open source technology stack for Machine Learning on Code.

Slides on GDrive.

OSS tools covered:

Public Git Archive: http://pga.sourced.tech/
Siva: https://github.com/src-d/go-siva#command-line-interface
source{d} Engine: https://github.com/src-d/engine/
Project Babelfish: https://doc.bblf.sh/

Prerequisites

Docker
Go

Dependencies

Golang for CLI tools:

go get github.com/src-d/datasets/PublicGitArchive/pga
go get -u gopkg.in/src-d/go-siva.v1/...
# add "$GOPATH/bin" to "$PATH"
# single quotes keep $PATH unexpanded until the profile is sourced
echo 'export PATH="$PATH:$(go env GOPATH)/bin"' >> ~/.bash_profile
source ~/.bash_profile

Import Docker images (works offline):

docker load -i images/engine-jupyter-bblfsh.tgz
docker load -i images/bblfshd-with-drivers.tgz

docker images

Run Bblfsh containers:

docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd-with-drivers

docker exec -it bblfshd bblfshctl driver list

# if the above did not work, remove the container (its name is already taken) and start from the plain image
docker rm -f bblfshd
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd
docker exec -it bblfshd bblfshctl driver install --recommended

Run the Engine container with Jupyter:

docker run --name engine-jupyter -it -p 8080:8080 -v $(pwd)/repositories:/repositories -v $(pwd)/notebooks:/home --link bblfshd:bblfshd srcd/engine-jupyter-bblfsh
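
Before moving on, it can help to confirm that both containers are actually listening. The following is a minimal sketch, assuming the port mappings from the docker run commands above (9432 for bblfshd, 8080 for Jupyter); the helper name port_open is made up for this example. Because the Engine container runs in the foreground, run the check from a second terminal.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Ports taken from the -p flags of the docker run commands above.
    for name, port in [("bblfshd", 9432), ("engine-jupyter", 8080)]:
        status = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name} on port {port}: {status}")
```

A port that is reachable only shows that something accepts connections there; docker ps and docker logs remain the authoritative checks.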

Workflow

The workshop is structured as a sequence of steps, each introducing one layer of the source{d} technology stack, from the bottom up.

1. Play with the Public Git Archive CLI

Public Git Archive is a reference dataset containing the full history of the ~180k most popular (>50 stars) projects on GitHub.

710 GB of code in 3 TB of packfiles.

cp -r .pga/latest.csv.gz ~/
pga help

# number of repositories of the "github" organization
pga list -u github.com/github/ -f json | wc -l

# number of repositories of the "github" organization that contain Go code
pga list -u github.com/github/ --lang go -f json | wc -l

# pretty-print src-d repos
pga list -u github.com/src-d/ -f json | jq -r . | less

# URLs and languages for src-d repos with more than 50 files
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .url + " " + .langs[]' | less
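
The same selection can be done without jq. The following is a minimal sketch, assuming pga list -f json prints one JSON object per line (as the wc -l usage above suggests) with the url, fileCount and langs fields used in the jq filter. The SAMPLE records and the function name select_repos are hypothetical.

```python
import json


def select_repos(lines, min_files=50):
    """Yield (url, lang) pairs for repositories with more than min_files files."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        repo = json.loads(line)
        if repo.get("fileCount", 0) > min_files:
            # One output pair per language, like `.langs[]` in the jq filter.
            for lang in repo.get("langs", []):
                yield repo["url"], lang


# Hypothetical records standing in for real `pga list -f json` output.
SAMPLE = [
    '{"url": "https://github.com/example/big", "fileCount": 120, "langs": ["Go", "Shell"]}',
    '{"url": "https://github.com/example/small", "fileCount": 7, "langs": ["Python"]}',
]

for url, lang in select_repos(SAMPLE):
    print(url, lang)
```

To run it on real data, pass sys.stdin instead of SAMPLE and pipe the pga output into the script.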

Materials:

2. Get used to Siva format

Seekable Indexed Block Archiver file format.

It keeps all files and updates of a single Git repository in one file on the file system.

find ./repositories/

# list files in archive
siva list ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva

# extract single file
siva unpack -m=config ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva .
less config

# extract all files (bare Git repository)
siva unpack ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva go-kallax/.git

# list all Git objects
cd go-kallax
git verify-pack -v .git/objects/pack/pack-4a202ad08739b7236f57a3a283f45c27087a99f6.idx

# get a single object
git cat-file -p 72e6129819d6a580512f131f0c8d34cf16ffe4e5
git cat-file -p 63d6012da17573aec5d61d8ba4bae4bf8eab257e
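
The hashes passed to git cat-file are content addresses. The following sketch shows how Git derives a SHA-1 object ID from an object's type and content, and how a loose object is laid out on disk. The objects unpacked above live in a packfile, which uses a different container layout, but the ID computation is the same. Function names are made up for this example.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over the header '<type> <size>' + NUL byte + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Loose-object layout: zlib-compressed header + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)


def decode_loose_object(raw: bytes):
    """Inverse of encode_loose_object; returns (type, content)."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


# The well-known ID of the empty blob.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result can be compared against git hash-object on the same bytes.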

Materials:

3. Engine (basic queries)

source{d} Engine is a library for querying Git repositories in parallel on a cluster of machines using Apache Spark.

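To start an Apache Spark session with the Engine package (Scala shell):

spark-shell --packages "tech.sourced:engine:0.5.5"

The query below is written in Python, so it needs the PySpark shell instead, started with the same argument:

pyspark --packages "tech.sourced:engine:0.5.5"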

Example query (Python):

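from sourced.engine import Engine

# "spark" is the SparkSession provided by the PySpark shell.
# The chain is wrapped in parentheses so that it can span several lines.
(Engine(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .filter("lang = 'java'")
    .select('path', 'repository_id')
    .write
    .parquet("hdfs://..."))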
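
To make the shape of this pipeline concrete without a Spark cluster, here is a plain-Python sketch of the classify, filter and select steps on in-memory rows. It is an illustration only: the extension table is a hypothetical stand-in for classify_languages(), which is far more thorough, and the row layout (dicts with path and repository_id) is assumed from the column names in the query.

```python
import os

# Hypothetical extension-to-language table, used only for this illustration.
EXT_TO_LANG = {".java": "java", ".py": "python", ".go": "go"}


def classify_language(path):
    """Guess a language from the file extension; None if unknown."""
    _, ext = os.path.splitext(path)
    return EXT_TO_LANG.get(ext.lower())


def java_files(rows):
    """Keep rows classified as Java and project them to two columns."""
    result = []
    for row in rows:
        if classify_language(row["path"]) == "java":  # filter("lang = 'java'")
            result.append({                           # select('path', 'repository_id')
                "path": row["path"],
                "repository_id": row["repository_id"],
            })
    return result


# Hypothetical rows standing in for the files of each repository's HEAD.
ROWS = [
    {"repository_id": "github.com/example/a", "path": "src/Main.java", "size": 10},
    {"repository_id": "github.com/example/a", "path": "README.md", "size": 3},
    {"repository_id": "github.com/example/b", "path": "tool.py", "size": 7},
]

print(java_files(ROWS))
```

In the Engine, each of these steps is a distributed DataFrame transformation rather than a Python loop.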

Open the Jupyter notebook "Engine (basic)" in your browser; it is served by the running Docker container on port 8080.

Materials:

4. Project Babelfish

Project Babelfish provides a universal code parser: a containerized parser infrastructure that extracts a uAST representation from source code text.

Visit http://dashboard.bblf.sh/ to experiment with the uAST representation.

(: function names :)
//*[@roleFunction and @roleDeclaration and @roleName and not(@roleArgument)]
    
(: python Docstrings :)
//*[@roleFunction and @roleDeclaration and @roleBody]/*/*[@roleLiteral]
    
(: identifiers :)
//*[@roleIdentifier and not(@roleIncomplete)]
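
The queries above can be tried offline on a toy tree. The following is a minimal sketch using only Python's standard library, under two assumptions: the XML document is a hand-written, hypothetical stand-in for a uAST (real ones come from bblfshd), and since ElementTree supports only a subset of XPath without "and" or "not()", the predicates are chained and the negation is applied in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML stand-in for the uAST of `def add(a, b): ...`.
TOY_UAST = """
<File>
  <FunctionDef roleFunction="true" roleDeclaration="true">
    <Name token="add" roleFunction="true" roleDeclaration="true" roleName="true" roleIdentifier="true"/>
    <Arg token="a" roleFunction="true" roleDeclaration="true" roleName="true" roleArgument="true" roleIdentifier="true"/>
    <Arg token="b" roleFunction="true" roleDeclaration="true" roleName="true" roleArgument="true" roleIdentifier="true"/>
  </FunctionDef>
</File>
"""


def function_names(root):
    """Approximates //*[@roleFunction and @roleDeclaration and @roleName and not(@roleArgument)]."""
    nodes = root.findall(".//*[@roleFunction][@roleDeclaration][@roleName]")
    return [n.get("token") for n in nodes if n.get("roleArgument") is None]


def identifiers(root):
    """Approximates //*[@roleIdentifier and not(@roleIncomplete)]."""
    nodes = root.findall(".//*[@roleIdentifier]")
    return [n.get("token") for n in nodes if n.get("roleIncomplete") is None]


root = ET.fromstring(TOY_UAST)
print(function_names(root))  # ['add']
print(identifiers(root))     # ['add', 'a', 'b']
```

A full XPath engine evaluates the original expressions directly; the split into findall plus a Python filter is only a workaround for ElementTree's limited syntax.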

Materials:

5. Engine (advanced, UAST)

Through Engine, it is possible to parse files to uASTs using Bblfsh and then query those with XPath.

Open the Jupyter notebook "Engine (advanced)" in your browser; it is served by the running Docker container on port 8080.

Materials:

6. (TBD) ML: train a model

Use the data saved in the previous step to train a source code identifier embedding model with TensorFlow.
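
Since this step is still to be defined, the following is only a sketch of one common input for embedding models: a table counting how often two identifiers appear in the same scope. Treating a whole file as the scope is an assumption made for this example, not the workshop's method, and the function name cooccurrences is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(files):
    """Count, for each unordered pair of identifiers, the files containing both.

    files: iterable of identifier lists, one list per source file.
    """
    counts = Counter()
    for identifiers in files:
        # Deduplicate and sort so each pair has one canonical (a, b) form.
        unique = sorted(set(identifiers))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical identifier lists, e.g. extracted with the XPath queries earlier.
FILES = [
    ["i", "count", "i"],
    ["count", "i", "total"],
]

for pair, n in sorted(cooccurrences(FILES).items()):
    print(pair, n)
```

The resulting counts form a sparse symmetric matrix that an embedding trainer can consume.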

Materials:

https://blog.sourced.tech/post/id2vec/
