/ml-on-code

"Introduction to ML-on-Code" workshop materials 2018


Introduction to ML-on-Code Workshop

These are materials for the "Introduction to ML-on-Code" workshop - a guided tour of the source{d} open-source technology stack for Machine Learning on Code.

Slides on GDrive.

OSS tools covered:

Content

Prerequisites

  • Docker
  • Go

Dependencies

Golang for CLI tools:

go get github.com/src-d/datasets/PublicGitArchive/pga
go get -u gopkg.in/src-d/go-siva.v1/...
# add "$GOPATH/bin" to "$PATH"
echo 'export PATH="$PATH:$(go env GOPATH)/bin"' >> ~/.bash_profile
source ~/.bash_profile
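
Once the profile has been reloaded, a quick check that both binaries are visible on `PATH` can save debugging later. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical, and the binary names `pga` and `siva` are assumed from the commands used in this README.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "pga" and "siva" are assumed to be the binaries installed by `go get` above.
    missing = missing_tools(["pga", "siva"])
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all CLI tools found")
```

If a tool is reported as missing, re-check that `$(go env GOPATH)/bin` is part of `PATH` in the current shell.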

Import Docker images (works offline):

docker load -i images/engine-jupyter-bblfsh.tgz
docker load -i images/bblfshd-with-drivers.tgz

docker images

Run Bblfsh containers:

docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd-with-drivers

docker exec -it bblfshd bblfshctl driver list

# if the above did not work for some reason, remove the container and use
docker rm -f bblfshd
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd
docker exec -it bblfshd bblfshctl driver install --recommended
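
Before starting the Engine container, it can help to confirm that something is listening on the published port. This is a small sketch, not part of the workshop tooling: `port_open` is a hypothetical helper, and it only checks TCP reachability of port 9432, not whether drivers are installed.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 9432 is the port published by the `docker run` commands above.
    print("bblfshd reachable:", port_open("localhost", 9432))
```

A `False` result usually means the container is not running; `docker ps` shows its current state.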

Run the Engine container with Jupyter:

docker run --name engine-jupyter -it -p 8080:8080 -v $(pwd)/repositories:/repositories -v $(pwd)/notebooks:/home --link bblfshd:bblfshd srcd/engine-jupyter-bblfsh

Workflow

The workshop is structured as a sequence of steps, each introducing one layer of the source{d} technology stack, from the bottom up.

Workshop flow

1. Play with the Public Git Archive CLI

Public Git Archive (PGA) is a reference dataset containing the full history of the ~180k most popular (>50 stars) projects on GitHub.

710 GB of code in 3 TB of packfiles.

cp -r .pga/latest.csv.gz ~/
pga help

# number of repositories under github.com/github/
pga list -u github.com/github/ -f json | wc -l

# number of repositories under github.com/github/ that contain Go code
pga list -u github.com/github/ --lang go -f json | wc -l

# pretty-print src-d repos
pga list -u github.com/src-d/ -f json | jq -r . | less

# URLs and languages for src-d repos with more than 50 files
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .url + " " + .langs[]' | less
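
The same counting and filtering can be done in Python when `jq` is not available. The sketch below assumes one JSON object per line (which is what makes `wc -l` work above) and the field names `url`, `fileCount` and `langs` used in the `jq` expression; the sample records and function names are hypothetical.

```python
import json


def parse_records(lines):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def urls_with_langs(records, min_files=50):
    """Yield "url lang" strings for repos with more than `min_files` files."""
    for rec in records:
        if rec.get("fileCount", 0) > min_files:
            # One output line per language, like `.langs[]` in jq.
            for lang in rec.get("langs", []):
                yield f'{rec["url"]} {lang}'


# Hypothetical sample records; real input would come from `pga list -f json`.
SAMPLE = [
    '{"url": "https://github.com/example/big", "fileCount": 120, "langs": ["Go", "Shell"]}',
    '{"url": "https://github.com/example/small", "fileCount": 7, "langs": ["Python"]}',
]

if __name__ == "__main__":
    records = parse_records(SAMPLE)
    print(len(records))  # plays the role of `wc -l`
    for line in urls_with_langs(records):
        print(line)
```

To process real data, pass `sys.stdin` to `parse_records` and pipe the output of `pga list` into the script.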

Materials:

2. Get used to the Siva format

Seekable Indexed Block Archiver file format.

Keeps all files and updates of a single Git repository in one file on the file system.

find ./repositories/

# list files in archive
siva list ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva

# extract single file
siva unpack -m=config ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva .
less config

# extract all files (bare Git repository)
siva unpack ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva go-kallax/.git

# list all Git objects
cd go-kallax
git verify-pack -v .git/objects/pack/pack-4a202ad08739b7236f57a3a283f45c27087a99f6.idx

# get a single object
git cat-file -p 72e6129819d6a580512f131f0c8d34cf16ffe4e5
git cat-file -p 63d6012da17573aec5d61d8ba4bae4bf8eab257e
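
The 40-character IDs passed to `git cat-file` are content hashes. As a minimal sketch, assuming the default SHA-1 object format, the ID of an object is the SHA-1 of a short header (`<type> <size>` followed by a NUL byte) plus the object content; the function name `git_object_id` is hypothetical.

```python
import hashlib


def git_object_id(obj_type, payload):
    """Return the SHA-1 object ID Git assigns to `payload` (bytes) of the given type."""
    # Header format: "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    # Should match `printf 'hello world\n' | git hash-object --stdin`.
    print(git_object_id("blob", b"hello world\n"))
```

This is why identical file contents are stored only once inside a packfile, regardless of how many commits reference them.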

Materials:

3. Engine (basic queries)

source{d} Engine is a library for querying Git repositories in parallel across a cluster of machines using Apache Spark.

To start an Apache Spark session:

spark-shell --packages "tech.sourced:engine:0.5.5"

Example query:

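from sourced.engine import Engine

# `spark` is the session object provided by the Spark shell.
# The outer parentheses let the method chain span several lines.
(Engine(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .filter("lang = 'java'")          # keep only files classified as Java
    .select('path', 'repository_id')
    .write
    .parquet("hdfs://..."))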
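
To make the last stages of the query concrete without a Spark cluster, the sketch below mimics the `filter` and `select` steps on plain Python dictionaries. It is not Spark or Engine code; the rows and the helper `select_java` are hypothetical, and only the column names are taken from the query.

```python
# Hypothetical rows standing in for the output of classify_languages();
# the column names come from the query above, the values are made up.
ROWS = [
    {"repository_id": "github.com/example/a", "path": "src/Main.java", "lang": "java"},
    {"repository_id": "github.com/example/a", "path": "README.md", "lang": "markdown"},
    {"repository_id": "github.com/example/b", "path": "lib/Util.java", "lang": "java"},
]


def select_java(rows):
    """Mimic .filter("lang = 'java'").select('path', 'repository_id')."""
    return [(r["path"], r["repository_id"]) for r in rows if r["lang"] == "java"]


if __name__ == "__main__":
    for path, repo in select_java(ROWS):
        print(path, repo)
```

In the real query the same logic runs in parallel over the rows produced from the siva files, and the result is written out as Parquet.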

Open the "Engine (basic)" Jupyter notebook in your browser; it is served by the running Docker container.

Materials:

4. Project Babelfish


Project Babelfish provides a universal code parser: a containerized parser infrastructure that extracts a uAST (universal abstract syntax tree) representation from source code text.

Visit http://dashboard.bblf.sh/ to experiment with the uAST representation.

(: function names :)
//*[@roleFunction and @roleDeclaration and @roleName and not(@roleArgument)]
    
(: Python docstrings :)
//*[@roleFunction and @roleDeclaration and @roleBody]/*/*[@roleLiteral]
    
(: identifiers :)
//*[@roleIdentifier and not(@roleIncomplete)]
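
The first query can be tried offline on a toy tree. The sketch below uses Python's `xml.etree.ElementTree`, whose XPath subset has no `and` or `not()`, so the positive conditions are chained as predicates and the negation is done in Python. The XML document is a hypothetical, heavily simplified stand-in for a uAST in which roles appear as attributes, as the queries above assume; it is not real Babelfish output.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy tree: roles are modelled as attributes, tokens as `token`.
TOY_UAST = """
<Module>
  <FunctionDef roleFunction="true" roleDeclaration="true">
    <Name token="add" roleFunction="true" roleDeclaration="true" roleName="true"/>
    <Arg token="a" roleFunction="true" roleDeclaration="true" roleName="true" roleArgument="true"/>
    <Arg token="b" roleFunction="true" roleDeclaration="true" roleName="true" roleArgument="true"/>
  </FunctionDef>
</Module>
"""


def function_names(root):
    """Tokens of nodes that are function declaration names but not arguments."""
    # Chained predicates act as a logical AND of the attribute tests.
    candidates = root.findall(".//*[@roleFunction][@roleDeclaration][@roleName]")
    # `not(@roleArgument)` is expressed as a plain Python filter.
    return [el.get("token") for el in candidates if el.get("roleArgument") is None]


if __name__ == "__main__":
    print(function_names(ET.fromstring(TOY_UAST.strip())))
```

A full XPath engine, such as the one behind the dashboard, accepts the original expressions directly.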

Materials:

5. Engine (advanced, UAST)

Through Engine, files can be parsed into uASTs with Bblfsh and then queried with XPath.

Open the "Engine (advanced)" Jupyter notebook in your browser; it is served by your running Docker container.

Materials:

6. (TBD) ML: train a model

Use the data saved in the previous step to train a source code identifier embedding model with TensorFlow.
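
Since this step is still to be defined, the following is only one possible preprocessing sketch, not the workshop's method. It assumes the saved data can be read as an ordered list of identifiers per file and turns it into skip-gram style (center, context) pairs, a common input for embedding training; the function name and the window size are hypothetical choices.

```python
def skipgram_pairs(identifiers, window=2):
    """Return (center, context) pairs for identifiers at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(identifiers):
        lo = max(0, i - window)
        hi = min(len(identifiers), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an identifier is not its own context
                pairs.append((center, identifiers[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical identifiers extracted from a single file.
    for pair in skipgram_pairs(["reader", "buffer", "size", "offset"], window=1):
        print(pair)
```

The resulting pairs could then be mapped to integer IDs and fed to an embedding model.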

Materials: