Snorkel

Acknowledgements

Sponsored in part by DARPA as part of the [SIMPLEX](http://www.darpa.mil/program/simplifying-complexity-in-scientific-discovery) program under contract number N66001-15-C-4043.

Versions

master: Current stable development version (v0.3)
v0.2-stable: Last full version (v0.2)

Running

After installing (see below), just run:

./run.sh

Motivation

Snorkel is intended to be a lightweight but powerful framework for developing structured information extraction applications for domains in which large labeled training sets are not available or easy to obtain, using the data programming paradigm.

In the data programming approach to developing a machine learning system, the developer focuses on writing a set of labeling functions, which create a large but noisy training set. Snorkel then learns a generative model of this noise—learning, essentially, which labeling functions are more accurate than others—and uses this to train a discriminative classifier.

At a high level, the idea is that developers can focus on writing labeling functions—which are just (Python) functions that provide a label for some subset of data points—and not think about algorithms or features!

Snorkel is very much a work in progress, but some people have already begun developing applications with it, and initial feedback has been positive... let us know what you think, and how we can improve it, in the Issues section!

References

Technical report on Data Programming: https://arxiv.org/abs/1605.07723
Workshop paper from HILDA 2016 (note Snorkel was previously DDLite): here

Installation / dependencies

Snorkel requires a few python packages including:

We provide a simple way to install everything using virtualenv:

# set up a Python virtualenv
virtualenv .virtualenv
source .virtualenv/bin/activate

pip install --requirement python-package-requirement.txt

Finally, enable ipywidgets:

jupyter nbextension enable --py widgetsnbextension --sys-prefix

Note: if you have an issue with the matplotlib install related to the module freetype, see this post; if you have an issue installing ipython, try upgrading setuptools

Alternatively, they could be installed system-wide if sudo pip is used instead of pip in the last command without the virtualenv setup and activation.

Learning how to use Snorkel

New tutorial (in progress; covers through candidate extraction for entities):

tutorial/CDR_tutorial.ipynb

Supported legacy tutorial (covers full pipeline):

GeneTaggerExample_Extraction.ipynb walks through the candidate extraction workflow for an entity tagging task. * GeneTaggerExample_Learning.ipynb picks up where the extraction notebook left off. The learning notebook demonstrates the labeling function iteration workflow and learning methods.

Documentation

To generate documentation (built using pdoc), run ./generate_docs.sh.

Issues

We like issues as a place to put bugs, questions, feature requests, etc- don't be shy! If submitting an issue about a bug, however, please provide a pointer to a notebook (and relevant data) to reproduce it.

Jupyter Notebook Best Practices

Snorkel is built specifically with usage in Jupyter/IPython notebooks in mind; an incomplete set of best practices for the notebooks:

It's usually most convenient to write most code in an external .py file, and load as a module that's automatically reloaded; use:

%load_ext autoreload
%autoreload 2

A more convenient option is to add these lines to your IPython config file, in ~/.ipython/profile_default/ipython_config.py:

c.InteractiveShellApp.extensions = ['autoreload']     
c.InteractiveShellApp.exec_lines = ['%autoreload 2']

mooz/snorkel