Sponsored in part by DARPA as part of the SIMPLEX program under contract number N66001-15-C-4043.
Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured information extraction applications for domains in which large labeled training sets are not available or easy to obtain.
Today's state-of-the-art machine learning models, such as deep learning ones, largely automate the onerous task of feature engineering, but at the cost of requiring massive labeled training sets. Snorkel is based around the data programming paradigm, which provides a faster way to generate training data. In this approach, the developer focuses on writing a set of labeling functions which generate a large but noisy set of training labels. Snorkel then learns a generative model of this noisy labeling process (learning, essentially, which labeling functions are more accurate than others) and uses this to train any simple or state-of-the-art end model (for example, a deep neural network in TensorFlow).
Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user and use them to train high-quality end models! Snorkel can also be seen as providing a unifying framework for various weak supervision techniques, allowing the developer to leverage all available supervision resources to train their model.
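To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It is not Snorkel's API: the function names, the regular expressions, and the example sentences are all made up for illustration, and the weights are hand-picked stand-ins for the accuracies that Snorkel would learn with its generative model (which this sketch does not implement).

```python
# Illustrative sketch only: plain Python, not Snorkel's actual API.
import re

POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_married(sentence):
    """Vote POSITIVE if a marriage-related word appears."""
    pattern = r"\b(married|wife|husband|spouse)\b"
    return POSITIVE if re.search(pattern, sentence, re.I) else ABSTAIN

def lf_sibling(sentence):
    """Vote NEGATIVE if the sentence mentions a sibling relation."""
    pattern = r"\b(brother|sister|sibling)\b"
    return NEGATIVE if re.search(pattern, sentence, re.I) else ABSTAIN

def lf_and_his_her(sentence):
    """A deliberately noisy heuristic: 'and his/her' weakly suggests a couple."""
    return POSITIVE if re.search(r"\band (his|her)\b", sentence, re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_married, lf_sibling, lf_and_his_her]

def apply_lfs(sentences, lfs=LABELING_FUNCTIONS):
    """Return one row of votes per sentence (the noisy label matrix)."""
    return [[lf(s) for lf in lfs] for s in sentences]

def weighted_vote(row, weights):
    """Combine one row of votes using per-function weights; 0 means undecided."""
    score = sum(w * v for w, v in zip(weights, row))
    if score > 0:
        return POSITIVE
    if score < 0:
        return NEGATIVE
    return ABSTAIN

# Made-up example sentences.
sentences = [
    "Alice married Bob in 2010.",
    "Carol and her brother Dave attended.",
    "Erin thanked the committee.",
]
label_matrix = apply_lfs(sentences)

# Hypothetical weights standing in for learned labeling-function accuracies.
weights = [0.9, 0.8, 0.3]
labels = [weighted_vote(row, weights) for row in label_matrix]
print(labels)  # -> [1, -1, 0]
```

In the second sentence the noisy "and her" heuristic votes positive, but the more heavily weighted sibling function overrides it. That is the intuition behind weighting labeling functions by their estimated accuracy rather than counting every vote equally.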
Snorkel is very much a work in progress, but some people have already begun developing applications with it... let us know what you think, and how we can improve it, in the Issues section!
- Data Programming: Creating Large Training Sets, Quickly (NIPS 2016)
- Data Programming with DDLite: Putting Humans in a Different Part of the Loop (HILDA @ SIGMOD 2016)
- Snorkel: A System for Lightweight Extraction (CIDR 2017)
- Data Programming: ML with Weak Supervision (blog)
Snorkel uses Python 2.7 and requires a few Python packages, which can be installed using pip:
pip install --requirement python-package-requirement.txt
If a package installation fails, then all of the packages below it in python-package-requirement.txt will fail to install as well. This can be avoided by running the following command instead of the above:
cat python-package-requirement.txt | xargs -n 1 pip install
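The same one-package-at-a-time idea can also be sketched in Python, which makes it easy to skip blank lines and comments and to report which packages failed. This is an alternative sketch, not part of Snorkel: the helper names `parse_requirements`, `install_each`, and `dry_run` are hypothetical, and it assumes each line of the file is a single specifier that can be passed to `pip install` as one argument.

```python
# Sketch: install each requirement separately so one failure does not block the rest.
import subprocess
import sys

def parse_requirements(text):
    """Return requirement specifiers, skipping blank lines and comments."""
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements

def install_each(requirements, run=subprocess.call):
    """Run one `pip install` per requirement; return the ones that failed."""
    failed = []
    for req in requirements:
        # `python -m pip` uses the pip that belongs to the running interpreter.
        if run([sys.executable, "-m", "pip", "install", req]) != 0:
            failed.append(req)
    return failed

def dry_run(cmd):
    """Stand-in runner that only prints the command and reports success."""
    print(" ".join(cmd))
    return 0

# Dry run on made-up file contents; nothing is installed here.
sample = "numpy\n# a comment\n\nscipy\n"
failed = install_each(parse_requirements(sample), run=dry_run)
```

To install for real, read python-package-requirement.txt, pass its contents to `parse_requirements`, and call `install_each` without the `run` argument.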
Note that you may have to run pip2 if you have Python 3 installed on your system, and that sudo can be prepended to install dependencies system-wide if this is an option and the above does not work.
For some pointers on difficulties in using source in shell, see Issue 506.
Finally, enable ipywidgets:
jupyter nbextension enable --py widgetsnbextension --sys-prefix
Note: Currently the Viewer is supported on the following versions:
- jupyter: 4.1
- jupyter notebook: 4.2
By default (e.g. in the tutorials) we also use Stanford CoreNLP for pre-processing text; you will be prompted to install this when you run run.sh.
One great option, which can make installation and use easier, is to use conda.
If you are running multiple versions of Python, you might need to run:
conda create -n py2Env python=2.7 anaconda
And then activate the correct environment:
source activate py2Env
Snorkel currently relies on numbskull and numba, which occasionally require a bit more work to install! One option is to use conda as above. If installing manually, you may just need to make sure the right versions of llvmlite and LLVM are installed and used; for example on Ubuntu, run:
apt-get install llvm-3.8
LLVM_CONFIG=/usr/bin/llvm-config-3.8 pip install llvmlite
LLVM_CONFIG=/usr/bin/llvm-config-3.8 pip install numba
and on Mac OS X, one option is to use Homebrew as follows:
brew install llvm38 --with-rtti
LLVM_CONFIG=/usr/local/Cellar/llvm\@3.8/3.8.1/bin/llvm-config-3.8 pip install llvmlite
LLVM_CONFIG=/usr/local/Cellar/llvm\@3.8/3.8.1/bin/llvm-config-3.8 pip install numba
Finally, once numba is installed, re-run the numbskull install from the python-package-requirement.txt file:
pip install git+https://github.com/HazyResearch/numbskull@master
Alternatively, virtualenv can be used by starting with:
virtualenv -p python2.7 .virtualenv
source .virtualenv/bin/activate
If you have issues using Jupyter notebooks with virtualenv, see this tutorial.
After installing (see above), just run:
./run.sh
The introductory tutorial covers the entire Snorkel workflow, showing how to extract spouse relations from news articles. The tutorial is available in the following directory:
tutorials/intro
We like issues as a place to put bugs, questions, feature requests, etc. Don't be shy! If submitting an issue about a bug, however, please provide a pointer to a notebook (and relevant data) to reproduce it.
Note: if you have an issue with the matplotlib install related to the module freetype, see this post; if you have an issue installing ipython, try upgrading setuptools.
Snorkel is built specifically with usage in Jupyter/IPython notebooks in mind; an incomplete set of best practices for the notebooks:
It's usually most convenient to write most code in an external .py file and load it as a module that's automatically reloaded; use:
%load_ext autoreload
%autoreload 2
A more convenient option is to add these lines to your IPython config file, in ~/.ipython/profile_default/ipython_config.py:
c.InteractiveShellApp.extensions = ['autoreload']
c.InteractiveShellApp.exec_lines = ['%autoreload 2']
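For intuition about what autoreload saves you from doing by hand, the sketch below edits a module on disk and reloads it explicitly with the standard library. It is only a rough illustration of the underlying mechanism, not how the extension is implemented. The module name `my_helpers` and its `THRESHOLD` attribute are hypothetical, and the code is written for Python 3 (in Python 2.7 the built-in `reload` plays the same role as `importlib.reload`).

```python
# Sketch: manually reloading an edited module (what autoreload automates).
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale cached bytecode in this quick demo

workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "my_helpers.py")  # hypothetical module

def write_module(value):
    """(Re)write the module source, simulating an edit in an external editor."""
    with open(module_path, "w") as f:
        f.write("THRESHOLD = %r\n" % value)

write_module(0.5)
sys.path.insert(0, workdir)
import my_helpers

first = my_helpers.THRESHOLD   # value from the original source

write_module(0.75)             # edit the file on disk
importlib.invalidate_caches()
importlib.reload(my_helpers)   # re-executes the module source in place
second = my_helpers.THRESHOLD  # value from the edited source

print(first, second)  # -> 0.5 0.75
```

With `%autoreload 2` enabled, modified modules are reloaded before each cell runs, so no explicit reload call is needed in the notebook.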