snorkel: A Jupyter Notebook repository from edonkor1

v0.6.3

Acknowledgements

Sponsored in part by DARPA as part of the D3M program under contract No. FA8750-17-2-0095 and the SIMPLEX program under contract number N66001-15-C-4043, and also by the NIH through the Mobilize Center under grant number U54EB020405.

Getting Started

Installation instructions below
Get started with the tutorials below
Documentation here

Motivation

Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Snorkel is instead based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process (learning, essentially, which labeling functions are more accurate than others) and then uses this to train an end model (for example, a deep neural network in TensorFlow).

Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
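As a rough illustration of the labeling functions described above, here is a minimal, self-contained sketch in plain Python. It does not use the Snorkel API: the function names, the example sentences, and the 1 / -1 / 0 (positive / negative / abstain) convention are assumptions made for this example, and the majority vote at the end is only a naive stand-in for the model Snorkel learns over labeling function accuracies.

```python
import re

# Label convention assumed for this sketch: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_spouse_keyword(sentence):
    """Vote positive if a marriage-related word appears."""
    pattern = r"\b(wife|husband|spouse|married)\b"
    return POSITIVE if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_family_keyword(sentence):
    """Vote negative if a non-spouse family relation is mentioned."""
    pattern = r"\b(brother|sister|mother|father|son|daughter)\b"
    return NEGATIVE if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_former_spouse(sentence):
    """Vote negative if the relation is described as a former marriage."""
    pattern = r"\b(ex-wife|ex-husband|divorced)\b"
    return NEGATIVE if re.search(pattern, sentence, re.I) else ABSTAIN


# Hypothetical set of labeling functions for a spouse-extraction task.
LABELING_FUNCTIONS = [lf_spouse_keyword, lf_family_keyword, lf_former_spouse]


def apply_lfs(sentences):
    """Build the label matrix: one row per sentence, one column per function."""
    return [[lf(s) for lf in LABELING_FUNCTIONS] for s in sentences]


def majority_vote(votes):
    """Naive combination of noisy votes (NOT Snorkel's learned model)."""
    total = sum(votes)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


sentences = [
    "Alice and her husband Bob attended the gala.",
    "Erin played against her sister Fay in the final.",
    "Carol spoke about her ex-husband Dan.",
]

label_matrix = apply_lfs(sentences)
combined = [majority_vote(row) for row in label_matrix]

for sentence, row, label in zip(sentences, label_matrix, combined):
    print(row, label, sentence)
```

The third sentence shows why the labels are noisy: the keyword function and the former-spouse function disagree, so a simple vote cannot decide. Weighting the functions by how accurate each one tends to be, rather than counting votes equally, is the step Snorkel is described as automating.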

Users

We're lucky to have some amazing collaborators who are currently using Snorkel!

However, Snorkel is very much a work in progress, so we're eager for any and all feedback. Let us know what you think and how we can improve Snorkel in the Issues section!

References

Best References:

Snorkel: Rapid Training Data Creation with Weak Supervision (VLDB 2018)
Data Programming: Creating Large Training Sets, Quickly (NIPS 2016)
Learning the Structure of Generative Models without Labeled Data (ICML 2017)
Snorkel: Fast Training Set Generation for Information Extraction (SIGMOD DEMO 2017)
Inferring Generative Model Structure with Static Analysis (NIPS 2017)
Data Programming with DDLite: Putting Humans in a Different Part of the Loop (HILDA @ SIGMOD 2016; note Snorkel was previously DDLite)
Socratic Learning: Correcting Misspecified Generative Models using Discriminative Models
Fonduer: Knowledge Base Construction from Richly Formatted Data

Learning how to use Snorkel

The introductory tutorial covers the entire Snorkel workflow, showing how to extract spouse relations from news articles. The tutorial is available in the following directory:

tutorials/intro

You can also check out all the great materials from the recent Mobilize Center-hosted Snorkel workshop!

Then, for more content, check out the other tutorials available here.

Release Notes

Major changes in v0.6:

Support for categorical classification, including "dynamically-scoped" or blocked categoricals (see tutorial)
Support for structure learning (see tutorial, ICML 2017 paper)
Support for labeled data in generative model
Refactor of TensorFlow bindings; fixes grid search and model saving / reloading issues (see snorkel/learning)
New, simplified Intro tutorial (here)
Refactored parser class and support for spaCy as new default parser
Support for easy use of the BRAT annotation tool (see tutorial)
Initial Spark integration, for scale out of LF application (see tutorial)
Tutorial on using crowdsourced data here
Integration with Apache Tika via the Tika Python binding.
And many more fixes, additions, and new material!

Installation

Snorkel uses Python 2.7 or Python 3 and requires a few Python packages, which can be installed using conda and pip.

Setting Up Conda

Installation is easiest if you download and install conda. You can create a new conda environment with e.g.:

conda create -n py2Env python=2.7 anaconda

Then activate the environment:

source activate py2Env

Installing dependencies

First, install Numba, a package for high-performance numeric computing in Python, via conda:

conda install numba

Then install the remaining package requirements:

pip install --requirement python-package-requirement.txt

Finally, enable ipywidgets:

jupyter nbextension enable --py widgetsnbextension --sys-prefix

Note: If you are using conda and experience issues with lxml, try running conda install libxml2.

Note: Currently the Viewer is supported on the following versions:

jupyter: 4.1
jupyter notebook: 4.2
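To check an environment against the versions listed above, a small helper like the following may be useful. This is only a sketch: it assumes that "jupyter" refers to the `jupyter-core` distribution and "jupyter notebook" to the `notebook` distribution, which this README does not state, and the function names are made up for this example.

```python
# Assumed mapping from the versions listed above to installed distribution
# names; "jupyter-core" for "jupyter" is a guess, not stated in the README.
SUPPORTED = {"jupyter-core": "4.1", "notebook": "4.2"}


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters

        try:
            return pkg_resources.get_distribution(package_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def matches_supported(installed, supported):
    """True if 'installed' (e.g. '4.2.3') has the major.minor of 'supported'."""
    if installed is None:
        return False
    return installed.split(".")[:2] == supported.split(".")[:2]


def check_viewer_versions():
    """Map each package name to (installed version or None, matches?)."""
    report = {}
    for name, wanted in SUPPORTED.items():
        found = installed_version(name)
        report[name] = (found, matches_supported(found, wanted))
    return report


for name, (found, ok) in check_viewer_versions().items():
    status = "ok" if ok else "differs from the listed version"
    print("{}: installed={} ({})".format(name, found, status))
```

A mismatch here does not necessarily mean the Viewer will fail; it only flags that the environment differs from the versions listed in the note.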

Some tutorials also use Stanford CoreNLP for pre-processing text; you will be prompted to install it when you run run.sh.

Running

After installing, just run:

./run.sh

Q & A

Many questions about Snorkel get answered in the issues section, along with general discussions and conversations of interest. We tag these all as "Q&A" and save them here.

Issues

We like issues as a place to put bugs, questions, feature requests, etc. Don't be shy! If submitting an issue about a bug, however, please provide a pointer to a notebook (and relevant data) to reproduce it.

Note: if you have an issue with the matplotlib install related to the module freetype, see this post; if you have an issue installing ipython, try upgrading setuptools.

Jupyter Notebook Best Practices

Snorkel is built specifically with usage in Jupyter/IPython notebooks in mind; an incomplete set of best practices for the notebooks:

It's usually most convenient to write most code in an external .py file and load it as a module that's automatically reloaded; use:

%load_ext autoreload
%autoreload 2

A more convenient option is to add these lines to your IPython config file, in ~/.ipython/profile_default/ipython_config.py:

c.InteractiveShellApp.extensions = ['autoreload']
c.InteractiveShellApp.exec_lines = ['%autoreload 2']
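To see what autoreload saves you from doing by hand, the following sketch (Python 3 only) edits a module on disk and shows that the already-imported copy stays stale until it is explicitly reloaded. The module name `my_lfs_demo` and its `label` function are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical helper module, written to a temporary directory for the demo.
MODULE_NAME = "my_lfs_demo"
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, MODULE_NAME + ".py")


def write_module(return_value):
    """(Re)write the external .py file so that label() returns return_value."""
    with open(module_path, "w") as f:
        f.write("def label(x):\n    return {}\n".format(return_value))


# First version of the external file, then import it as a module.
write_module(1)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
my_lfs_demo = importlib.import_module(MODULE_NAME)
before = my_lfs_demo.label("example")

# Edit the file on disk, as you would in an editor next to the notebook.
write_module(-1)
stale = my_lfs_demo.label("example")  # the imported module has not changed yet

# Manual reload; with %autoreload 2, IPython does this before running a cell.
importlib.reload(my_lfs_demo)
after = my_lfs_demo.label("example")

print(before, stale, after)
```

With the autoreload extension enabled, the manual `importlib.reload` step is unnecessary in a notebook, which is why keeping code in an external .py file stays convenient.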
