DeepDive Lite

For simple application prototyping, testing, and development, as well as for learning how to build DeepDive applications.
Motivation
DeepDive Lite provides a lighter-weight interface for creating structured information extraction applications in DeepDive. It is built for rapid prototyping and development, focused on defining an input/output schema and creating a set of labeling functions. The goal is to be able to plug these objects directly into DeepDive proper to get a more scalable, performant, and customizable version of the application.
An immediate motivation is to provide a lighter-weight entry point to the DeepDive application development cycle for non-expert users. DeepDive Lite may also be useful for expert DeepDive users as a toolset for development and prototyping tasks.
DeepDive Lite is also part of a broader effort to answer the following research questions:
- How much progress can be made with the schema and labeling functions as the only user entry points to the application development process?
- To what degree can DeepDive be seen and used as an iterative compiler that takes in a rule-based program and transforms it into a statistical learning and inference-based one?
Installation / dependencies
First of all, make sure all git submodules have been downloaded.
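One standard way to do this with plain git (not specific to DeepDive Lite):

```bash
# download and initialize all git submodules of the repository
git submodule update --init --recursive
```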
We provide a simple way to install everything using virtualenv:
```bash
# set up a Python virtualenv
virtualenv .virtualenv
source .virtualenv/bin/activate
pip install --requirement python-package-requirement.txt
```
Note: if the matplotlib install fails with an error related to the freetype module, see this post; if installing ipython fails, try upgrading setuptools.
Alternatively, the dependencies can be installed system-wide by skipping the virtualenv setup and activation and using sudo pip instead of pip in the last command.
In addition, the Stanford CoreNLP parser jars need to be downloaded; this can be done using:

```bash
./install-parser.sh
```
Finally, DeepDive Lite is built specifically with usage in Jupyter/IPython notebooks in mind.
The jupyter command is installed as part of the steps above, so running the following command within the virtualenv opens our demo notebooks:
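```bash
# from the repository root, inside the activated virtualenv
jupyter notebook
```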
The best way to learn how to use DeepDive Lite is to open up the demo notebooks in the examples folder. GeneTaggerExample_Extraction.ipynb walks through the candidate extraction workflow for an entity tagging task. GeneTaggerExample_Learning.ipynb picks up where the extraction notebook leaves off and demonstrates the labeling function iteration workflow and learning methods. For examples of extracting relations, see GenePhenRelationExample_Extraction.ipynb and GenePhenRelationExample_Learning.ipynb.
Best practices for labeling function iteration
The flowchart below illustrates the labeling function iteration workflow.
First, we generate candidates, a holdout set of candidates (using MindTagger or gold standard labels), and an initial set of labeling functions L(0) for a CandidateModel. Next, we examine the coverage and accuracy of the labeling functions using CandidateModel.plot_lf_stats(), CandidateModel.top_conflict_lfs(), CandidateModel.lowest_coverage_lfs(), and CandidateModel.lowest_empirical_accuracy_lfs(). If coverage is the primary issue, we write a new labeling function and append it to L(0) to form L(1). If accuracy is the primary issue instead, we form L(1) by revising an existing labeling function that may be implemented incorrectly. This process continues until we are satisfied with the labeling function set. We then learn a model and, depending on its performance over the holdout set, re-evaluate our labeling function set as before.
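For concreteness, here is a minimal sketch of one iteration, assuming labeling functions take a candidate and return +1 (true), -1 (false), or 0 (abstain), and that `model` is an existing `CandidateModel`; the labeling function and its trigger word are hypothetical:

```python
# Hypothetical labeling function: vote true when a trigger word appears
# among the candidate sentence's tokens, abstain otherwise.
def LF_trigger_word(c):
    return 1 if 'regulates' in c.words else 0

# Inspect diagnostics for the current labeling function set L(i)
# (method names from the CandidateModel documentation below).
model.plot_lf_stats()                  # coverage / accuracy overview
model.top_conflict_lfs()               # LFs that disagree most with the others
model.lowest_coverage_lfs()            # motivates writing new, broader LFs
model.lowest_empirical_accuracy_lfs()  # motivates revising existing LFs
```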
Several parts of this workflow could result in overfitting, so tie your bootstraps with care:
- Generating a sufficiently large and diverse holdout set before iterating on labeling functions is important for accurate evaluation.
- A labeling function with low empirical accuracy on the holdout set could still work well on the entire data set. This is not a reason to delete it (unless the implementation is incorrect).
- Metrics obtained by tuning learning parameters directly to maximize performance against the holdout set are upward biased and not representative of general performance.
Best practices for using DeepDive Lite notebooks
Here are a few practical tips for working with DeepDive Lite:
- Use autoreload (see the sketch after this list)
- Keep working source code in another file
- Pickle extractions often and with unique names
- Entire objects (extractions and features) subclassed from Candidates can be pickled
- Document past labeling functions either remotely or with the CandidateModel log
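For the first two tips, the standard IPython autoreload magics make edits to an external source file take effect without restarting the kernel (the module name here is hypothetical):

```python
# In the first cell of the notebook: automatically reload modules before
# executing code, so edits to external .py files take effect on re-run.
%load_ext autoreload
%autoreload 2

from my_labeling_functions import *  # hypothetical working source file
```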
Documentation
ddlite_parser.py
Class: Sentence
| Member | Notes |
|---|---|
| `words` | Tokenized words |
| `lemmas` | Tokenized lemmas |
| `poses` | Tokenized parts-of-speech |
| `dep_parents` | Dependency parent index for each token |
| `dep_labels` | Dependency label for each token |
| `sent_id` | Sentence ID |
| `doc_id` | Document ID |
| `text` | Raw sentence text |
| `token_idxs` | Document character offsets for the start of each sentence token |
Class: SentenceParser
| Method | Notes |
|---|---|
| `__init__()` | Starts `CoreNLPServer` |
| `parse(doc, doc_id=None)` | Parse document into `Sentence` objects |
Class: HTMLParser
| Method | Notes |
|---|---|
| `can_parse(f)` | |
| `parse(f)` | Returns visible text in HTML file |
Class: TextParser
| Method | Notes |
|---|---|
| `can_parse(f)` | |
| `parse(f)` | Returns all text in file |
Class: DocParser
| Method | Notes |
|---|---|
| `__init__(path, ftparser=TextParser())` | `path` can be a single file, a directory, or a glob expression |
| `parseDocs()` | Returns docs as parsed by `ftparser` |
| `parseDocSentences()` | Returns `Sentence` objects from `SentenceParser` parsing of doc content |
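A minimal parsing sketch, assuming these classes are importable from the ddlite_parser module documented above; the input directory is hypothetical:

```python
from ddlite_parser import DocParser, TextParser

# 'data/docs/' is a hypothetical directory of plain-text documents.
dp = DocParser('data/docs/', TextParser())

# Parse document content into Sentence objects (starts CoreNLPServer).
sentences = dp.parseDocSentences()
print(len(sentences))
```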
ddlite_matcher.py

Class: DictionaryMatch

| Method | Notes |
|---|---|
| `__init__(label, dictionary, match_attrib='words', ignore_case=True)` | Entire sentence text can be searched using `match_attrib='text'` |
| `apply(sentence)` | Tokens joined with spaces |
Class: MultiMatcher
| Method | Notes |
|---|---|
| `__init__(matcher1, matcher2, ...)` | |
| `apply(sentence)` | Yields each individual matcher's label if no `label` argument was given at initialization |
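A sketch of composing matchers using the signatures in the tables above; the labels and dictionary contents are hypothetical:

```python
from ddlite_matcher import DictionaryMatch, MultiMatcher

# Match gene symbols from a small (hypothetical) dictionary, case-insensitively.
genes = DictionaryMatch('G', ['BRCA1', 'TP53'], ignore_case=True)
# Match phenotype phrases from a second (hypothetical) dictionary.
phenotypes = DictionaryMatch('P', ['breast cancer', 'tumor'])

# With no label argument, MultiMatcher yields each matcher's own label ('G' or 'P').
both = MultiMatcher(genes, phenotypes)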
ddlite.py
Class: Relation
| Member | Notes |
|---|---|
| All `Sentence` members | |
| `prob` | |
| `all_idxs` | |
| `labels` | |
| `xt` | `XMLTree` |
| `root` | `XMLTree` root |
| `tagged_sent` | Sentence text with matched tokens replaced by labels |
| `e1_idxs` | Tokens matched by first matcher |
| `e2_idxs` | Tokens matched by second matcher |
| `e1_label` | First matcher label |
| `e2_label` | Second matcher label |
| Method | Notes |
|---|---|
| `render()` | Generates sentence dependency tree figure with matched tokens highlighted |
Class: Relations
| Member | Notes |
|---|---|
| `feats` | Feature matrix |

| Method | Notes |
|---|---|
| `__init__(content, matcher1=None, matcher2=None)` | `content` is a list of `Sentence` objects or a path to a pickled `Relations` object |
| `[i]` | Access the ith `Relation` |
| `len()` | Number of relations |
| `num_candidates()` | |
| `num_feats()` | |
| `extract_features(*args)` | |
| `dump_candidates(f)` | Pickle object to file |
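A sketch of relation candidate extraction using the constructor above; `sentences`, `genes`, and `phenotypes` are assumed to exist (e.g., from the parsing and matcher sketches earlier):

```python
# Build relation candidates between gene and phenotype mentions.
R = Relations(sentences, matcher1=genes, matcher2=phenotypes)

R.extract_features()                   # populates the R.feats feature matrix
R[0].render()                          # dependency tree with matched tokens highlighted
R.dump_candidates('relations_v1.pkl')  # pickle under a unique, descriptive name
```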
Class: Entity
| Member | Notes |
|---|---|
| All `Sentence` members | |
| `prob` | |
| `all_idxs` | |
| `labels` | |
| `xt` | `XMLTree` |
| `root` | `XMLTree` root |
| `tagged_sent` | Sentence text with matched tokens replaced by label |
| `idxs` | Tokens matched by matcher |
| `label` | Matcher label |
| Method | Notes |
|---|---|
| `render()` | Generates sentence dependency tree figure with matched tokens highlighted |
| `mention(attribute='words')` | Returns list of attribute tokens in the mention |
Class: Entities
| Member | Notes |
|---|---|
| `feats` | Feature matrix |

| Method | Notes |
|---|---|
| `__init__(content, matcher=None)` | `content` is a list of `Sentence` objects or a path to a pickled `Entities` object |
| `[i]` | Access the ith `Entity` |
| `len()` | Number of entities |
| `num_candidates()` | |
| `num_feats()` | |
| `extract_features(*args)` | |
| `dump_candidates(f)` | Pickle object to file |
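Entity extraction mirrors the `Relations` workflow with a single matcher; a sketch with hypothetical file names:

```python
# Build entity candidates with a single matcher and pickle them.
E = Entities(sentences, matcher=genes)
E.extract_features()
print(E[0].mention())                      # tokens of the first matched mention
E.dump_candidates('gene_entities_v1.pkl')  # unique name for this extraction run
```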
Class: CandidateModel
| Member | Notes |
|---|---|
| `C` | `Candidates` object |
| `feats` | Feature matrix |
| `logger` | `ModelLogger` object |
| `LF` | Labeling function matrix |
| `LF_names` | Labeling function names |
| `X` | Joint LF and feature matrix |
| `w` | Learned weights |
| `holdout` | Indices of the holdout set |
| `mindtagger_instance` | `MindTaggerInstance` object |
| Method | Notes |
|---|---|
| `num_candidates()` | |
| `num_feats()` | |
| `num_LFs()` | |
| `set_gold_labels(gold)` | Set gold standard labels |
| `get_ground_truth(gt='resolve')` | Get ground truth from just MindTagger (`gt='mindtagger'`), just gold standard labels (`gt='gold'`), or resolve with conflict priority to gold standard (`gt='resolve'`) |
| `has_ground_truth()` | Get boolean array of candidates with some ground truth |
| `learn_weights(...)` | Learn an elastic net logistic regression on candidates not in the holdout set<br>`nfolds`: number of folds for cross-validation<br>`maxIter`: maximum number of SGD iterations<br>`tol`: tolerance for relative gradient magnitude to weights for convergence<br>`sample`: use batch SGD<br>`n_samples`: batch size for SGD<br>`mu`: sequence of regularization parameters to fit; if `None`, an automatic sequence is chosen such that the largest value is close to the minimum value at which all weights are zero<br>`n_mu`: number of regularization parameters for the automatic sequence<br>`mu_min_ratio`: lower bound for the automatic regularization sequence, as a ratio to the largest value<br>`alpha`: elastic net mixing parameter (0 for l2, 1 for l1)<br>`opt_1se`: use the one-standard-error rule to choose the regularization parameter<br>`use_sparse`: use sparse operations<br>`plot`: show diagnostic plot for cross-validation<br>`verbose`: be verbose<br>`log`: log result in `ModelLogger` |
| `get_link(subset=None)` | Get link function values for all candidates (`subset=None`), an index subset (`subset`), or the holdout set (`subset='holdout'`) |
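A closing sketch of the learning step, assuming `model` is a `CandidateModel` over the candidates above and `learn_weights` follows the table; the argument values are illustrative only, not recommended defaults:

```python
# Fit the elastic net logistic regression on non-holdout candidates,
# with cross-validated choice of regularization strength.
model.learn_weights(nfolds=5, alpha=0.5, plot=True, log=True)

# Link function values on the holdout set, for evaluation.
holdout_links = model.get_link(subset='holdout')
```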