Check out the papermill repo on GitHub.
papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.
Papermill lets you:
- parameterize notebooks
- execute notebooks
This walk-through assumes you have conda installed. If you do not, consider installing it for your operating system following the directions here.
Any time you want to try out a new Python package, I would highly recommend testing it in a separate conda environment. We can create a new environment called `papermill` and install `python` (version `3.6`) into that environment using the following command.
conda create -n papermill python=3.6
(Note: You can omit the version number, which will install whatever `python` version is used in your `base` environment. Papermill should work perfectly well with `python=3.5`, `3.6`, or `3.7`.)
For the examples shown in this walk-through, we're going to install `papermill`, `nteract-scrapbook`, `jupyterlab`, `tensorflow`, `tensorflow-datasets`, and `matplotlib` into our new environment.
conda activate papermill
pip install tornado==5.1.1 papermill nteract-scrapbook jupyterlab # See note below
pip install tensorflow tensorflow-datasets matplotlib
Unfortunately, as of writing (4 March 2019), `pip install papermill` will pull in the latest version of `jupyter` and the latest version of `tornado`. However, `tornado==6.0.1` (released 3 March 2019) is not compatible with `jupyter` at this time. Instead, we must pin `tornado` to version `5.1.1`. (Check the tornado releases page and this GitHub issue to see if this problem has been resolved by now.)
In Jupyter Notebook, configure the cell toolbar to display cell tags by navigating to `View -> Cell Toolbar -> Tags`.
In JupyterLab, you must manually edit the metadata by navigating to `Cell Inspector (Wrench Icon) -> Edit Metadata`, then insert the following:
{
  "tags": [
    "parameters"
  ]
}
Alternatively, install the `jupyterlab/celltags` extension.
jupyter labextension install @jupyterlab/celltags
(Note: labextensions require `node.js`. This can be installed separately or, if Jupyter is installed within a conda environment, it can be installed into the same environment with `conda install nodejs`.)
This is useful for more than enabling papermill parameters. See the repo on GitHub.
Recommended: install `ipywidgets` and the JupyterLab extension `jupyter-widgets/jupyterlab-manager` to enable progress bars in JupyterLab when using the Papermill Python API.
conda activate papermill
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager
Learn more about extensions for JupyterLab in the documentation, or find other extensions by browsing the topic on GitHub.
I'm going to take a stab at this in my own words. Parametrization is the process of identifying components of a system which can be modified while still preserving the general character of the system. For example, the linear function `y = 3x + 1` can be modified by changing the slope `3` or the intercept `1` to different values while still being a linear function; as a result, we call the slope and intercept parameters of the function `y`.
Parametrization is ubiquitous among computer programs. Python scripts can be parametrized using command-line arguments (modules such as `argparse` simplify this process) or by prompting the user for input. Functions are frequently parametrized using keyword arguments. As such, you might think, "Why would I want to parametrize a notebook instead of writing a normal Python script or a (possibly large) function that performs the exact same task?"
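For instance, the linear function above can be parametrized in Python with keyword arguments, and the same parameters can be exposed on the command line (a minimal sketch; the script layout and flag names are illustrative):

import argparse

def y(x, slope=3, intercept=1):
    # The slope and intercept are parameters with default values.
    return slope * x + intercept

if __name__ == "__main__":
    # Expose the same parameters as command-line arguments.
    parser = argparse.ArgumentParser(description="Evaluate y = slope * x + intercept")
    parser.add_argument("x", type=float)
    parser.add_argument("--slope", type=float, default=3)
    parser.add_argument("--intercept", type=float, default=1)
    args = parser.parse_args()
    print(y(args.x, slope=args.slope, intercept=args.intercept))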
Other than the fact that many people enjoy working in notebooks for rapid prototyping, Jupyter Notebooks are built for visualization. In addition, they make data visualization
- convenient: no need to save results in intermediate files;
- reproducible: the notebook is a precise record of how a visualization was produced;
- portable: under the hood, notebooks are pure JSON (light-weight and parsable, as the sketch below shows); and
- distributable: notebooks can be easily converted to HTML or PDF, or distributed in their default `.ipynb` form.
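To see just how parsable a notebook is, you can inspect one with nothing but the standard library (a minimal sketch; replace example.ipynb with any notebook you have on disk):

import json

# A notebook file is plain JSON: load it like any other JSON document.
with open("example.ipynb") as f:
    nb = json.load(f)

# Each cell records its type, source, and metadata (including tags).
for cell in nb["cells"]:
    tags = cell.get("metadata", {}).get("tags", [])
    print(cell["cell_type"], tags)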
So, if your program depends on or would benefit from visualization within a notebook, you can go one step further by parametrizing that notebook. This may allow you to reproduce your visualization
- periodically: passing in the latest data each time (e.g. annual reports);
- "horizontally": applying the same analysis to different inputs (e.g. stock price analysis of different companies); or
- "vertically": applying a slightly different analysis to the same input (e.g. testing different neural networks on the same task)
In practice, you will have already created a notebook that you now want to parametrize. However, for the sake of illustration, we're going to define our parameters first, then write some code which uses our parameters to generate some output.
To parametrize our notebook, we must
- Designate a single cell with the tag `parameters`. (Note: This cell need not be the first cell in the notebook, but it should be located above any cell which depends on the parameters.)
- Define some parameters within the designated cell and give them default values.
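For example, the designated cell might look like this (a minimal sketch; the names int1 and int2 match the execution examples below):

# This cell is tagged `parameters`. At execution time, papermill injects
# a new cell directly below it that overrides these default values.
int1 = 1
int2 = 10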
There are two ways to execute notebooks:
- through the Python API (`import papermill as pm`), or
- from the command line.
The Python API can be used to intuitively iterate over any number of parameter combinations.
import papermill as pm

# Execute the notebook once per (int1, int2) combination, writing each
# run to its own output notebook.
for i in [1, 2, 3]:
    for j in [10, 20, 30]:
        pm.execute_notebook(
            "multiple-params.ipynb",
            f"output-notebooks/multiple-params-{i}{j}.ipynb",
            parameters=dict(int1=i, int2=j)
        )
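From the command line, a single execution looks like this, with each `-p` flag passing one parameter name and value:

papermill multiple-params.ipynb output-notebooks/multiple-params-1-10.ipynb -p int1 1 -p int2 10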
Suppose you have multiple YAML files, each specifying a particular set of parameters you want to execute with. The following command will run papermill on each parameter file and inject the name of that parameter file into the output notebook's filename.
ls *.yaml | xargs -n1 -I {} papermill -f {} input.ipynb output/output-{}.ipynb
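Each parameter file is just a YAML mapping from parameter names to values; for the notebook above, a hypothetical params-a.yaml might contain:

int1: 1
int2: 20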
You can also create nested `for` loops in `bash` to specify a grid of parameters on which to execute.
for i in 1 2 3; do
  for j in 4 5 6; do
    papermill input.ipynb output/output$i$j.ipynb -p p1 $i -p p2 $j;
  done;
done;
- Hyperparameter search for MNIST-trained CNN
  - Possible parameters:
    - network depth
    - loss function
    - convolution parameters:
      - kernel size
      - kernel initializer
      - kernel regularizer
      - padding
      - activation
    - data format (may affect training wall-time)
    - use bias / bias initializer
    - use augmentation
    - use batch norm
    - use dropout
      - dropout probability
    - layer order (conv -> BN -> activation)
    - flatten -> dense vs. GAP -> dense
  - Visualization:
    - network diagram
    - weights (histograms / normalized, single-channel imshow)
    - image at each layer
    - metrics over time
    - cross-validation results (mean and variance)
    - failure analysis
- Company / Sector stock market analysis
  - Parameters:
    - company name / ticker symbol / sector label
    - time interval
  - Visualization:
    - stock price
    - trading volume
    - headlines (web-scraped)
    - reported earnings
    - price split
    - dividends
    - acquisitions
    - sentiment analysis
    - performance against previous year(s)
    - performance against companies in same sector
- Genetic Programming (specifically with `gplearn`)
  - Parameters:
    - target function (pickled!)
    - function set
    - many hyperparameters, e.g. mutation probabilities
  - Visualization:
    - fitness vs. population size (fixed # generations)
    - fitness over time (fixed pop. size)
    - average program size over time (fixed pop. size)
    - wall-time-per-generation over time
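As a concrete sketch of how the first example might be driven (the notebook name mnist-cnn.ipynb and its parameter names are hypothetical), a small hyperparameter grid can be swept with the Python API:

import papermill as pm

# Hypothetical notebook and parameter names for the MNIST CNN example.
for kernel_size in [3, 5]:
    for dropout in [0.0, 0.25, 0.5]:
        pm.execute_notebook(
            "mnist-cnn.ipynb",
            f"output-notebooks/mnist-cnn-k{kernel_size}-d{dropout}.ipynb",
            parameters=dict(kernel_size=kernel_size, dropout=dropout)
        )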