python4data-papermill

A presentation on Papermill for the "Python for Data Science" group @ University of Idaho

An introduction to the papermill package

(and the scrapbook package if we have time...)

papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.

Papermill lets you:

  • parameterize notebooks
  • execute notebooks

Before we begin

This walk-through assumes you have conda installed. If you do not, consider installing it for your operating system following the directions here.

Set up a new conda environment

Any time you want to try out a new Python package, I would highly recommend testing it in a separate conda environment. We can create a new environment called papermill and install python (version 3.6) into that environment using the following command.

conda create -n papermill python=3.6

(Note: You can omit the version number, in which case conda will install its current default Python version. Papermill should work perfectly well with python=3.5, 3.6, or 3.7.)

Activate environment and install packages

For the examples shown in this walk-through, we're going to install papermill, nteract-scrapbook, jupyterlab, tensorflow, tensorflow-datasets, and matplotlib into our new environment.

conda activate papermill
pip install tornado==5.1.1 papermill nteract-scrapbook jupyterlab  # See note below
pip install tensorflow tensorflow-datasets matplotlib

Unfortunately, as of writing (4 March 2019), pip install papermill will pull in the latest version of jupyter and the latest version of tornado. However, tornado==6.0.1 (released 3 March 2019) is not compatible with jupyter at this time. Instead, we must pin tornado to version 5.1.1. (Check the tornado releases page and this GitHub issue to see if this problem has been resolved by now.)

Setting up Jupyter Notebook and JupyterLab for use with papermill

In Jupyter Notebook, configure the cell toolbar to display cell tags by navigating to View -> Cell Toolbar -> Tags.

gif: how to enable parameters in Jupyter Notebook

In JupyterLab, you must manually edit the metadata by navigating to Cell Inspector (Wrench Icon) -> Edit Metadata, then insert the following:

{
    "tags": [
        "parameters"
    ]
}

gif: edit meta data in JupyterLab

Alternatively, install the jupyterlab/celltags extension.

jupyter labextension install @jupyterlab/celltags

(Note: labextensions require node.js. This can be installed separately or, if Jupyter is installed within a conda environment, it can be installed into the same environment with conda install nodejs.)

gif: celltags extension in JupyterLab

This is useful for more than enabling papermill parameters. See the repo on GitHub.

Recommended: install ipywidgets and the JupyterLab extension jupyter-widgets/jupyterlab-manager to enable progress bars in JupyterLab when using the Papermill Python API.

conda activate papermill
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager

gif: ipywidget progress bars

Learn more about extensions for JupyterLab in the documentation, or find other extensions by browsing the topic on GitHub.

What is parametrization?

I'm going to take a stab at this in my own words. Parametrization is the process of identifying components of a system which can be modified while still preserving the general character of the system. For example, the linear function y = 3x + 1 can be modified by changing the slope 3 or the intercept 1 to different values while still being a linear function; as a result, we call the slope and intercept parameters of the function y.
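In code, that same idea looks like an ordinary parametrized function (a toy sketch of the example above):

```python
def linear(x, slope=3, intercept=1):
    """A family of linear functions; slope and intercept are its parameters."""
    return slope * x + intercept

print(linear(2))           # y = 3*2 + 1 -> 7
print(linear(2, slope=5))  # changing a parameter still yields a linear function -> 11
```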

Why should I be interested in parametrizing a notebook?

Parametrization is ubiquitous among computer programs. Python scripts can be parametrized using command-line arguments (modules such as argparse simplify this process) or by prompting the user for input. Functions are frequently parametrized using keyword arguments. As such, you might think, "Why would I want to parametrize a notebook instead of writing a normal Python script or a (possibly large) function that performs the exact same task?"
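For comparison, this is roughly what command-line parametrization of a plain script looks like with argparse (the flag names and defaults here are made up):

```python
import argparse

# A script parametrized via command-line flags; argparse handles parsing,
# type conversion, defaults, and --help text.
parser = argparse.ArgumentParser(description="Toy parametrized script")
parser.add_argument("--slope", type=float, default=3.0)
parser.add_argument("--intercept", type=float, default=1.0)

# Normally parse_args() reads sys.argv; a list is passed here for illustration.
args = parser.parse_args(["--slope", "5"])
print(args.slope * 2 + args.intercept)  # -> 11.0
```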

Other than the fact that many people enjoy working in notebooks for rapid prototyping, Jupyter Notebooks are built for visualization. In addition, they make data visualization

  1. convenient: no need to save results in intermediate files;
  2. reproducible: the notebook is a precise record of how a visualization was produced;
  3. portable: under the hood, notebooks are pure JSON (light-weight and parsable); and
  4. distributable: notebooks can be easily converted to HTML or PDF, or distributed in their default ipynb form.
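Point 3 is easy to check with the standard library alone. The sketch below builds a minimal, hypothetical notebook following the nbformat-4 JSON layout and round-trips it through json:

```python
import json

# A minimal notebook in the nbformat 4 JSON layout (cell list + metadata).
nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hi')"],
         "execution_count": None, "outputs": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# An .ipynb file on disk is just this structure serialized as JSON text.
text = json.dumps(nb)
loaded = json.loads(text)
code_cells = [c for c in loaded["cells"] if c["cell_type"] == "code"]
print(len(code_cells))  # -> 1
```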

So, if your program depends on or would benefit from visualization within a notebook, you can go one step further by parametrizing that notebook. This may allow you to reproduce your visualization

  1. periodically: passing in the latest data each time (e.g. annual reports);
  2. "horizontally": applying the same analysis to different inputs (e.g. stock price analysis of different companies); or
  3. "vertically": applying a slightly different analysis to the same input (e.g. testing different neural networks on the same task).

Create a parametrized notebook

In practice, you will have already created a notebook that you now want to parametrize. However, for the sake of illustration, we're going to define our parameters first, then write some code which uses our parameters to generate some output.

To parametrize our notebook, we must

  1. Designate a single cell with the tag parameters. (Note: This cell need not be the first cell in the notebook, but should be located above any cell which depends on the parameters.)
  2. Define some parameters within the designated cell and give them default values.
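A parameters cell is ordinary Python; at execution time, papermill injects a new cell directly below it assigning the values you pass in, so the values here act as defaults. The names below are illustrative:

```python
# This cell carries the "parameters" tag. Papermill inserts an injected cell
# right after it with the overriding values, so these serve as defaults.
int1 = 1
int2 = 10
```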

Execute a Notebook

There are two ways to execute notebooks:

  1. through the Python API (import papermill as pm), or
  2. from the command line

Python API:

The Python API can be used to intuitively iterate over any number of parameter combinations.

import papermill as pm

for i in [1, 2, 3]:
    for j in [10, 20, 30]:
        pm.execute_notebook(
            "multiple-params.ipynb",
            f"output-notebooks/multiple-params-{i}{j}.ipynb",
            parameters = dict(int1=i, int2=j)
        )

Command-line interface:

Suppose you have multiple YAML files, each specifying a particular set of parameters you want to execute with. The following command will run papermill on each parameter file and inject the name of that parameter file into the output notebook's filename.

ls *.yaml | xargs -n1 -I {} papermill -f {} input.ipynb output/output-{}.ipynb

You can also create nested for loops in bash to specify a grid of parameters on which to execute.

for i in 1 2 3; do
    for j in 4 5 6; do
        papermill input.ipynb output/output$i$j.ipynb -p p1 $i -p p2 $j;
    done;
done;

Can we parametrize only numbers, strings, lists, and dicts?
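Not quite anything goes: papermill injects parameter values into the notebook as Python assignments, so they are limited to values that can be written as literals. One common workaround, sketched below with made-up names, is to serialize a richer object (e.g. with pickle) and pass it as a string or file path parameter:

```python
import base64
import pickle

# A stand-in for any object that can't be written as a literal
# (e.g. a fitted model or a target function's data).
obj = {"coeffs": [3, 1]}

# Encode the pickled bytes as an ASCII string; this string is what you
# would pass as a papermill parameter (e.g. -p payload "<string>").
param = base64.b64encode(pickle.dumps(obj)).decode("ascii")

# ...and inside the notebook, the parameter is decoded back:
restored = pickle.loads(base64.b64decode(param))
print(restored == obj)  # -> True
```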

Example use cases

  1. Hyperparameter search for MNIST-trained CNN

    1. Possible parameters:
      1. network depth
      2. loss function
      3. convolution parameters
        1. kernel size
        2. kernel initializer
        3. kernel regularizer
        4. padding
        5. activation
        6. data format (may affect training wall-time)
        7. use bias / bias initializer
      4. use augmentation
      5. use batch norm
      6. use dropout
      7. dropout probability
      8. layer order (conv -> BN -> activation)
      9. flatten -> dense vs GAP -> dense
    2. Visualization:
      1. network diagram
      2. weights (histograms / normalized, single channel imshow)
      3. image at each layer
      4. metrics over time
      5. cross-validation results (mean and variance)
      6. failure analysis
  2. Company / Sector stock market analysis

    1. Parameters:
      1. company name / ticker symbol / sector label
      2. time interval
    2. Visualization:
      1. stock price
      2. trading volume
      3. headlines (web-scraped)
        1. reported earnings
        2. price split
        3. dividends
        4. acquisitions
      4. sentiment analysis
      5. performance against previous year(s)
      6. performance against companies in same sector
  3. Genetic Programming (specifically with gplearn)

    1. Parameters:
      1. target function (pickled!)
      2. function set
      3. many hyperparameters, e.g. mutation probabilities
    2. Visualization:
      1. fitness vs population size (fixed # generations)
      2. fitness over time (fixed pop. size)
      3. average program size over time (fixed pop. size)
      4. wall-time-per-generation over time