This repository contains a Meltano project that demonstrates good practices for using Jupyter inside Meltano, both to analyze data and to use Jupyter notebooks as transformation tooling inside an ELT process.
Use batect to run everything fully dockerized. The project already contains a notebook with some transformation logic inside.
As of Sep-2022, Jupyter is not yet available as a utility on the Hub, so you need to add it to your Meltano project as a (local) custom plugin. Even once it is available as a plugin on the Hub, you will probably want to do some customization, so it makes sense to create your own version.
The steps involved are:
1. Add a (local) custom Jupyter utility
2. Add any Python libraries you will need
3. Optional: Add database connection variables to be exposed into the environment
4. Add papermill or nbconvert to execute your data transformations, and customize commands for Jupyter to run the notebook
5. Execute your notebooks & add them to a schedule
Steps 1-4 customize the meltano.yml to suit your setup. If you're comfortable with the process, read through them, go straight to step 4, copy & customize the YAML block from there, and you're set to go.
You can add a custom plugin either via the CLI or via the YAML file. As Jupyter serves multiple purposes, "utility" is the recommended plugin type.
For Jupyter, you can choose either the classic notebook, installed via the pip package "jupyter", or the newer JupyterLab, installed via the pip package "jupyterlab".
Note: The code snippets use jupyterlab; if you want the classic notebook, just replace jupyterlab with jupyter. The "executable" stays the same.
Via the YAML file, add the following code block inside your meltano.yml:
```yaml
plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
    - name: jupyterlab
      namespace: jupyterlab
      pip_url: jupyterlab
      executable: jupyter
      commands:
        launch_ip0: # important for Mac users running Meltano inside Docker
          args: lab --ip=0.0.0.0
          description: Start lab server, on any IP range, for Mac users inside Docker.
        launch:
          args: lab
          description: Start lab server
```
Run `meltano install` to ensure the correctness of your YAML file. Then run `meltano invoke jupyterlab:launch` to launch the GUI.
Using the command line, you can instead run `meltano add --custom utility jupyterlab` and fill out these properties interactively.
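For example (the exact prompts may vary between Meltano versions; the values below simply mirror the YAML definition above):

```bash
meltano add --custom utility jupyterlab
# Meltano prompts for the plugin properties interactively, e.g.:
#   namespace:  jupyterlab
#   pip_url:    jupyterlab
#   executable: jupyter
```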
To work with Jupyter notebooks, you will end up using additional Python libraries, which generally fall into three categories:
- helper libraries like matplotlib or pandas
- connection libraries like SQLAlchemy (and psycopg2)
- nbconvert or papermill to execute the notebook; these are handled in step 4
To add additional Python libraries, extend the pip_url in your meltano.yml with space-separated pip package names after jupyterlab. An example:
```yaml
plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
    - name: jupyterlab
      namespace: jupyterlab
      pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary
      executable: jupyter
      commands:
        launch_ip0: # important for Mac users running Meltano inside Docker
          args: lab --ip=0.0.0.0
          description: Start lab server, on any IP range, for Mac users inside Docker.
        launch:
          args: lab
          description: Start lab server
```
To connect to data sources across different plugins, it is useful to expose the connection details as environment variables. Meltano can do this in the meltano.yml. Here is an example configuration using plain-text connection details:
```yaml
version: 1
default_environment: dev
environments:
  - name: dev
    config:
      env:
        PG_HOST: postgres
        PG_PORT: "5432"
        PG_DB: demo
        PG_USER: admin
        PG_PWD: password
```
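Since Meltano expands `$VAR` references in plugin configuration, the same variables can also feed other plugins in your project. A hypothetical loader configuration as an illustration (check your loader's actual setting names):

```yaml
plugins:
  loaders:
    - name: target-postgres # hypothetical loader, for illustration only
      config:
        host: $PG_HOST
        port: $PG_PORT
        dbname: $PG_DB
        user: $PG_USER
```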
You can read more on Meltano and environment variables here. These variables will then be accessible inside your Jupyter notebooks, e.g.:
```python
import os

PG_HOST = os.getenv("PG_HOST", default=None)
PG_PORT = os.getenv("PG_PORT", default=None)
PG_DB = os.getenv("PG_DB", default=None)
PG_USER = os.getenv("PG_USER", default=None)
PG_PWD = os.getenv("PG_PWD", default=None)
```
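With the connection libraries from step 2 installed, you would typically combine these variables into a SQLAlchemy engine inside the notebook. A minimal sketch, assuming the PostgreSQL details from the environment block above:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Build a PostgreSQL connection URL from the variables Meltano exposes.
url = (
    f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PWD')}"
    f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DB')}"
)
engine = create_engine(url)

# Smoke-test the connection; replace with your actual queries.
df = pd.read_sql("SELECT 1 AS one", engine)
```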
Whether your notebook outputs machine-learning accuracy data or a business-intelligence report, or transforms data inside your data warehouse, you will likely want to execute it without the GUI, using either:
- the CLI
- or, automatically, the Meltano scheduler

Jupyter and JupyterLab both offer nbconvert as the default option to execute notebooks. You can also use the additional Python package papermill. Papermill allows you to parametrize notebooks for execution, while nbconvert only executes them.
If you want to use nbconvert, add a command to the plugin; replace "notebook/sql_magic.ipynb" with your notebook path:
```yaml
- name: jupyterlab
  namespace: jupyterlab
  pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary papermill
  executable: jupyter
  commands:
    launch_ip0:
      args: lab --ip=0.0.0.0
      description: Start lab server, on any IP range, for Mac users inside Docker.
    launch:
      args: lab
      description: Start lab server
    execute:
      args: nbconvert --to notebook --execute notebook/sql_magic.ipynb
      description: Execute the notebook with nbconvert
```
You can then run an execution using `meltano invoke jupyterlab:execute`.
If you want to use papermill, the easiest option is to use plugin inheritance to reuse the venv of the existing jupyterlab plugin. That way, you will not need to install JupyterLab and all its dependencies twice. Here's an example YAML block:
```yaml
- name: papermill
  inherit_from: jupyterlab
  executable: papermill
  commands:
    execute:
      args: notebook/sql_magic.ipynb output/output.ipynb -p price_1 1000
      description: Execute the notebook with papermill, overriding the price_1 parameter
```
You will need to adapt the "args" to your notebook path, output path, and parameters. The example notebook here has one cell with a defined parameter "price_1", which we are able to override from the outside. For details, refer to the (pleasantly short) papermill documentation; it's a simple process.
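For reference, papermill identifies the overridable values by a cell tagged `parameters` inside the notebook. The parameter cell in the example notebook looks roughly like this (the default shown is hypothetical; `-p price_1 1000` above overrides it at run time):

```python
# This cell carries the "parameters" tag in Jupyter, so papermill
# injects a new cell with the overriding values right after it.
price_1 = 100  # default, overridden via -p price_1 1000
```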
Putting it all together, you will end up with a meltano.yml like this:
```yaml
plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
    - name: jupyterlab
      namespace: jupyterlab
      pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary papermill
      executable: jupyter
      commands:
        launch_ip0:
          args: lab --ip=0.0.0.0
          description: Start lab server, on any IP range, for Mac users inside Docker.
        launch:
          args: lab
          description: Start lab server
        execute:
          args: nbconvert --to notebook --execute notebook/sql_magic.ipynb
          description: Execute the notebook with nbconvert
    - name: papermill
      inherit_from: jupyterlab
      executable: papermill
      commands:
        execute:
          args: notebook/sql_magic.ipynb output/output.ipynb -p price_1 1000
          description: Execute the notebook with papermill, overriding the price_1 parameter
```
Then execute `meltano invoke papermill:execute` to run your notebook, and possibly include it in your Meltano pipeline.
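As a sketch of that last step, assuming Meltano 2.x job/schedule syntax and a hypothetical extractor/loader pair, you could wire the `papermill:execute` command into a scheduled job like this:

```yaml
jobs:
  - name: elt-with-notebook
    tasks:
      - tap-mysource target-postgres papermill:execute # hypothetical tap/target names
schedules:
  - name: nightly-elt-with-notebook
    interval: "@daily"
    job: elt-with-notebook
```

The same pipeline can be run ad hoc with `meltano run tap-mysource target-postgres papermill:execute`.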