Data Lineage for Databases and Data Lakes

data-lineage is an open source application to query and visualize data lineage in databases, data warehouses and data lakes in AWS and GCP.

data-lineage aims to be fast, simple to set up, and able to support analysis of the lineage. To achieve these goals, data-lineage has the following features:

Generate data lineage from query history. Most databases maintain query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
Use the networkx graph library to create a DAG of the lineage. networkx graphs give programmatic access to the lineage, which opens up rich opportunities for analysis.
Integrate with Jupyter Notebooks. Jupyter Notebooks provide an excellent IDE to generate, manipulate and analyze data lineage graphs.
Use Plotly to visualize the graph with rich annotations. Plotly supports tooltips, color coding and weights based on different attributes of the graph.
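
The first feature above, deriving lineage from query history, can be illustrated with a minimal sketch. This is not how data-lineage parses SQL. The sample queries, table names and the `lineage_edges` helper are hypothetical, and a regular expression stands in for a real SQL parser, so it only handles simple INSERT INTO ... SELECT statements.

```python
import re

# Hypothetical query-history entries; a real history would be read from the database.
QUERY_HISTORY = [
    "INSERT INTO staging_orders SELECT * FROM raw_orders",
    "INSERT INTO daily_revenue SELECT o.day, SUM(o.amount) "
    "FROM staging_orders o JOIN currency_rates r ON o.currency = r.code "
    "GROUP BY o.day",
    "SELECT * FROM daily_revenue",  # no target table, so it adds no lineage
]

# Simplifying assumption: the target follows INSERT INTO, sources follow FROM or JOIN.
TARGET_RE = re.compile(r"^\s*INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(queries):
    """Return sorted (source_table, target_table) pairs for INSERT ... SELECT queries."""
    edges = set()
    for query in queries:
        target = TARGET_RE.match(query)
        if target is None:
            continue  # plain SELECTs do not write to a table
        for source in SOURCE_RE.findall(query):
            edges.add((source, target.group(1)))
    return sorted(edges)


edges = lineage_edges(QUERY_HISTORY)
for source, target in edges:
    print(f"{source} -> {target}")
```

Each pair is a directed edge from a source table to the table derived from it, which is the shape a networkx `DiGraph` accepts through `add_edges_from`.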

Check out an example data lineage notebook.

Use Cases

Data Lineage enables the following use cases:

Business Rules Verification
Change Impact Analysis
Data Quality Verification
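
As a rough illustration of change impact analysis, the sketch below builds a small lineage DAG directly with networkx and asks which tables sit downstream or upstream of a given table. The table names and the `impacted_by` and `upstream_of` helpers are hypothetical. It uses only standard networkx calls and does not show the data-lineage API.

```python
import networkx as nx

# Hypothetical lineage: each edge points from a source table to a table derived from it.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw_orders", "staging_orders"),
    ("staging_orders", "daily_revenue"),
    ("currency_rates", "daily_revenue"),
    ("daily_revenue", "finance_dashboard"),
])

# Lineage should contain no cycles; fail early if it does.
if not nx.is_directed_acyclic_graph(lineage):
    raise ValueError("lineage graph contains a cycle")


def impacted_by(graph, table):
    """Tables that may be affected, directly or transitively, by a change to `table`."""
    return sorted(nx.descendants(graph, table))


def upstream_of(graph, table):
    """Tables that `table` depends on, directly or transitively."""
    return sorted(nx.ancestors(graph, table))


impacted = impacted_by(lineage, "staging_orders")
sources = upstream_of(lineage, "daily_revenue")
print("Changing staging_orders may affect:", impacted)
print("daily_revenue is built from:", sources)
```

The upstream query fits data quality verification in the same way: when a table looks wrong, its ancestors are the candidates to inspect.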

Check out the post on using data lineage for cost control for an example of how data lineage can be used in production.

Quick Start

# Install packages
pip install data-lineage
pip install jupyter

jupyter notebook

# Check out the example notebook: http://tokern.io/docs/data-lineage/example/

Supported Technologies

Postgres
AWS Redshift
Snowflake

Coming Soon

MySQL
SparkSQL
Presto

Documentation

For advanced usage, please refer to the data-lineage documentation.

Survey

Please take this survey if you use data-lineage or are considering using it. Responses will help us prioritize features.

Developer Setup

# Install dependencies
pipenv install --dev

# Set up pre-commit and pre-push hooks
pipenv run pre-commit install -t pre-commit
pipenv run pre-commit install -t pre-push

dorianj/data-lineage