data-lineage is an open source application to query and visualize data lineage in databases, data warehouses and data lakes in AWS and GCP.
data-lineage aims to be fast, simple to set up, and useful for analyzing lineage. To achieve these goals, data-lineage has the following features:
- Generate data lineage from query history. Most databases retain query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
- Use the networkx graph library to create a DAG of the lineage. Networkx graphs give programmatic access to the lineage, which opens up rich opportunities to analyze it.
- Integrate with Jupyter Notebooks. Jupyter Notebooks provide an excellent IDE to generate, manipulate and analyze data lineage graphs.
- Use Plotly to visualize the graph with rich annotations. Plotly supports tool tips, color coding and weights based on different attributes of the graph.
Check out an example data lineage notebook.
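To make the idea of generating lineage from query history concrete, here is a minimal sketch that extracts (source, target) table pairs from a list of SQL statements. It is not data-lineage's own parser: the regular expressions stand in for a real SQL parser, and the table names, `QUERY_HISTORY`, and `lineage_edges` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical query history; in practice this would be read from the
# database's query log.
QUERY_HISTORY = [
    "INSERT INTO page_lookup SELECT * FROM page JOIN redirect ON page.id = redirect.page_id",
    "INSERT INTO filtered_pagecounts SELECT * FROM pagecounts WHERE views > 10",
    "INSERT INTO normalized_pagecounts SELECT * FROM page_lookup "
    "JOIN filtered_pagecounts ON page_lookup.title = filtered_pagecounts.title",
    "SELECT count(*) FROM page",  # read-only query: produces no lineage
]

# Simplified patterns (an assumption): they only handle plain
# INSERT INTO ... SELECT ... FROM/JOIN statements with unquoted names.
TARGET_RE = re.compile(r"INSERT\s+INTO\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def lineage_edges(queries):
    """Return sorted (source_table, target_table) pairs found in the queries."""
    edges = set()
    for query in queries:
        target = TARGET_RE.search(query)
        if target is None:
            continue  # statements that write nothing add no edges
        for source in SOURCE_RE.findall(query):
            edges.add((source, target.group(1)))
    return sorted(edges)


for source, target in lineage_edges(QUERY_HISTORY):
    print(f"{source} -> {target}")
```

Each pair can then serve as a directed edge in a lineage graph. Real query logs contain subqueries, CTEs and quoted identifiers that this sketch does not handle.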
Data Lineage enables the following use cases:
- Business Rules Verification
- Change Impact Analysis
- Data Quality Verification
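As an illustration of change impact analysis, the sketch below builds a small lineage DAG directly with networkx and asks which tables sit downstream or upstream of a given table. It uses only standard networkx calls rather than data-lineage's own API; the edge list and the helpers `impacted_by` and `upstream_of` are hypothetical.

```python
import networkx as nx

# Hypothetical lineage edges: (source table, derived table).
EDGES = [
    ("page", "page_lookup"),
    ("redirect", "page_lookup"),
    ("pagecounts", "filtered_pagecounts"),
    ("page_lookup", "normalized_pagecounts"),
    ("filtered_pagecounts", "normalized_pagecounts"),
]

graph = nx.DiGraph()
graph.add_edges_from(EDGES)

# Lineage should never contain cycles; verify before analyzing.
assert nx.is_directed_acyclic_graph(graph)


def impacted_by(lineage, table):
    """Tables that may be affected if `table` changes (all descendants)."""
    return sorted(nx.descendants(lineage, table))


def upstream_of(lineage, table):
    """Tables that `table` is derived from, directly or indirectly."""
    return sorted(nx.ancestors(lineage, table))


print(impacted_by(graph, "page"))
print(upstream_of(graph, "normalized_pagecounts"))
```

The same traversal supports data quality verification: when a derived table looks wrong, its ancestors are the candidates to inspect.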
Check out the post on using data lineage for cost control for an example of how data lineage can be used in production.
# Install packages
pip install data-lineage
pip install jupyter
jupyter notebook
# Check out the example notebook: http://tokern.io/docs/data-lineage/example/
- Postgres
- AWS Redshift
- Snowflake
- MySQL
- SparkSQL
- Presto
For advanced usage, please refer to the data-lineage documentation.
Please take this survey if you are a user or considering using data-lineage. Responses will help us prioritize features better.
# Install dependencies
pipenv install --dev
# Set up pre-commit and pre-push hooks
pipenv run pre-commit install -t pre-commit
pipenv run pre-commit install -t pre-push