This repo contains the code for loading data into Mattermost's data warehouse, performing transformations, and exporting data to other tools.
```
.
├── dags                     <--- Airflow DAGs. DAGs mostly use KubernetesOperator to run a job.
├── extract                  <--- Python scripts that extract data from various locations.
│   ├── pg_import            <--- Generic utility that copies data from Snowflake to Postgres.
│   └── s3_extract           <--- Various utilities for importing data from S3 to Snowflake.
├── k8s                      <--- Pod definitions for Pipelinewise.
├── load
│   └── snowflake            <--- Snowflake role definitions.
├── README.md                <--- This file.
├── poetry.lock              <--- Pinned dependency versions.
├── pyproject.toml           <--- PEP-518 build system requirements.
├── requirements.txt         <--- Dependencies (deprecated).
├── transform
│   ├── snowflake-dbt        <--- Snowflake dbt models.
│   ├── mattermost-analytics <--- New dbt project for Mattermost analytics.
│   └── sql                  <--- SQL scripts that get executed by DAGs.
└── utils                    <--- Various Python scripts.
```
In more detail:

- `extract`: Tools for extracting data from S3 to Snowflake and from Snowflake to Postgres. All extractions are executed using Airflow on a Kubernetes cluster.
- `transform`: dbt projects for running transformations on Snowflake (they run on dbt Cloud) and SQL scripts for NPS feedback, executed by an Airflow DAG.
- `load`: Snowflake role definitions. An Airflow DAG runs the update.
- Invoice building for subscriptions, triggered via an Airflow DAG.
- `utils`: A set of Python scripts performing custom ETLs. The utilities run as part of Airflow DAGs.
- `dags`: Airflow DAGs that orchestrate ETL pipelines.
To set up a local development environment you will need:

- Docker & docker compose
- Snowflake access with proper roles (ask the team which role to use)
Copy `.dbt.env.example` to `.dbt.env`. Edit the file and replace the placeholders with appropriate values.
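A minimal sketch, assuming you are at the root of the repo:

```bash
# Create your local environment file from the template, then fill in real values
cp .dbt.env.example .dbt.env
```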
Then, at the root of the repo, run:

```bash
make dbt-bash
```

This command creates a container with dbt pre-installed and connects you to a bash shell inside it.
To test that everything is working as expected, try to generate the dbt docs:

```bash
dbt docs generate -t prod
```
If the output looks similar to:
```
...
07:28:30  Found 289 models, 10 tests, 0 snapshots, 0 analyses, 407 macros, 0 operations, 15 seed files, 263 sources, 0 exposures, 0 metrics
07:28:30
07:29:00  Concurrency: 8 threads (target='prod')
07:29:00
07:32:04  Done.
07:32:04  Building catalog
07:44:19  Catalog written to /usr/app/target/catalog.json
```
then the setup is working.
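As an extra (optional) check from the same shell, dbt's built-in `debug` command verifies that the profile and connection are set up correctly:

```bash
dbt debug -t prod
```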
For the new project, you can instead run `make dbt-mattermost-analytics` to start a bash shell for it.
To generate and browse the dbt docs locally, run the following at the root of the repo:

```bash
make dbt-docs
```

This command will generate the docs and serve them at http://localhost:8081.
All dbt commands that are part of the Makefile use the profiles under `transform/snowflake-dbt/profile` by default. This directory can be overridden by setting the `DBT_PROFILE_PATH` environment variable:

```bash
DBT_PROFILE_PATH=/path/to/profile make dbt-bash
```

Note that `DBT_PROFILE_PATH` can use either absolute or relative paths. For relative paths, the base is the build directory of this project.
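For example, a relative override might look like this (the profile directory shown here is hypothetical, assuming one exists for the new project):

```bash
# Hypothetical relative path, resolved against the project's build directory
DBT_PROFILE_PATH=transform/mattermost-analytics/profile make dbt-mattermost-analytics
```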
To install the Python dependencies, run the following commands:

```bash
poetry install

# Clearbit cannot be installed as part of poetry dependencies, as it's a really old dependency.
poetry run pip install clearbit==0.1.7
```
Additional dependencies can be specified in `pyproject.toml`. Poetry's documentation provides examples. Please prefer using the poetry CLI, as it also updates the `poetry.lock` file and "pins" any new dependencies.
Note that there is currently a `requirements.txt` file. This file will be deprecated.
Snowflake connection details can be configured by adding the proper environment variables. Each role requires a different set of environment variables. The following table describes the required environment variables for each role:
| Role | Environment variable | Description |
|---|---|---|
| SYSADMIN | SNOWFLAKE_USER | Username |
| | SNOWFLAKE_PASSWORD | Password |
| | SNOWFLAKE_ACCOUNT | Snowflake account to connect to |
| | SNOWFLAKE_LOAD_DATABASE | Database to load data to |
| | SNOWFLAKE_LOAD_WAREHOUSE | Warehouse to load data to |
| ANALYTICS_LOADER | SNOWFLAKE_LOAD_USER | Username |
| | SNOWFLAKE_LOAD_PASSWORD | Password |
| | SNOWFLAKE_ACCOUNT | Snowflake account to connect to |
| | SNOWFLAKE_TRANSFORM_DATABASE | Database to load data to |
| | SNOWFLAKE_LOAD_WAREHOUSE | Warehouse to load data to |
| LOADER | SNOWFLAKE_LOAD_USER | Username |
| | SNOWFLAKE_LOAD_PASSWORD | Password |
| | SNOWFLAKE_ACCOUNT | Snowflake account to connect to |
| | SNOWFLAKE_LOAD_DATABASE | Database to load data to |
| | SNOWFLAKE_LOAD_WAREHOUSE | Warehouse to load data to |
| TRANSFORMER | SNOWFLAKE_USER | Username |
| | SNOWFLAKE_PASSWORD | Password |
| | SNOWFLAKE_ACCOUNT | Snowflake account to connect to |
| | SNOWFLAKE_TRANSFORM_DATABASE | Database to use for transforming data |
| | SNOWFLAKE_TRANSFORM_WAREHOUSE | Warehouse to store transformed data to |
| PERMISSIONS | PERMISSION_BOT_USER | Username |
| | PERMISSION_BOT_PASSWORD | Password |
| | PERMISSION_BOT_ACCOUNT | Snowflake account to connect to |
| | PERMISSION_BOT_DATABASE | Database to use for transforming data |
| | PERMISSION_BOT_WAREHOUSE | Warehouse to store transformed data to |
| | RELEASE_LOCATION | Location to load release data from |
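For example, a shell environment for the TRANSFORMER role might be set up as follows (all values are placeholders; ask the team for real credentials):

```bash
# Placeholder values for the TRANSFORMER role - replace with real credentials
export SNOWFLAKE_USER="my_user"
export SNOWFLAKE_PASSWORD="my_password"
export SNOWFLAKE_ACCOUNT="my_account"
export SNOWFLAKE_TRANSFORM_DATABASE="my_transform_database"
export SNOWFLAKE_TRANSFORM_WAREHOUSE="my_transform_warehouse"
```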
Please see the separate documentation file.