mattermost-data-warehouse

This repo contains the code for loading data into Mattermost's data warehouse, performing transformations, and exporting data to other tools.

Repository structure

.
├── dags                    <--- Airflow DAGs. DAGs mostly use KubernetesOperator to run a job.
├── extract                 <--- Python scripts that extract data from various locations.
│  ├── pg_import            <--- Generic utility that copies data from Snowflake to Postgres
│  └── s3_extract           <--- Various utilities for importing data from S3 to Snowflake
├── k8s                     <--- Pod definitions for Pipelinewise
├── load        
│  └── snowflake            <--- Snowflake role definitions
├── README.md               <--- This file
├── poetry.lock             <--- Pinned dependency versions
├── pyproject.toml          <--- PEP-518 build system requirements
├── requirements.txt        <--- Dependencies (deprecated)
├── transform
│  ├── snowflake-dbt        <--- Snowflake DBT models
│  ├── mattermost-analytics <--- New DBT project for Mattermost analytics.
│  └── sql                  <--- SQL scripts that get executed by DAGs
└── utils                   <--- Various Python scripts

What does this repo contain?

Extract

Tools for extracting data

  • from S3 to Snowflake and
  • from Snowflake to Postgres.

All extractions are executed using Airflow on a Kubernetes cluster.
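
For instance, the S3-to-Snowflake path ultimately issues a COPY INTO statement against Snowflake. A minimal sketch of that idea, assuming the snowflake-connector-python package and hypothetical stage and table names, looks like this:

import os

import snowflake.connector

# Connect using the LOADER-role variables described in the Configuration section.
conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_LOAD_USER"],
    password=os.environ["SNOWFLAKE_LOAD_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    database=os.environ["SNOWFLAKE_LOAD_DATABASE"],
    warehouse=os.environ["SNOWFLAKE_LOAD_WAREHOUSE"],
)
try:
    # Load staged S3 files into a raw table (stage and table names are hypothetical).
    conn.cursor().execute(
        "COPY INTO raw.example_events FROM @raw.example_s3_stage FILE_FORMAT = (TYPE = CSV)"
    )
finally:
    conn.close()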

Transform

  • DBT project for running transformations on Snowflake. The DBT project runs on DBT cloud.
  • SQL scripts for NPS feedback. The scripts are executed by an Airflow DAG.

Load

Snowflake role definitions. An Airflow DAG runs the update.

Billing

Triggers invoice generation for subscriptions. Runs via an Airflow DAG.

Utils

A set of Python scripts performing custom ETLs. The utilities run as part of Airflow DAGs.

DAGs

Airflow DAGs that orchestrate ETL pipelines.
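
As an illustration, a typical DAG wraps one of the repo's scripts in a Kubernetes pod operator. The image name, module path and schedule below are hypothetical, and the exact import path depends on the installed Kubernetes provider version:

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="example_extract",
    schedule_interval="0 * * * *",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Run an extraction script from this repo inside a pod on the Kubernetes cluster.
    extract = KubernetesPodOperator(
        task_id="s3_to_snowflake",
        name="s3-to-snowflake",
        image="example/data-warehouse:latest",  # hypothetical image
        cmds=["python", "-m", "extract.s3_extract.example_job"],  # hypothetical module
        env_vars={"SNOWFLAKE_ACCOUNT": "<snowflake account>"},
        get_logs=True,
    )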

Running locally

Prerequisites

  • Docker & docker compose
  • Snowflake access with proper roles (ask the team which role to use)

Setup

Copy .dbt.env.example to .dbt.env. Edit the file and replace the placeholders with appropriate values.
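
For example, assuming the profile uses the TRANSFORMER-role variables listed in the Configuration section below (the authoritative list of variables is in .dbt.env.example), the file might look like:

SNOWFLAKE_USER=<username>
SNOWFLAKE_PASSWORD=<password>
SNOWFLAKE_ACCOUNT=<snowflake account>
SNOWFLAKE_TRANSFORM_DATABASE=<database>
SNOWFLAKE_TRANSFORM_WAREHOUSE=<warehouse>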

Starting a bash shell with DBT

From the root of the repo, run:

make dbt-bash

This command starts a container with dbt pre-installed and drops you into a bash shell inside it.

To verify that everything is working as expected, try generating the dbt docs:

dbt docs generate -t prod

If the output looks similar to:

...

07:28:30  Found 289 models, 10 tests, 0 snapshots, 0 analyses, 407 macros, 0 operations, 15 seed files, 263 sources, 0 exposures, 0 metrics
07:28:30  
07:29:00  Concurrency: 8 threads (target='prod')
07:29:00  
07:32:04  Done.
07:32:04  Building catalog
07:44:19  Catalog written to /usr/app/target/catalog.json

then the setup is working.
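
From the same shell you can run any other dbt command, e.g. compiling a single model (the model name is a placeholder):

dbt compile --select <model_name> -t prod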

Alternatively, you can run

make dbt-mattermost-analytics

to start a bash shell for the new project.

Building docs and serving them locally

From the root of the repo, run:

make dbt-docs

This command will generate the docs and serve them at http://localhost:8081.

Overriding profile

All dbt commands that are part of the Makefile use profiles under transform/snowflake-dbt/profile by default. This directory can be overridden by setting the DBT_PROFILE_PATH environment variable:

DBT_PROFILE_PATH=/path/to/profile make dbt-bash

Note that DBT_PROFILE_PATH can be either an absolute or a relative path. Relative paths are resolved against the build directory of this project.

Developing

Requirements

Install dependencies

Run the following commands:

poetry install

# Clearbit cannot be installed as part of poetry dependencies, as it's a really old dependency.
poetry run pip install clearbit==0.1.7

Adding dependencies

Additional dependencies can be specified in pyproject.toml. Poetry's documentation provides examples. Prefer using the Poetry CLI, as it also updates the poetry.lock file and pins any new dependencies.
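
For example, to add a new dependency (the package name is a placeholder):

poetry add <package-name>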

Note that currently there's a requirements.txt file. This file will be deprecated.

Configuration

Snowflake connections

Snowflake connection details can be configured by adding the proper environment variables. Each role requires a different set of environment variables. The following table describes the required environment variables for each role:

Role              Environment variable           Description
SYSADMIN          SNOWFLAKE_USER                 Username
                  SNOWFLAKE_PASSWORD             Password
                  SNOWFLAKE_ACCOUNT              Snowflake account to connect to
                  SNOWFLAKE_LOAD_DATABASE        Database to load data to
                  SNOWFLAKE_LOAD_WAREHOUSE       Warehouse to load data to
ANALYTICS_LOADER  SNOWFLAKE_LOAD_USER            Username
                  SNOWFLAKE_LOAD_PASSWORD        Password
                  SNOWFLAKE_ACCOUNT              Snowflake account to connect to
                  SNOWFLAKE_TRANSFORM_DATABASE   Database to load data to
                  SNOWFLAKE_LOAD_WAREHOUSE       Warehouse to load data to
LOADER            SNOWFLAKE_LOAD_USER            Username
                  SNOWFLAKE_LOAD_PASSWORD        Password
                  SNOWFLAKE_ACCOUNT              Snowflake account to connect to
                  SNOWFLAKE_LOAD_DATABASE        Database to load data to
                  SNOWFLAKE_LOAD_WAREHOUSE       Warehouse to load data to
TRANSFORMER       SNOWFLAKE_USER                 Username
                  SNOWFLAKE_PASSWORD             Password
                  SNOWFLAKE_ACCOUNT              Snowflake account to connect to
                  SNOWFLAKE_TRANSFORM_DATABASE   Database to use for transforming data
                  SNOWFLAKE_TRANSFORM_WAREHOUSE  Warehouse to store transformed data to
PERMISSIONS       PERMISSION_BOT_USER            Username
                  PERMISSION_BOT_PASSWORD        Password
                  PERMISSION_BOT_ACCOUNT         Snowflake account to connect to
                  PERMISSION_BOT_DATABASE        Database to use for transforming data
                  PERMISSION_BOT_WAREHOUSE       Warehouse to store transformed data to
                  RELEASE_LOCATION               Location to load release data from
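
For illustration only (this snippet is not part of the repo), a script running under the LOADER role could verify that its environment is complete before connecting:

import os

# Required variables for the LOADER role, mirroring the table above.
REQUIRED_VARS = [
    "SNOWFLAKE_LOAD_USER",
    "SNOWFLAKE_LOAD_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_LOAD_DATABASE",
    "SNOWFLAKE_LOAD_WAREHOUSE",
]

missing = [name for name in REQUIRED_VARS if name not in os.environ]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")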

Airflow

Please see the separate documentation file.