/mobydq

:whale: Tool to automate data quality checks on data pipelines

Primary LanguagePythonApache License 2.0Apache-2.0

MobyDQ

MobyDQ is a tool for data engineering teams to automate data quality checks on their data pipeline, capture data quality issues and trigger alerts in case of anomaly, regardless of the data sources they use.

Data pipeline

This tool has been inspired by an internal project developed at Ubisoft Entertainment in order to measure and improve the data quality of its Enterprise Data Platform. However, this open source version has been reworked to improve its design, simplify it and remove technical dependencies with commercial software.


Getting Started

Skip the bla bla and run your data quality indicators by following the Getting Started page. The complete documentation is also available on Github Pages: https://mobydq.github.io.


Requirements

Install Docker

This tool has been fully containerized with Docker to ensure easy deployment and portability. To add the Docker repository to your Linux machine, execute the following commands in a terminal window.

$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

Install Docker Community Edition.

$ sudo apt-get update
$ sudo apt-get install docker-ce

Add your user to the docker group to setup its permissions. Make sure to restart your machine after executing this command.

$ sudo usermod -a -G docker <username>

Install Docker Compose

Execute the following command in a terminal window.

$ sudo apt install docker-compose

Setup Your Instance

Create Configuration Files

Based on the template below, create a text file named .env at the root of the project. This file is used by Docker Compose to load configuration parameters into environment variables. This is typically used to manage file paths, logins, passwords, etc. Make sure to update the postgres user password for both POSTGRES_PASSWORD and DATABASE_URL parameters.

# DB
POSTGRES_USER=postgres
POSTGRES_PASSWORD=password

# GRAPHQL
DATABASE_URL=postgres://postgres:password@db:5432/mobydq

# SCRIPTS
GRAPHQL_URL=http://graphql:5433/graphql
MAIL_HOST=smtp.server.org
MAIL_PORT=25
MAIL_SENDER=change@me.com

# APP PARAMS
NODE_ENV=development
REACT_APP_GRAPHQL_API_URL=http://0.0.0.0:5433/graphql

Create Docker Network

This custom network is used to connect the different containers between each others. It is used in particular to connect the ephemeral containers ran when executing batches of indicators.

``


## Create Docker Volume
Due to Docker compatibility issues on Windows machines, we recommend to manually create a Docker volume instead of directly mounting external folders in `docker-compose.yml`. This volume will be used to persist the data stored in the PostgreSQL database. Execute the following command.
```shell
``


## Build Docker Images
Go to the project root and execute the following command in your terminal window.
```shell
$ cd mobydq
$ docker-compose build --no-cache

Run Docker Containers

To start all the Docker containers as deamons, go to the project root and execute the following command in your terminal window.

$ cd mobydq
$ docker-compose up -d db graphql api app

Individual components can be accessed at the following addresses:

Note access to GraphiQL and the PostgreSQL database is restricted by default to avoid intrusions. In order to access these addresses directly, you must run them with the following command to open their ports:

$ cd mobydq
$ docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d db graphql

Run Test Cases

To execute all test cases, execute following command from the project repository:

 to be documented

Dependencies

Docker Images

The containers run by docker-compose have dependencies with the following Docker images:

Python Packages

docker (3.5.0)

jinja2 (2.10.0)