- Preparing Meltano
- Preparing Meltano for Production
Below is a guide on how to set up the Meltano project. Currently, these are the taps/targets installed:
- tap-gmail
- tap-slack
- tap-zoom
- target-snowflake
Setup of these taps/targets must be completed before the Meltano project is launched.
The current implementation of tap-gmail is tailored to GitLab, so fork the tap-gmail repository to ensure that upstream changes do not affect your project unexpectedly.
The Google API Python Client documentation provides a guide for completing all necessary steps to ensure the application and environment are configured correctly. A GCP project is required for the setup process, and it must be the same project used in the production deployment process. The guide can be found here:
Using OAuth 2.0 for Server to Server Applications
The key steps in the guide are:
- Create a service account
- Delegate domain-wide authority to the service account
- Create and download a json service account key from the newly-created service account (a command-line sketch follows below)
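The console steps can also be performed with the Google Cloud SDK. A minimal sketch, where the account name and project ID are placeholders; note that delegating domain-wide authority still happens in the Google Workspace admin console:
# Create the service account and download a JSON key for it
gcloud iam service-accounts create meltano-gmail --project=your-projectid-123456
gcloud iam service-accounts keys create service_key.json \
  --iam-account=meltano-gmail@your-projectid-123456.iam.gserviceaccount.com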
In order for the Slack data extraction to work, you need to create a Slack app with the following scopes:
channels:history
channels:join
channels:read
files:read
groups:history
groups:read
links:read
pins:read
reactions:read
remote_files:read
team:read
usergroups:read
users.profile:read
users:read
users:read.email
Once the app has been installed in your workspace, note the Slack verification token for the environment setup below.
Note: If you wish to get data from private channels, you must invite the bot to all private channels you wish to ingest.
With an account that has admin privileges, create a Zoom JWT app. During setup, you must set the expiration of the JWT key. Since this app is scheduled to run continuously, it is recommended to set the expiration date as far in the future as your company will allow.
The following sections must be changed in the meltano.yml file before installation (an illustrative excerpt follows the list):
- Change the pip_url under tap-gmail to git+https://your-service.com/organization/your-repo.git
- Under the settings of tap-slack, change the value of start_date to the date you wish to sync from.
- Under the config of target-snowflake, change account, database, warehouse, and role to your respective settings. For more information on the specific settings, read the Meltano Snowflake documentation.
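For illustration, the relevant entries might look like the following sketch; the exact layout depends on your generated meltano.yml, and every value shown is a placeholder:
plugins:
  extractors:
  - name: tap-gmail
    pip_url: git+https://your-service.com/organization/your-repo.git
  - name: tap-slack
    config:
      start_date: '2021-01-01T00:00:00Z'
  loaders:
  - name: target-snowflake
    config:
      account: your_account
      database: YOUR_DATABASE
      warehouse: YOUR_WAREHOUSE
      role: YOUR_ROLE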
Once the above settings are changed, run the following command to install your environment:
# Without Docker
meltano install
# With Docker
docker run -v $(pwd):/projects -w /projects meltano/meltano:latest-python3.8 install
In order for the installed Meltano taps to function properly, the following environment variables must be set:
SF_USER="<USER>"
SF_PASSWORD="<PASSWORD>"
TAP_SLACK_TOKEN="<SLACK VERIFICATION TOKEN>"
TAP_ZOOM_JWT="<JWT KEY>"
# The service account key created during the `Tap Gmail` setup for the Admin SDK
ADMIN_SDK_KEY="/path/to/service_key.json"
AIRFLOW__CORE__SQL_ALCHEMY_CONN="<postgresql://username:password@localhost:5432/mydatabase>"
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT="120"
AIRFLOW__CORE__EXECUTOR="LocalExecutor"
For local development, the easiest way to set these environment variables is to make a .env file with the necessary credentials in the root directory. The following environment variables can be ignored in the development environment:
- AIRFLOW__CORE__SQL_ALCHEMY_CONN - There is an internal SQLite database included with Meltano for development use
- AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT - Import timeout is not an issue on local systems
- AIRFLOW__CORE__EXECUTOR - The default SequentialExecutor works with SQLite
Once installation is complete and the environment variables have been set, you may run any command from the Meltano CLI.
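For example, to run a single extract-and-load job on demand (a sketch using the plugin names configured in this project):
# Run one EL pipeline manually
meltano elt tap-slack target-snowflake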
Scheduling is handled with Meltano's built-in Airflow. Currently, each tap is scheduled to sync every day. If you wish to change the frequency of individual syncs, change the value of interval under the schedules section within meltano.yml. Refer to the Meltano documentation for presets and valid intervals.
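A sketch of what one schedule entry in meltano.yml might look like; the schedule name and plugin pairing are illustrative:
schedules:
- name: slack-to-snowflake
  extractor: tap-slack
  loader: target-snowflake
  transform: skip
  interval: '@daily'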
To run the scheduler, run the following command:
# Without Docker
meltano invoke airflow scheduler
# With Docker
docker run -v $(pwd):/projects -w /projects meltano/meltano:latest-python3.8 invoke airflow scheduler
Below are the steps to get your Meltano project ready for a production environment. The production deployment instructions are for Google Cloud Platform using the Google Cloud SDK, and cover the following services:
- Container Registry
- Cloud SQL (Postgres)
- Kubernetes Engine
The deployment implements Kubernetes Secrets for sensitive environment variables, as well as a .json key file that is used to authorize the Cloud SQL proxy in a Kubernetes sidecar setup.
Once the taps have been set up and the meltano.yml file has been changed, you can create a Docker image to get ready for deployment to production. The provided Dockerfile will build the image and start up the Meltano scheduler.
To build the image, run the following command in the root directory:
docker build --tag <IMAGE NAME>:<TAG NAME> .
Documentation for building containers with Docker can be found here.
This section covers the necessary steps to create a GCP production environment for the Meltano application. A GCP project is required, and it is recommended that the Google Cloud SDK be installed in order to follow the commands used in this guide. The following GCP APIs must be enabled (a command-line sketch follows the list):
- Admin SDK
- Kubernetes Engine
- Cloud SQL
- Cloud SQL Admin
- Container Registry
- Additional API dependencies will be created as necessary by the services listed above
- This is typical behavior for GCP APIs
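With the Cloud SDK, the APIs can be enabled from the command line. A sketch using the standard GCP service identifiers; verify the list against your project's needs:
gcloud services enable admin.googleapis.com \
  container.googleapis.com \
  sql-component.googleapis.com \
  sqladmin.googleapis.com \
  containerregistry.googleapis.com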
The Admin SDK API and Reports API should already be enabled from the Gmail Setup section of this guide. If they are not enabled, please return to Gmail Setup and complete the setup process.
Admin SDK API and Reports API documentation can be found here.
The Cloud SQL and Cloud SQL Admin APIs are used to store Airflow metadata in the Meltano production environment. The Cloud SQL Admin API is necessary in order to create a Cloud SQL Proxy within a Kubernetes deployment.
Enable the Cloud SQL API
Enable the Cloud SQL Admin API
Create a Cloud SQL (PostgreSQL 12) instance if one does not already exist.
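With the Cloud SDK, these steps can be performed along these lines; the instance name, region, tier, user, and password are placeholders, and the database name matches the airflow-meta default used in gitlab-app.yaml:
gcloud sql instances create instance-name --database-version=POSTGRES_12 \
  --region=us-central1 --tier=db-g1-small
gcloud sql databases create airflow-meta --instance=instance-name
gcloud sql users create db-user --instance=instance-name --password=db-password
# Prints the connection name needed in gitlab-app.yaml
gcloud sql instances describe instance-name --format='value(connectionName)'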
There are some things to keep track of from this setup that will be needed to customize the Kubernetes deployment file gitlab-app.yaml:
- Cloud SQL instance name
- The connection name of the Cloud SQL instance
  - Example: your-projectid-123456:region:instance-name
- The name of the database where the Airflow metadata will be stored
  - The database name included in gitlab-app.yaml is airflow-meta
- A user name for the database
- The password of the user
The Container Registry is used to store the Docker image that will be deployed in Kubernetes Engine. Artifact Registry can be used as an alternative, but the gcloud SDK commands in this guide are for Container Registry.
Enable the Container Registry API
Container Registry will need to be authenticated in your local environment using the Google Cloud SDK to push and pull images. This can be ignored if Container Registry has been previously authenticated for other local projects.
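This is a one-time setup with the Cloud SDK's Docker credential helper:
# Configure Docker to authenticate to gcr.io with your gcloud credentials
gcloud auth configure-docker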
Google Container Registry docs can be found here.
Kubernetes Engine is used to deploy the production Meltano application. If there is an existing Kubernetes cluster that you would like to use for deploying Meltano, skip the Create a GKE Cluster section.
Enable the Kubernetes Engine API
Kubernetes docs can be found here.
https://cloud.google.com/kubernetes-engine/docs/quickstart#choosing_a_shell
Select Local shell, and then run the following command:
gcloud components install kubectl
https://cloud.google.com/kubernetes-engine/docs/quickstart#create_cluster
gcloud container clusters create cluster-name --num-nodes=1
Authenticate the Cluster
gcloud container clusters get-credentials cluster-name
The following guide section explains the process to create a service account key file and attach it to a Kubernetes secret, which will then be used by Cloud SQL Proxy for authentication:
https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine#service-account-key-file
- When creating Kubernetes secrets, use the --namespace=gitlab flag
# Example
kubectl create secret generic <YOUR-DB-SECRET> --namespace=gitlab \
--from-literal=username=<YOUR-DATABASE-USER> \
--from-literal=password=<YOUR-DATABASE-PASSWORD> \
--from-literal=database=<YOUR-DATABASE-NAME>
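For the proxy credentials specifically, gitlab-app.yaml expects a secret named cloud-sql that holds the service account key under the file name service_account.json. A sketch, where the local key path is a placeholder:
kubectl create secret generic cloud-sql --namespace=gitlab \
  --from-file=service_account.json=/path/to/key.json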
- Ensure that the service account you create (or use) has Cloud SQL Admin API IAM access with the Editor role
- The key.json created for the service account will be saved as a Kubernetes secret
  - The key.json file should be stored in a secure location in case it needs to be accessed in the future
- The gitlab-app.yaml file maps the value of key.json to the environment variable ADMIN_SDK_KEY
  - The Kubernetes secret name is admin-sdk
  - The key is service_key
  - The value associated with the key is key.json
The Kubernetes secret name used in gitlab-app.yaml for the Cloud SQL credentials is airflow-db. The key names within airflow-db are:
- username - The user name for the Cloud SQL database created in this section of the guide
- password - The password for the Cloud SQL database user created in this section of the guide
- database - The database name for the Cloud SQL database created in this section of the guide
The Cloud SQL Admin API should be enabled from this section of the guide. If it is not enabled, please proceed to Connecting using the Cloud SQL Proxy, which explains why it is necessary to enable the Cloud SQL Admin API.
Create the other necessary Kubernetes secrets (a combined sketch follows the list below). If the secret names and keys you create match what is in the gitlab-app.yaml file, you minimize the edits required to customize the deployment. All secrets must be created in the gitlab namespace.
# Example
kubectl create secret generic my-secret-name --from-literal=user=admin --namespace=gitlab
- Secret Name: tap-secrets
  - Keys:
    - airflow_conn - postgresql://username:password@localhost:5432/mydatabase
    - sf_password - Snowflake instance password
    - sf_user - Snowflake instance user name
    - tap_slack_token - Slack authentication token
    - tap_zoom_jwt - Zoom JWT authentication token
- Secret Name: admin-sdk
  - Keys:
    - service_key - The service account key created during the Gmail Setup
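Putting the above together, the two secrets might be created as follows (all values are placeholders):
kubectl create secret generic tap-secrets --namespace=gitlab \
  --from-literal=airflow_conn='postgresql://username:password@localhost:5432/mydatabase' \
  --from-literal=sf_user='<SNOWFLAKE USER>' \
  --from-literal=sf_password='<SNOWFLAKE PASSWORD>' \
  --from-literal=tap_slack_token='<SLACK VERIFICATION TOKEN>' \
  --from-literal=tap_zoom_jwt='<JWT KEY>'
kubectl create secret generic admin-sdk --namespace=gitlab \
  --from-file=service_key=/path/to/service_key.json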
To deploy to Kubernetes, first upload the Docker image to Container Registry.
Step 1 - Tag the image:
docker tag img_name gcr.io/[PROJECT-ID]/img_name:tag
Step 2 - Push the image to Google Container Registry:
docker push gcr.io/[PROJECT-ID]/img_name:tag
The gitlab-app.yaml file is used to create the Kubernetes deployment. To create a deployment on your Kubernetes cluster, run:
kubectl apply -f ./gitlab-app.yaml
Update the gitlab-app.yaml file as necessary before deploying.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-production # Name of the deployment
  namespace: gitlab # Namespace of the deployment
spec:
  replicas: 1 # We recommend 1 replica
  selector:
    matchLabels:
      app: gitlab
  template:
    metadata:
      labels:
        app: gitlab
    spec:
      serviceAccount: k8s-sa # Replace with the service account name created in the section 'Connect From GKE to Cloud SQL'
      serviceAccountName: k8s-sa # Replace with the service account created in the section 'Connect From GKE to Cloud SQL'
      containers:
        - name: gitlab
          image: gcr.io/<your-projectid-123456>/<image-name:image-tag> # The Gitlab-Meltano image
          imagePullPolicy: Always
          env:
            - name: DB_USER # Cloud SQL Postgres user name
              valueFrom:
                secretKeyRef:
                  name: airflow-db
                  key: username
            - name: DB_PASS # Cloud SQL Postgres password for user
              valueFrom:
                secretKeyRef:
                  name: airflow-db
                  key: password
            - name: DB_NAME # Cloud SQL Postgres database name
              valueFrom:
                secretKeyRef:
                  name: airflow-db
                  key: database
            - name: TAP_SLACK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: tap-secrets
                  key: tap_slack_token
            - name: TAP_ZOOM_JWT
              valueFrom:
                secretKeyRef:
                  name: tap-secrets
                  key: tap_zoom_jwt
            - name: SF_USER
              valueFrom:
                secretKeyRef:
                  name: tap-secrets
                  key: sf_user
            - name: SF_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: tap-secrets
                  key: sf_password
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: tap-secrets
                  key: airflow_conn
            - name: ADMIN_SDK_KEY
              valueFrom:
                secretKeyRef:
                  name: admin-sdk
                  key: service_key
            - name: AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT # Set environment variable
              value: "120"
            - name: AIRFLOW__CORE__EXECUTOR # Set environment variable
              value: "LocalExecutor"
        - name: cloud-sql-proxy # Cloud SQL Proxy sidecar
          # It is recommended to use the latest version of the Cloud SQL proxy
          # Make sure to update on a regular schedule!
          image: gcr.io/cloudsql-docker/gce-proxy
          command:
            - "/cloud_sql_proxy"
            # If connecting from a VPC-native GKE cluster, you can use the
            # following flag to have the proxy connect over private IP
            # - "-ip_address_types=PRIVATE"
            # Replace DB_PORT with the port the proxy should listen on
            # Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
            # Copy and paste your SQL instance connection name into the line below
            - "-instances=<your-projectid-123456:region:instance-name>=tcp:5432"
            # This flag specifies where the service account key can be found
            - "-credential_file=/secrets/service_account.json" # The service account key for Cloud SQL Proxy
          securityContext:
            # The default Cloud SQL proxy image runs as the
            # "nonroot" user and group (uid: 65532) by default.
            runAsNonRoot: true
          volumeMounts:
            - name: k8s-sql-volume
              mountPath: /secrets/
              readOnly: true
      volumes:
        - name: k8s-sql-volume
          secret:
            secretName: cloud-sql # The secret that contains service_account.json for Cloud SQL Proxy