/arthur-redshift-etl

ETL Code for Redshift data warehouse

Primary LanguagePythonMIT LicenseMIT

ETL Code for Loading Data Into a Redshift-based Data Warehouse

                _   _                  _____          _     _     _  __ _     ______ _______ _
     /\        | | | |                |  __ \        | |   | |   (_)/ _| |   |  ____|__   __| |
    /  \   _ __| |_| |__  _   _ _ __  | |__) |___  __| |___| |__  _| |_| |_  | |__     | |  | |
   / /\ \ | '__| __| '_ \| | | | '__| |  _  // _ \/ _` / __| '_ \| |  _| __| |  __|    | |  | |
  / ____ \| |  | |_| | | | |_| | |    | | \ \  __/ (_| \__ \ | | | | | | |_  | |____   | |  | |____
 /_/    \_\_|   \__|_| |_|\__,_|_|    |_|  \_\___|\__,_|___/_| |_|_|_|  \__| |______|  |_|  |______|

Arthur is an ETL tool for managing a data warehouse in the AWS ecosystem. Arthur is designed to manage a warehouse in full-rebuild mode where the entire warehouse is rebuilt, from scratch, every night. Arthur is not designed to support streaming or micro-batch ETLs. Arthur is best suited for organizations whose data are managed in a stateful transactional database and have lots of complicated business logic for their data transformations that they want to be able to manage effectively.

If you’re interested in this approach or are in a similar situation, then we’d like to talk to you. Please reach out and let’s have a data & analytics meetup.

This README outlines how to get started with the ETL'ing and basic principles. This includes information about setting up Arthur which is the driver for ETL activities.

You are probably (also) looking for the wiki pages, which include a lot more information about the ETL and what it does (and why it does what it does). And if something appears amiss, check out the issues page.

Installing the code

See the separate INSTALL.md file.

Documentation

See also our wiki pages. And here's a presentation about Arthur, given at the Startup booth during the AWS Summit in New York.

Configuring the ETL (and upstream sources)

The best approach is to have a separate repo for your data warehouse that contains the configuration files and all the table design files and transformation code in SQL. The documentation will in many places assume that you have a "sibling" repo so that when within the repo for your local data warehouse (with configuration, credentials, and table designs), you can simply use ../arthur-redshift-etl/ to find your way back to this ETL code.

Redshift cluster and users

Although the Redshift cluster can be administered using the AWS console and psql, some helper scripts will make setting up the cluster consistently much easier. (See below for initialize and create_user.)

Also, add the AWS IAM role that the database owner may assume within Redshift to your settings file so that Redshift has the needed permissions to access the folder in S3. (And don't forget to add the role to the list of known IAM roles in Redshift.)

Sources

See the wiki pages about a description of configurations.

Running the ETL (arthur.py)

General notes about the CLI

  • Commands will provide usage information when you use -h.
    • There is also a help command that provides introductions to various topics.
  • Commands need configuration files. They will pick up all files in a local ./config directory or from whatever directory to which DATA_WAREHOUSE_CONFIG points.
  • Commands accept a --dry-run command line flag to test without modifying the environment.
  • Most commands allow the use of glob patterns to select specific schema(s) or table(s).
  • Most commands use --prefix to select a folder in the S3 bucket.
    • A few development commands normally pick up local files first and you need to add --remote to go to S3.
  • To pick a prefix without specifying it every time, set the environment variable ARTHUR_DEFAULT_PREFIX.
  • Log files are by default in arthur.log. They are rotated and deleted so that your disk doesn't fill up too much.
  • Logs that are collected from data pipelines are in stderr.gz or StdErr.gz files since Arthur logs to stderr.
  • To see more log information, use --verbose. To see less, use --quiet.
  • To see them formatted in the console the same way as they are formatted in the log files, use --prolix.
  • You could copy data manually, but you probably shouldn't and let arthur.py sync manage files.
  • You can use environment variables to pass in credentials for database access, but you should use a file for that.

Prerequisites for running the ETL in a cluster

Creating a credentials file

All credentials can be picked up from environment variables by the ETL. Instead of setting these variables before starting the ETL, you can also add a file with credentials to the config directory where the ETL will pick them up for you. The credentials file should be formatted just like a shell file would be to set variables, meaning lines should have the form:

# Lines with '#' are ignored.
NAME=value
# Although not meaningful within the ETL code, you can use the "export" syntax from Bash
export NAME=value

The minimal credentials file contains the login information for the ETL user that Arthur will use to execute in Redshift. Make sure this file exists in your data warehouse repo as config/credentials.sh:

DATA_WAREHOUSE_ETL=postgres://etl:<password>@<host>:<port>/<dbname>?sslmode=require

If you need to make changes in the cluster beyond schema changes, you will also need an admin:

DATA_WAREHOUSE_ADMIN=postgres://admin:<password>@<host>:<port>/<dbname>?sslmode=require

Starting Arthur in Docker

The INSTALL.md file will explain how to setup a Docker image to run Arthur. Once you have that, getting to a prompt is easy:

bin/run_arthur.sh ../warehouse-repo/config production

This command will set the path to the configuration files and default environment (a.k.a. prefix) for you.

Copying code into the S3 bucket

Copy the ETL code (including bootstrap scripts and configuration) is as simple as:

upload_env.sh

For this to work, you have to set the object_store in one of your configuration files.

Starting a cluster and submitting commands

Start a cluster:

launch_emr_cluster.sh

Now check for the output and pick up the cluster ID. There will be a line that looks something like this:

+ CLUSTER_ID=j-12345678

You can then use arthur.py --submit "<cluster ID>" instead of arthur.py in the examples below. Note that the --submit option must be between arthur.py and the sub-command in use, e.g.

arthur.py --submit "<cluster ID>" load --prolix --prefix $USER

Don't worry -- the script launch_emr_cluster.sh will show this information before it exits.

Initializing the Redshift cluster

Sub-command Goal
initialize Create schemas, groups and users
create_user Create (or configure) users that are not mentioned in the configuration file
# The commands to setup the data warehouse users and groups or any database is by ADMIN (connected to `dev`)
arthur.py initialize
arthur.py initialize development --with-user-creation  # Must create users and groups on first call

Starting with design files (and managing them)

Sub-command Goal
bootstrap_sources Download schemas from upstream sources and bootstrap design files
bootstrap_transformations Bootstrap (or update) design files for transformations based on new SQL queries
explain Review query plan for transformations
validate After making changes to the design files, validate that changes are consistent with the expected format and with respect to each other
sync Upload your local files to your data lake
# This will upload local files related to one schema into your folder inside the S3 bucket:
arthur.py sync "<schema>"
# This will upload local files for just one table
arthur.py sync "<schema>.<table>"

Note that when running sync that involved changes of source schemas or configurations, you must use:

arthur.py sync --force --deploy "<schema>.<table>"

Deploying into production

We prefer to have a short and succinct way to deploy our data warehouse files (configuration, design files and transformations) into production. So instead of starting a bash and running sync, just do:

bin/deploy_with_arthur.sh -p aws-prod-profile ../repo/config_directory/ production

Loading and updating data

Sub-command Goal
extract Get data from upstream sources (databases or S3)
load, upgrade Make data warehouse "structural" changes and let data percolate through
update Move data from upstream sources and let it percolate through
unload Take data from a relation in the data warehouse and extract as CSVs into S3
arthur.py extract
arthur.py load  # This will automatically create schemas and tables as necessary

Note that when a load fails, the work until the failed relation is still in the "staging" schemas. You can continue the load after fixing any query errors or input data, using:

arthur.py upgrade --with-staging-schemas --continue-from failed_relation.in_load_step

Within a production environment, check out install_pizza_pipeline.sh which provides a consistent interface.

Dealing with schemas (create, restore)

Sub-command Goal
create_schemas Create schemas; normally load will do that for you
promote_schemas Bring back schemas from backup if load was aborted or promote staging after fixing any issues

To test permissions (granting and revoking), use this for any schema:

arthur.py create_schemas schema_name
arthur.py create_schemas --with-staging schema_name
arthur.py promote_schemas --from staging schema_name

Working with subsets of tables

Sub-command Goal
show_downstream_dependents Inspect the other relations impacted by changes to the selected ones
show_upstream_dependencies Inspect which other relations feed the selected ones

The patterns used by commands like extract or load may be provided using files. Together with show_downstream_dependents and show_upstream_dependencies, this opens up opportunities to work on a "sub-tree" of the data warehouse.

Working with just the source schemas or transformation schemas

At the beginning it might be worthwhile to focus just on tables in source schemas -- those tables that get loaded using CSV files after extract.

arthur.py show_downstream_dependents -q | grep 'kind=DATA' | tee sources.txt

# List CSV files and manifests, then continue with upgrade etc.
arthur.py ls --remote @sources.txt

Note that you should use the special value :transformations when you're interested to work with transformations.

Example:

arthur.py show_downstream_dependents -q --continue-from=:transformations | tee transformations.txt

arthur.py sync @transformations.txt
arthur.py upgrade --only @transformations.txt

# If you don't make changes to transformations.txt, then you might as well use
arthur.py upgrade --continue-from=:transformations

Working with a table and everything feeding it

While working on transformations or constraints, it might be useful to focus on just a set of of tables that feed data into it.

Example:

arthur.py show_upstream_dependencies -q www.users | tee www_users.txt

arthur.py sync www.users
arthur.py upgrade --only @www_users.txt

Using configuration values to fill in templates

AWS service templates can be filled out based on configuration in the ETL.

Sub-command Goal
render_template Return a JSON document that has values (like ${resources.vpc.id}) filled in
show_value Show value of a single variable based on current configuration
show_vars Show variables and their values based on current configuration

Note this leaves references like #{parameter}, which are used by AWS tools, in place.

Example:

arthur.py render_template ec2_instance
arthur.py show_value object_store.s3.bucket_name
arthur.py show_vars object_store.s3.*

Working with a staging environment

A staging environment can help with deploying data that you'll be confident to release into production.

arthur.py initialize staging
arthur.py initialize staging --dry-run  # In case you want to see what happeens but not lose all schemas.

Once everything is working fine in staging, you can promote the code into production.

sync_env.sh "<your S3 bucket>" staging production

Don't forget to upload any credentials_*.sh or environment.sh files as needed for production.

Contributing and Releases

Creating pull requests

Pull requests are welcome!

Development takes place on the next branch. So go ahead, and create a branch off next and work on the next ETL feature.

Formatting code

pycodestyle python
  • Please sort your imports using isort:
isort --settings-path ./etc --recursive python/
  • Please format your code using black:
black --config ./etc/pyproject.toml python/

Adding a pre-commit hook

Use the the git pre-commit hook to run a PEP8 check automatically:

ln -s -f ../../githooks/pre-commit ./.git/hooks/pre-commit

References

Releasing new versions

Here are the basic steps to release a new version. Appropriate as you deem appropriate.

Creating patches

For minor updates, use a PR to update next. Anything that requires integration tests or running code in development cluster first should go through a release candidate.

Creating a release candidate

  • Create a new branch for the release candidate, e.g. v1_2_0 for v1.2.0. (Do yourself a favor and use underscores in branch names and periods in the tag names.)

  • Your first commit on this branch is a bump in the version number in setup.py.

  • Create a pull request for this new branch, add a message to state that it is

    "Release Candidate for v1.2.0"
    .

  • Go through open pull requests that are ready for release and change their base branch to your release branch (in our example, v1_2_0).

    • Make sure the PR message contains the Issue number or the Jira story (like DW-99) or the Issue number from this repo, as applicable.
    • Add the changes from the story work into comments of the PR.
      • Consider changes that are user facing and make sure there's a summary for those
      • List all bug fixes (especially when there are associated tickets)
      • Highlight internal changes, like changes in data structures or added control flow
    • Then merge the ready PRs into your release candidate.
  • After testing, merge the PR with your release candidate into next.

  • Create a release under Releases.

    • Create a new release and set the release version, e.g. v1.2.0.
    • Copy the comments from the PR where you collected all the changes into the release notes.
    • Save the release which will add the tag of the release.
  • Ship the new version using upload_env.sh in development. Wait at least a day before promoting to production.

Releasing code to master branch

Once code is considered ready for production (and you've made sure there's an updated version number in setup.py):

  • Merge next into master
git checkout master
git pull
git merge origin/next
git push
  • Then merge master back into next to ensure any hotfixes on master get picked up:
git checkout next
git pull
git merge --no-ff origin/master
git push

Tips & Tricks

Miscellaneous

Using command completion in the shell

For the bash shell, there is a file to add command completion that allows to tab-complete schemas and table names.

source etc/arthur_completion.sh

(Within a Docker image, that happens automatically.)

iPython and q

Consider installing iPython:

pip3 install ipython

Also, q comes in handy for debugging:

pip3 install q

EMR login / EC2 login

You can use the .ssh/config file to pick the correct user (hadoop) for the cluster and to automatically pick up a key file. Replace the IdentityFile value with the location of your key pair file.

Host ec2-*.amazonaws.com
  ServerAliveInterval 60
  User hadoop
  IdentityFile ~/.ssh/dw-dev-keypair.pem

If you find yourself using a one-off EC2 instance more often than an EMR cluster, change the User:

  User ec2-user

For information about maintaining the virtual environment, see the INSTALL.md file.