
CI for Data in BigQuery CLI utility

This repo contains two CLI utilities that help orchestrate a CI/CD workflow for data in BigQuery.

The first utility, create-dev-env, helps orchestrate the process of creating Table Snapshots and Table Clones for the purpose of implementing changes inside a BigQuery data warehouse. This tool automates parts of the process described in the document CI For Data in BigQuery. The tool was written mainly for demonstration purposes and is probably worth customizing before putting it to use in production.

The second utility, run-tests, runs a set of predefined tests, written in SQL, against the data. The SQL test statements are written so that they fail if the test fails. The queries are loaded and, before execution, the target table names are replaced (depending on runtime arguments) so that either the original table or the test table is queried. See the section Writing Tests.

Prerequisites

  • Python 3+ (tested using Python 3.8.12 & 3.9.9)
  • virtualenv (Recommended)

Authentication

To run the utilities you will need to authenticate using a service account. This can be accomplished either by running inside an authorized VM on GCP, or by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to a valid key file. See more at Authenticating as a service account.
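
For example, a minimal sketch assuming the google-cloud-bigquery package is installed: once GOOGLE_APPLICATION_CREDENTIALS points at a key file, the BigQuery client picks up the credentials automatically via Application Default Credentials (the key file path below is only a placeholder):

import os
from google.cloud import bigquery

# Placeholder path; point this at your own service account key file.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/path/to/keyfile.json")

# The client resolves credentials via Application Default Credentials,
# which reads GOOGLE_APPLICATION_CREDENTIALS when it is set.
client = bigquery.Client()
print(client.project)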

Required Permissions

For the base dataset, where the original tables are residing:

  • bigquery.datasets.get
  • bigquery.datasets.getData
  • bigquery.tables.list

To create a dataset:

  • bigquery.datasets.create

On the dataset where the table clones and snapshots will reside:

  • bigquery.tables.create
  • bigquery.tables.createSnapshot

To run tests against the test tables, on the new dataset:

  • bigquery.tables.getData

The Process

A GIF is worth a thousand words

[Example run GIF]

In words

The create-dev-env CLI utility asks for several inputs in order to coordinate snapshot and clone creation, after which the developer can start implementing their changes in isolation. All clones and snapshots are created as of a single point in time. On activation, the script will ask for the following:

  1. A source project to select from a list of available projects (as assigned to the service account).
  2. A source Dataset, from a list of datasets available in the source project.
  3. A list of source tables. Multiple selection is possible, using the space key to select or deselect the options. The list of tables is fetched from the selected source dataset.
  4. A target dataset: either an existing one, or the option to create a new dataset, in which case the user will be prompted to enter a new dataset name (the new name is validated).
  5. A datetime that will be used as the "point in time" for snapshot and clone creation. This must be a point in time within the last 7 days.
  6. A final confirmation of all the details.

Upon confirmation, the script will create the target dataset if required, and create a snapshot and a clone for each table in the provided list. Each snapshot name will be of the form snap_<DATETIME>_<SOURCE_TABLE_NAME> and each clone name of the form clone_<DATETIME>_<SOURCE_TABLE_NAME>, where DATETIME consists of four digits for the year and two digits each for the month, day, hour, minute and second. E.g. for a source table named foo, a snapshot might be named snap_20220317151941_foo and the corresponding clone clone_20220317151941_foo.
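
Under the hood these are ordinary BigQuery operations. As a point of reference, here is a minimal sketch of an equivalent point-in-time snapshot and clone creation using the google-cloud-bigquery client and BigQuery DDL (the project, dataset and table names are hypothetical, and the tool's own implementation may differ):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names, for illustration only.
source = "my-project.source_dataset.foo"
target_dataset = "my-project.dev_dataset"

# The chosen point in time must fall within BigQuery's time-travel window (last 7 days).
point_in_time = datetime.now(timezone.utc) - timedelta(hours=1)
stamp = point_in_time.strftime("%Y%m%d%H%M%S")          # e.g. 20220317151941
as_of = point_in_time.strftime("%Y-%m-%d %H:%M:%S+00")  # BigQuery TIMESTAMP literal

# Read-only snapshot of the source table at the chosen point in time.
client.query(f"""
    CREATE SNAPSHOT TABLE `{target_dataset}.snap_{stamp}_foo`
    CLONE `{source}`
    FOR SYSTEM_TIME AS OF TIMESTAMP '{as_of}'
""").result()

# Writable clone of the same point in time, for the developer to work on.
client.query(f"""
    CREATE TABLE `{target_dataset}.clone_{stamp}_foo`
    CLONE `{source}`
    FOR SYSTEM_TIME AS OF TIMESTAMP '{as_of}'
""").result()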

Writing Tests

In the sql_tests directory, you can place multiple SQL files. In each file, the SQL statement should be designed to throw an error if the test fails. See Debugging Functions and Debugging Statements for more information.

In each statement, the queried tables should be written with a dollar sign ('$') prefix and curly braces surrounding the fully qualified name of the original table (e.g. '${dataset.foo}'). For example, start from the following plain SQL statement:

ASSERT (
    (SELECT COUNT(*) 
    FROM dataset1.bar as b 
    LEFT JOIN dataset2.foo as f 
        ON f.id = b.foo_id
    WHERE f.id IS NULL) = 0
) AS 'All bar records must have a valid foo id that corresponds to a record in the foo table'

This statement joins dataset1.bar as the table under test against dataset2.foo as a reference table, but as written it always queries the original tables. To allow the parser to substitute tables at runtime, the same statement is written with placeholders:

ASSERT (
    (SELECT COUNT(*) 
    FROM ${dataset1.bar} as b 
    LEFT JOIN ${dataset2.foo} as f 
        ON f.id = b.foo_id
    WHERE f.id IS NULL) = 0
) AS 'All bar records must have a valid foo id that corresponds to a record in the foo table'

If we want to replace the original dataset1.bar with a different table, dev_dataset3.clone_of_bar, at runtime in order to test our changes, we need to create a translation file. This comes in the form of a JSON file, where each key is the original table name (as it appears in the query) and the value is the new target table. In our example it would look like this:

{
  "dataset1.bar": "dev_dataset3.clone_of_bar"
}

At runtime, we specify this file as a parameter to the script: run-tests --translation-file translations/new_tests_configuration.json sql_tests/. This will replace every instance of ${dataset1.bar} with the new table name. If a table is referenced in a query but does not appear in the translation file, the parser falls back to the original key, effectively querying the original table (for cases where we have no changes to test on that table).
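
To make the replacement and fallback behaviour concrete, here is a minimal sketch of how such a substitution and test run could look (an illustration only, with hypothetical file names; it assumes the google-cloud-bigquery client and is not necessarily how run-tests implements it):

import json
import re

from google.api_core.exceptions import GoogleAPICallError
from google.cloud import bigquery

def render_sql(sql: str, translations: dict) -> str:
    # Replace every ${table} placeholder with its mapped name,
    # falling back to the original name when no mapping exists.
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: translations.get(m.group(1), m.group(1)),
                  sql)

client = bigquery.Client()

with open("translations/new_tests_configuration.json") as f:
    translations = json.load(f)

with open("sql_tests/bar_has_valid_foo_id.sql") as f:  # hypothetical test file
    sql = render_sql(f.read(), translations)

try:
    client.query(sql).result()  # a failing ASSERT makes the query itself error out
    print("PASS")
except GoogleAPICallError as exc:
    print(f"FAIL: {exc}")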

We can use different translation files to control which tables are under test and which are not.