/alphasql

AlphaSQL provides Integrated Type and Schema Check and Parallelization for SQL file set mainly for BigQuery

Primary LanguageC++Apache License 2.0Apache-2.0

AlphaSQL

release test

AlphaSQL provides Automatic Parallelization for sets of SQL files and integrated Type/Scheme Checker to eliminate syntax, type and schema errors from your datawarehouse.

Features

You can quickly introduce AlphaSQL by CI Example.

Docker Image

You can run commands below with docker

docker run --rm -v `pwd`:/home matts966/alphasql:latest [command]

like

docker run --rm -v `pwd`:/home matts966/alphasql:latest alphacheck ./samples/sample/dag.dot

Commands are installed in the PATH of the image.

Fast Binaries

# To install for MacOSX
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_darwin_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_darwin_x86_64.tar.gz -C /usr/local/bin
# To install for Linux
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_linux_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_linux_x86_64.tar.gz -C /usr/local/bin --strip=1

Extract DAG from SQL set

alphadag finds dependencies between table references and create table statements, function calls and create function statements.

# To extract DAG from your SQL set
$ alphadag --output_path ./samples/sample/dag.dot ./samples/sample/

# Or you can check the output in stdout by
$ alphadag [paths]

# with graphviz
$ dot -Tpng samples/sample/dag.dot -o samples/sample/dag.png

Note that sometimes the output has cycle, and refactoring SQL files or manual editing of the dot file is needed (see this issue).

If there are cycles, warning is emitted, type checker reports error, and bq_jobrunner raise error before execution. You can see the example in ./samples/sample-cycle .

If you want to serially execute some statements, you can write SQL script that contains multiple statements. See samples/sample/create_interim1.sql as an example.

Sample DAG output

The image below is extracted from SQL set in ./samples/sample . You can write tests for created tables and run them parallely only by separating SQL file.

dag.dot

Side effect first

You can resolve side effects such as INSERT and UPDATE statements before simple references by the --side_effect_first option.

dag.dot

With tables and functions

You can extract dependencies containing tables and functions by --with_tables and --with_functions options.

With tables

dag.dot

With functions

dag.dot

Parallel Execution

For BigQuery, the output DAG can be run parallely using

bq-airflow-dag-generator

airflow DAG

bq-airflow-dag-generator as Python package can generate Airflow DAG by simple Python script.

dagpath = "/path/to/dag.dot"
dag = generate_airflow_dag_by_dot_path(dagpath)

See usage on README for more details.

bq-jobrunner

from bq_jobrunner.bq_jobrunner import BQJobrunner

FILE_PATH = "./path/to/dag.dot"
PROJECT_ID = "your-project-id"
REGION = "asia-northeast1" # your region

runner = BQJobrunner(
    PROJECT_ID,
    location=REGION,
)
runner.compose_query_by_dot_path(FILE_PATH)
runner.execute()

Note that you should run job_runner in the same path as in extracting DAG.

Pipeline level Type Check for SQL set

Note that you should run type_checker in the same path as in extracting DAG.

# to check type and schema of SQL set
$ alphacheck ./samples/sample.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interim2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interium1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/insert_into_interim1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_mart.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart3.sql"
SUCCESS: analysis finished!
Successfully finished type check!

If you change column x's type in ./samples/sample/create_datawarehouse3.sql to STRING, type checker reports error.

$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
ERROR: INVALID_ARGUMENT: Column 1 in UNION ALL has incompatible types: INT64, STRING [at ./samples/sample/create_interim2.sql:7:1]
catalog:
        datawarehouse3
        datawarehouse2

Schema specification by JSON

You can specify external schemata (not created by queries in SQL set) by passing JSON schema path.

# with external schema
$ alphacheck --json_schema_path ./samples/sample-schema.json ./samples/sample/dag.dot

You can extract required external tables by

$ alphadag --external_required_tables_output_path ./required_tables.txt {./path/to/sqls}
# and get schemata using bq command
$ cat ./required_tables.txt | while read line
do
    bq show \
        --schema \
        --format=prettyjson \
        $line
done

JSON schema file should have only a top level map element keyed by string element, which specifies table name, and each value specifies schema for its key. name and type elements in the schema elements are recognized like the official API.

{
        "tablename1": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null}
        ],
        "tablename2": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null},
                {"mode": "NULLABLE", "name": "column2", "type": "INT64", "description": null}
        ]
}

CI Example

The pipeline level type check above is also useful in CI context. The sample in ./samples/sample-ci contains an example for extracting DAG, retrieving schema and checking schema and type of SQL set quering bigquery public dataset. You can introduce the CI to your environment only by copying cloudbuild_ci_sample.yaml and python_entrypoint.py to your project.

You can try the example CI with gcloud command by

(cd ./samples/sample-ci && gcloud builds submit --config=cloudbuild_ci_sample.yaml .)

This example

  • Supports _TABLE_SUFFIX feature!
  • Does not execute actual BigQuery and very cheap!

License

Apache License 2.0

Sponsors

The development of this project is sponsored by Japan Data Science Consortium and Cybozu Lab.