bigquery-etl-dataflow-sample

Performing ETL into BigQuery Tutorial Sample Code

This is the sample code for the Performing ETL from a Relational Database into BigQuery using Dataflow tutorial. The tutorial explains how to ingest highly normalized (OLTP database style) data into BigQuery using Dataflow. To understand this sample code, it is recommended that you review the Apache Beam programming model.

This tutorial relies on the MusicBrainz dataset.

Note that this tutorial assumes that either:

  • you are using Cloud Shell, or
  • you have Java 8, Maven, and the Google Cloud SDK installed (you can verify this with the commands below).
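
A quick, non-authoritative way to confirm the local prerequisites (the exact version strings will vary with your installation):

    # Verify the local toolchain; output formats differ between versions.
    java -version     # should report a 1.8.x JDK
    mvn -v            # Apache Maven
    gcloud version    # Google Cloud SDK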

It also requires that you have a project, a Dataflow staging bucket, and a BigQuery dataset already created.
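
If you still need to create the staging bucket and dataset, commands along these lines should work; the names below are only examples that match the Quickstart conventions later in this README, and the bucket location is an assumption you should adjust:

    # Example resource creation; substitute your own project ID and location.
    gcloud config set project [YOUR_PROJECT_ID]

    # Cloud Storage bucket used for Dataflow staging.
    gsutil mb -l us-east1 gs://[YOUR_PROJECT_ID]-etl-staging-bucket

    # BigQuery dataset that the pipelines write to.
    bq mk --dataset [YOUR_PROJECT_ID]:musicbrainz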

Overview

The repository consists of the following Java classes:

  1. com.google.cloud.bqetl.BQETLSimple - Simple pipeline for ingesting MusicBrainz artist recording data as a flat table of artists' recordings.
  2. com.google.cloud.bqetl.BQETLNested - Revision of the simple pipeline that nests each artist's recordings as a repeated record inside the BigQuery table row that pertains to that artist.
  3. com.google.cloud.bqetl.mbdata.MusicBrainzDataObject - a general-purpose object that represents a row of MusicBrainz data
  4. com.google.cloud.bqetl.mbdata.MusicBrainzTransforms - the library of functions that implements the transforms used in the above pipelines and allows them to be reused
  5. com.google.cloud.bqetl.json.JSONReader - class for converting a JSON representation of a MusicBrainz row into a MusicBrainzDataObject
  6. com.google.cloud.bqetl.options.BQETLOptions - the options for the BQETL pipelines
  7. com.google.cloud.bqetl.JSONReaderTest - test for the JSONReader
  8. com.google.cloud.bqetl.mbdata.MusicBrainzTransformsTest - tests for the transforms library (both test classes can be run with Maven, as sketched below)
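
Assuming a conventional Maven Surefire setup (the repository builds with the included pom.xml), the unit tests can be run locally like this:

    # Run all unit tests.
    mvn clean test

    # Or run a single test class, e.g. the transforms tests.
    mvn test -Dtest=MusicBrainzTransformsTest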

The repository consists of the following scripts and resources:

  1. src/test/resources - data files for the test classes
  2. dataflow-staging-policy.json - a policy for expiring objects in the bucket used for staging the Dataflow jobs (it can be applied with gsutil, as shown below)
  3. run.sh - example script for running the pipelines using Maven
  4. pom.xml - Maven build script
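
For example, the expiration policy can be applied to the staging bucket with a standard gsutil command; the bucket name below follows the Quickstart conventions and is an assumption:

    # Apply the object-expiration lifecycle policy to the Dataflow staging bucket.
    gsutil lifecycle set dataflow-staging-policy.json gs://[YOUR_PROJECT_ID]-etl-staging-bucket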

Quickstart: Running the ETL Pipelines

NOTE: For more detailed instructions and pipeline details, see the full tutorial on Google Cloud Solutions.

  1. Follow the instructions to set up your project and service account:

    Dataflow using Java and Apache Maven: Before you begin

  2. Set your environment variables.

    For example:

    export PROJECT_ID=[YOUR_PROJECT_ID]
    export ZONE=[CHOOSE_AN_APPROPRIATE_ZONE] # e.g. us-east1-c
    export STAGING_BUCKET=${PROJECT_ID}-etl-staging-bucket
    export DATASET=musicbrainz
    
  3. First, run the simple pipeline using the script run.sh:

    export DESTINATION_TABLE=recordings_by_artists_dataflow
    ./run.sh simple

    When this pipeline finishes, you can review the results in the BigQuery table: recordings_by_artists_dataflow

  4. Then, run the nested pipeline using the script run.sh:

    export DESTINATION_TABLE=recordings_by_artists_dataflow_nested
    ./run.sh nested

    When this pipeline finishes, you can review the results in the BigQuery table: recordings_by_artists_dataflow_nested (a quick way to spot-check either table is sketched after this list).
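
As a sketch, either destination table can be spot-checked from the command line with the bq tool, assuming the environment variables and table names set above:

    # Preview a few rows of the flat table.
    bq head -n 10 ${PROJECT_ID}:${DATASET}.recordings_by_artists_dataflow

    # Count the rows in the nested table using Standard SQL.
    bq query --use_legacy_sql=false \
      "SELECT COUNT(*) AS row_count FROM ${DATASET}.recordings_by_artists_dataflow_nested"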

Contact Us

We welcome all usage-related questions on Stack Overflow tagged with google-cloud-dataflow.

Please use the issue tracker on GitHub to report any bugs, comments, or questions regarding SDK development.

More Information