This is the sample code for the Performing ETL from a Relational Database into BigQuery using Dataflow tutorial. The tutorial explains how to ingest highly normalized (OLTP-style) relational data into BigQuery using Dataflow. To understand this sample code, it is recommended that you review the Apache Beam programming model.
This tutorial relies on the MusicBrainz dataset.
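If you are new to Apache Beam, the following is a minimal, self-contained sketch of the programming model (read, apply transforms, write). It is not one of the tutorial pipelines; the bucket paths and class name are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalBeamSketch {
  public static void main(String[] args) {
    // Build pipeline options from command-line arguments and create the pipeline.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read newline-delimited text (placeholder path).
        .apply("ReadLines", TextIO.read().from("gs://your-staging-bucket/input/*.json"))
        // Each transform produces a new immutable PCollection; here we just uppercase each line.
        .apply("UppercaseLines",
            MapElements.into(TypeDescriptors.strings()).via((String line) -> line.toUpperCase()))
        // Write the results back out; the tutorial pipelines write to BigQuery instead.
        .apply("WriteLines", TextIO.write().to("gs://your-staging-bucket/output/result"));

    pipeline.run().waitUntilFinish();
  }
}
```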
Note that this tutorial assumes that either:
- you are using Cloud Shell, or
- you have Java 8, Maven, and the Google Cloud SDK installed.
It also requires that you have a project, a Dataflow staging bucket, and a BigQuery dataset already created.
The repository consists of the following Java classes:
- com.google.cloud.bqetl.BQETLSimple - simple pipeline that ingests MusicBrainz artist and recording data as a flat table of artists' recordings.
- com.google.cloud.bqetl.BQETLNested - revision of the simple pipeline that nests each artist's recordings as a repeated record inside the BigQuery table row for that artist.
- com.google.cloud.bqetl.mbdata.MusicBrainzDataObject - a general-purpose object that represents a row of MusicBrainz data.
- com.google.cloud.bqetl.mbdata.MusicBrainzTransforms - the library of functions that implements the transforms used in the above pipelines and allows them to be reused.
- com.google.cloud.bqetl.json.JSONReader - class for converting a JSON representation of a MusicBrainz row into a MusicBrainzDataObject.
- com.google.cloud.bqetl.options.BQETLOptions - the options for the BQETL pipelines.
- com.google.cloud.bqetl.JSONReaderTest - test for the JSONReader.
- com.google.cloud.bqetl.mbdata.MusicBrainzTransformsTest - tests for the transforms library.
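To make the roles of JSONReader and MusicBrainzDataObject concrete, here is an illustrative sketch (not the repository's actual implementation) of turning one newline-delimited JSON record into a map-backed row object; the class name, field names, and the use of Gson are assumptions for the example only.

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.util.HashMap;
import java.util.Map;

/** Illustrative map-backed row object; not the repository's MusicBrainzDataObject. */
public class RowSketch {
  private final Map<String, Object> columns = new HashMap<>();

  public void add(String name, Object value) { columns.put(name, value); }

  public Object get(String name) { return columns.get(name); }

  /** Parse one JSON record into a map of column name to value (values kept as strings for simplicity). */
  public static RowSketch fromJson(String json) {
    RowSketch row = new RowSketch();
    JsonObject obj = JsonParser.parseString(json).getAsJsonObject();
    for (Map.Entry<String, JsonElement> field : obj.entrySet()) {
      JsonElement value = field.getValue();
      row.add(field.getKey(), value.isJsonNull() ? null : value.getAsString());
    }
    return row;
  }

  public static void main(String[] args) {
    RowSketch artist = RowSketch.fromJson("{\"id\": \"5\", \"name\": \"Some Artist\"}");
    System.out.println(artist.get("name")); // prints: Some Artist
  }
}
```

A map-backed object like this keeps a pipeline schema-agnostic until the rows are converted to BigQuery TableRows.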
The repository also contains the following scripts and resources:
- src/test/resources - data files for the test classes
- dataflow-staging-policy.json - a policy for expiring objects in the bucket used for staging the Dataflow jobs
- run.sh - example script for running the pipelines using Maven
- pom.xml - the Maven build file
NOTE: For more detailed instructions and pipeline details, see the full tutorial on Google Cloud Solutions.
- Follow the instructions in the tutorial to set up your project and service account.
- Set your environment variables. For example:

        export PROJECT_ID=[YOUR_PROJECT_ID]
        export ZONE=[CHOOSE_AN_APPROPRIATE_ZONE]  # e.g. us-east1-c
        export STAGING_BUCKET=${PROJECT_ID}-etl-staging-bucket
        export DATASET=musicbrainz
- First, run the simple pipeline using the script run.sh:

        export DESTINATION_TABLE=recordings_by_artists_dataflow
        ./run.sh simple

  When this pipeline finishes, you can review the results in the BigQuery table recordings_by_artists_dataflow.
- Then, run the nested pipeline using the script run.sh:

        export DESTINATION_TABLE=recordings_by_artists_dataflow_nested
        ./run.sh nested

  When this pipeline finishes, you can review the results in the BigQuery table recordings_by_artists_dataflow_nested. A short sketch contrasting the flat and nested row shapes follows these steps.
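To visualize the difference between the two destination tables, here is a hedged sketch of how a flat row and a nested row might be constructed with the BigQuery TableRow API; the field names are hypothetical and not necessarily the tutorial's exact schema.

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.List;

public class RowShapeSketch {
  public static void main(String[] args) {
    // Flat shape (recordings_by_artists_dataflow): one row per (artist, recording) pair,
    // so the artist columns repeat on every row.
    TableRow flatRow = new TableRow()
        .set("artist_name", "Some Artist")
        .set("recording_name", "First Recording")
        .set("recording_length", 240000);

    // Nested shape (recordings_by_artists_dataflow_nested): one row per artist, with the
    // recordings collected into a repeated RECORD column.
    List<TableRow> recordings = Arrays.asList(
        new TableRow().set("recording_name", "First Recording").set("recording_length", 240000),
        new TableRow().set("recording_name", "Second Recording").set("recording_length", 180000));
    TableRow nestedRow = new TableRow()
        .set("artist_name", "Some Artist")
        .set("artist_recordings", recordings);

    System.out.println(flatRow);
    System.out.println(nestedRow);
  }
}
```

When querying a table with a repeated record column in standard SQL, UNNEST can be used to flatten the nested recordings on demand.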
We welcome all usage-related questions on Stack Overflow tagged with google-cloud-dataflow.
Please use the issue tracker on GitHub to report any bugs, comments or questions regarding SDK development.