gcloud-data-notes

Study notes for Data Engineer – Professional Certification Preparation for Google

Install and Configure Google Cloud SDK

  1. Download the gcloud SDK
  2. Activate your project with gcloud init, or switch profiles with gcloud config configurations activate [profile name]
  3. See Managing SDK Configurations and All How-to Guides
  4. The gcloud config files are located under ~/.config/ by default.

Core Commands

gcloud auth list
gcloud config list
gcloud info
gcloud help
gcloud help compute instances create

Google Big Data Products

Google Training


BigQuery


Dataflow / Apache Beam

This file contains text you can copy and paste for the examples in Cloud Academy's Introduction to Google Cloud Dataflow course.

See Apache Beam Java SDK Quickstart for additional tutorials.

Before you Begin

Note: A service account is required to run pipelines locally that read from or write to Cloud Storage (gs://) buckets.

See Before you Begin for steps to follow to create a service account, assign it a role, and download the key.

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.

gcloud auth application-default login

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/gcloud/application_default_credentials.json"

See Setting up a Java Development Environment for Apache Beam on Google Cloud Platform.

What is a pipeline?

Taken from slides for Learn Stream processing with Apache Beam

  • A pipeline is a ...

    • A Directed Acyclic Graph (DAG) of data transformations
    • Possibly unbounded collections of data flow on the edges
    • May include multiple sources and multiple sinks
    • Optimized and executed as a unit
  • A pipeline describes ...

    • What are you computing?
    • Where in event time?
    • When in processing time?
    • How do refinements relate?
  • A pipeline describes ...

    • What = Transformations
    • Where = Windowing
    • When = Watermarks + Triggers
    • How = Accumulation
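
A minimal sketch of these ideas in the Java SDK (illustrative only, not taken from the course repo): one source, one transformation, and one sink wired into a single DAG that the chosen runner optimizes and executes as a unit.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PipelineSketch {
  public static void main(String[] args) {
    // Build the DAG: read (source) -> count (transformation) -> write (sink).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadLines", TextIO.read().from("gs://dataflow-samples/shakespeare/kinglear.txt"))
     .apply("CountLines", Count.globally())
     .apply("FormatCount", MapElements.into(TypeDescriptors.strings())
         .via((Long count) -> "lines: " + count))
     .apply("WriteResult", TextIO.write().to("linecount"));
    p.run().waitUntilFinish();
  }
}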

Building and Running a Pipeline

Installing on your own computer: https://cloud.google.com/dataflow/docs/quickstarts
Transforms: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/package-summary.html

git clone https://github.com/cloudacademy/beam.git
cd beam/examples/java8
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalLineCount
gsutil cat gs://dataflow-samples/shakespeare/kinglear.txt | wc

Deploying a Pipeline on Cloud Dataflow

Setup Project and Bucket environment variables...

export PROJECT=$(gcloud config get-value core/project)
export BUCKET=gs://dataflow-$PROJECT

# or ...

nano ~/.profile
    export PROJECT=[Your Project ID]
    export BUCKET=gs://dataflow-$PROJECT

Create a Google Storage bucket...

gsutil mb $BUCKET
cd ~/beam/examples/java8

Run the MinimalLineCountArgs example:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalLineCountArgs \
  -Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp"

Run the LineCount example:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.LineCount \
  -Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp \
  --output=$BUCKET/linecount"

Custom Transforms

cd ~/beam/examples/java8
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalWordCount \
  -Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp \
  --output=$BUCKET/wordcounts"
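
MinimalWordCount's per-element processing is a custom transform. A hedged sketch of the pattern (the class name and regex below are illustrative, not the repo's exact code): a DoFn wrapped in ParDo that splits each line into words.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

/** Emits each word of an input line as a separate element. */
class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("[^a-zA-Z']+")) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

// Applied inside a pipeline:
// PCollection<String> words = lines.apply("ExtractWords", ParDo.of(new ExtractWordsFn()));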

Composite Transforms

cd ~/beam/examples/java8
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.complete.game.UserScore \
-Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp/ \
  --output=$BUCKET/scores"
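
UserScore groups several steps into composite transforms. A hedged sketch of the pattern (the names below are illustrative): a PTransform whose expand() method chains other transforms, so the whole unit can be applied, reused, and tested as a single step.

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

/** A composite transform: split lines into words, then count occurrences of each word. */
class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
        .apply("ExtractWords", ParDo.of(new ExtractWordsFn()))  // DoFn from the sketch above
        .apply("CountPerWord", Count.perElement());
  }
}

// Applied as a single step:
// PCollection<KV<String, Long>> counts = lines.apply("CountWords", new CountWords());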

Windowing

cd ~/beam/examples/java8
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.complete.game.HourlyTeamScore \
-Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp/ \
  --output=$BUCKET/scores \
  --startMin=2015-11-16-16-00 \
  --stopMin=2015-11-17-16-00"
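
HourlyTeamScore divides the event data into fixed one-hour windows based on event time before aggregating scores. A hedged sketch of the windowing step (GameEvent is a stand-in for the example's event type):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign each element to the one-hour window containing its event timestamp;
// downstream aggregations are then computed per window rather than globally.
PCollection<GameEvent> windowed =
    events.apply("FixedHourWindows",
        Window.<GameEvent>into(FixedWindows.of(Duration.standardHours(1))));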

Running LeaderBoard

bq mk game

API Console Credentials: https://console.developers.google.com/projectselector/apis/credentials

export GOOGLE_APPLICATION_CREDENTIALS="[Path]/[Credentials file]"
cd ~/beam/examples/java8
mvn compile exec:java  -Dexec.mainClass=org.apache.beam.examples.complete.game.injector.Injector \
-Dexec.args="$PROJECT game none"
cd ~/beam/examples/java8
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.complete.game.LeaderBoard \
-Dexec.args="--runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=$BUCKET/temp/ \
  --output=$BUCKET/leaderboard \
  --dataset=game \
  --topic=projects/$PROJECT/topics/game"
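
LeaderBoard reads an unbounded stream of events from the Pub/Sub topic and refines its results over time. A hedged sketch of the kind of window/trigger configuration this relies on (the durations are illustrative, not the example's exact values):

import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

events.apply("LeaderBoardWindows",
    Window.<GameEvent>into(FixedWindows.of(Duration.standardMinutes(60)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // When in processing time: speculative (early) results every five minutes ...
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5)))
            // ... plus updates whenever late data arrives within the allowed lateness.
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(10))))
        .withAllowedLateness(Duration.standardHours(2))
        // How refinements relate: each new pane accumulates everything seen so far.
        .accumulatingFiredPanes());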

Create a new Java project with the mvn archetype:generate command

mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.9.0 \
      -DgroupId=com.russomi \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=com.russomi.beam.examples \
      -DinteractiveMode=false

Execute the WordCount sample locally

mvn compile exec:java -Dexec.mainClass=com.russomi.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner

See the Java Quickstart for more.


Dataflow Templates

Cloud Dataflow templates allow you to stage your pipelines on Cloud Storage and execute them from a variety of environments. You can use one of the Google-provided templates or create your own.

Templates provide you with additional benefits compared to traditional Cloud Dataflow deployment, such as:

* Pipeline execution does not require you to recompile your code every time.
* You can execute your pipelines without the development environment and associated dependencies that are common with traditional deployment. This is useful for scheduling recurring batch jobs.
* Runtime parameters allow you to customize the execution of the pipeline.
* Non-technical users can execute templates with the Google Cloud Platform Console, gcloud command-line tool, or the REST API.

Cloud Dataflow templates use runtime parameters to accept values that are only available during pipeline execution. To customize the execution of a templated pipeline, you can pass these parameters to functions that run within the pipeline (such as a DoFn).
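
A hedged sketch of how a runtime parameter can be declared (the option name is illustrative): defining it as a ValueProvider<T> in the pipeline options defers resolution until the template is executed, so ValueProvider-aware sources, sinks, and DoFns can read the value at run time.

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

public interface TemplateOptions extends PipelineOptions {
  @Description("Cloud Storage path of the file to read; supplied when the template is run")
  ValueProvider<String> getInputFile();
  void setInputFile(ValueProvider<String> value);
}

// A ValueProvider-aware source reads the value at execution time, for example:
// p.apply("ReadLines", TextIO.read().from(options.getInputFile()));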

Creating and staging templates
   mvn compile exec:java \
     -Dexec.mainClass=com.example.myclass \
     -Dexec.args="--runner=DataflowRunner \
                  --project=[YOUR_PROJECT_ID] \
                  --stagingLocation=gs://[YOUR_BUCKET_NAME]/staging \
                  --templateLocation=gs://[YOUR_BUCKET_NAME]/templates/MyTemplate"

This example creates a batch job from a template that reads a text file and writes an output text file.

    gcloud dataflow jobs run [JOB_NAME] \
        --gcs-location gs://[YOUR_BUCKET_NAME]/templates/MyTemplate \
        --parameters inputFile=gs://[YOUR_BUCKET_NAME]/input/my_input.txt,outputFile=gs://[YOUR_BUCKET_NAME]/output/my_output

Google provides a set of open-source Cloud Dataflow templates.

Dataflow / Apache Beam Resources


Composer / Airflow

Cloud Composer (Apache Airflow) Notes

Resources

Blogs

Testing DAGs

Cloud Composer is based on Apache Airflow, and its operators and other components are contributed back upstream to Airflow.

The only issue is that there is no repository or tag that lets you pull down Composer's exact version of Airflow: it mixes code from Airflow 1.9, 1.10, and the master branch. Composer currently runs 1.9, with 1.10 available in beta; we are using 1.9 for now.

Because of this, use Composer's base Docker image for testing DAGs: we build on top of that image and run the tests there.

What to Verify

  • Validity and integrity of the DAG
  • The graph is acyclic
  • The workflow goes where we expect (task to task, etc.)
  • Properties of the DAG and of each operator (task) can be queried to test/verify them

Resources


These instructions will show you how to get a development environment up and running to start developing Java Dataflow jobs. By the end you’ll be able to run a Dataflow job locally in debug mode, execute code in a REPL to speed your development cycles, and submit your job to Google Cloud Dataflow.

Google Cloud Setup

export PROJECT=$(gcloud config get-value core/project)

Create a bucket

gsutil mb -c regional -l us-central1 gs://$PROJECT
export BUCKET=$PROJECT

Create a BigQuery dataset

bq mk DataflowJavaSetup
bq mk java_quickstart

Cloning the Dataflow Templates Repo

git clone https://github.com/GoogleCloudPlatform/DataflowTemplates

Run the Apache Beam Pipeline Locally

mvn clean && mvn compile

Create a Run/Debug configuration for the class that defines this Apache Beam pipeline

  • Add a new Application configuration
  • In the working directory drop-down, select %MAVEN_REPOSITORY%
  • Choose the class that defines the pipeline
  • Add the arguments below

--project=sandbox
--stagingLocation=gs://sandbox/staging
--tempLocation=gs://sandbox/temp
--templateLocation=gs://sandbox/templates/GcsToBigQueryTemplate.json
--runner=TestDataflowRunner
--javascriptTextTransformGcsPath=gs://sandbox/resources/UDF/sample_UDF.js
--JSONPath=gs://sandbox/resources/schemas/sample_schema.json
--javascriptTextTransformFunctionName=transform
--inputFilePattern=gs://sandbox/data/data.txt
--outputTable=sandbox:java_quickstart.colorful_coffee_people
--bigQueryLoadingTemporaryDirectory=gs://sandbox/bq_load_temp/

Set a break point and run the debugger

line 93 of TextIOToBigQuery.java

Right Click > “Debug TextIOToBigQuery”

Right Click and choose “Evaluate expression”

Run the Apache Beam Pipeline on the Google Cloud Dataflow Runner

sample_schema.json:

{
  "BigQuery Schema": [
    {
      "name": "location",
      "type": "STRING"
    },
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "STRING"
    },
    {
      "name": "color",
      "type": "STRING"
    },
    {
      "name": "coffee",
      "type": "STRING"
    }
  ]
}

sample_UDF.js:

function transform(line) {
  // Split the CSV line and map each field onto the schema defined above.
  var values = line.split(',');
  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);
  return jsonString;
}

data.txt:

US,joe,18,green,late
CAN,jan,33,red,cortado
MEX,jonah,56,yellow,cappuccino

Stage these files in Google Cloud Storage:

gsutil cp ./sample_UDF.js gs://$BUCKET/resources/UDF/
gsutil cp ./sample_schema.json gs://$BUCKET/resources/schemas/
gsutil cp ./data.txt gs://$BUCKET/data/

mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.TextIOToBigQuery -Dexec.cleanupDaemonThreads=false -Dexec.args=" \
--project=$PROJECT \
--stagingLocation=gs://$BUCKET/staging \
--tempLocation=gs://$BUCKET/temp \
--templateLocation=gs://$BUCKET/templates/GcsToBigQueryTemplate.json \
--runner=DataflowRunner"

Submit the job

gcloud dataflow jobs run colorful-coffee-people-gcs-test-to-big-query \
--gcs-location=gs://$BUCKET/templates/GcsToBigQueryTemplate.json \
--zone=us-central1-f \
--parameters=javascriptTextTransformGcsPath=gs://$BUCKET/resources/UDF/sample_UDF.js,JSONPath=gs://$BUCKET/resources/schemas/sample_schema.json,javascriptTextTransformFunctionName=transform,inputFilePattern=gs://$BUCKET/data/data.txt,outputTable=$PROJECT:java_quickstart.colorful_coffee_people,bigQueryLoadingTemporaryDirectory=gs://$BUCKET/bq_load_temp/