bigquery-ingest-avro-dataflow-sample

Stream Avro SpecificRecord objects into BigQuery using Cloud Dataflow

This tutorial shows how to store Avro SpecificRecord objects in BigQuery using Cloud Dataflow, automatically generating the table schema and transforming the input elements. It also showcases how Avro-generated classes can be used to materialize or transmit intermediate data between workers in a Cloud Dataflow pipeline.
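As a rough illustration of the approach, a minimal Beam pipeline could look like the sketch below. This is not the exact code in BeamAvro: OrderDetails stands in for the bean generated from orderdetails.avsc, the field, topic, and table names are placeholders, and the AvroUtils package path differs slightly between Beam versions.

    // Sketch: read Avro SpecificRecord objects from Pub/Sub, derive the BigQuery
    // schema from the Avro schema, and stream rows into BigQuery.
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class AvroToBigQuerySketch {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // The BigQuery table schema is generated automatically from the Avro schema
        // attached to the generated SpecificRecord class (OrderDetails is assumed here).
        TableSchema tableSchema =
            BigQueryUtils.toTableSchema(AvroUtils.toBeamSchema(OrderDetails.getClassSchema()));

        pipeline
            // Messages are decoded straight into the Avro-generated class, so the
            // intermediate PCollection carries SpecificRecord objects between workers.
            .apply(PubsubIO.readAvros(OrderDetails.class)
                .fromTopic("projects/my-project/topics/my-topic"))
            // Map each record to a TableRow ("id" is a hypothetical field; the real
            // pipeline converts every field of the record).
            .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((OrderDetails order) -> new TableRow().set("id", order.getId())))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(tableSchema));

        pipeline.run();
      }
    }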

Refer to the related article for the complete set of steps in this tutorial.

Contents of this repository:

  • BeamAvro: Java code for the Apache Beam pipeline deployed on Cloud Dataflow.
  • generator: Python code for the randomized event generator.

To run the example:

  1. Update the configuration in env.sh (project ID, Cloud Storage bucket, Pub/Sub topic, region, BigQuery dataset and table, and Avro output path used by the commands below)

  2. Set the environment variables:

    source env.sh
  3. Generate Java beans from the Avro schema file and run the Dataflow pipeline:

    mvn clean compile package   

    Run the pipeline (a hypothetical sketch of the matching pipeline options interface appears after step 4):

    java -cp target/BeamAvro-bundled-1.0-SNAPSHOT.jar \
    com.google.cloud.solutions.beamavro.AvroToBigQuery \
    --project=$GOOGLE_CLOUD_PROJECT \
    --runner=DataflowRunner \
    --stagingLocation=gs://$MY_BUCKET/stage/ \
    --tempLocation=gs://$MY_BUCKET/temp/ \
    --inputPath=projects/$GOOGLE_CLOUD_PROJECT/topics/$MY_TOPIC \
    --workerMachineType=n1-standard-1 \
    --region=$REGION \
    --dataset=$BQ_DATASET \
    --bqTable=$BQ_TABLE \
    --outputPath=$AVRO_OUT \
    --avroSchema="$(<../orderdetails.avsc)"
  4. Run the event generation script:

    1. Create a Python virtual environment:
      python3 -m venv ~/generator-venv
      source ~/generator-venv/bin/activate
    2. Install the Python dependencies:
      pip install -r generator/requirements.txt
    3. Run the generator:
      python generator/gen.py -p $GOOGLE_CLOUD_PROJECT -t $MY_TOPIC -n 100 -f avro
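
For reference, the pipeline-specific flags passed to AvroToBigQuery in step 3 correspond to Beam pipeline options, while the standard Dataflow flags (--project, --runner, --region, --stagingLocation, --tempLocation, --workerMachineType) are handled by Beam's built-in Dataflow options. The interface below is a hypothetical sketch of how such options could be declared; the names mirror the flags above, but it is not necessarily the interface defined in BeamAvro.

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.Validation;

    // Hypothetical options interface; Beam parses --inputPath, --dataset, --bqTable,
    // --outputPath, and --avroSchema from the command line into these properties.
    public interface AvroToBigQueryOptions extends PipelineOptions {
      @Description("Pub/Sub topic to read Avro records from")
      @Validation.Required
      String getInputPath();
      void setInputPath(String value);

      @Description("BigQuery dataset name")
      String getDataset();
      void setDataset(String value);

      @Description("BigQuery table name")
      String getBqTable();
      void setBqTable(String value);

      @Description("Cloud Storage path for Avro output files")
      String getOutputPath();
      void setOutputPath(String value);

      @Description("Avro schema as a JSON string (contents of orderdetails.avsc)")
      String getAvroSchema();
      void setAvroSchema(String value);
    }

    // Typical usage in the pipeline's main():
    //   AvroToBigQueryOptions options = PipelineOptionsFactory.fromArgs(args)
    //       .withValidation().as(AvroToBigQueryOptions.class);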