This tutorial shows how to store Avro SpecificRecord objects in BigQuery using Cloud Dataflow by automatically generating the table schema and transforming the input elements. It also shows how to use Avro-generated classes to materialize or transmit intermediate data between workers in your Cloud Dataflow pipeline.

See the related article for the complete set of steps to follow in this tutorial.
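In outline, the pipeline reads Avro-encoded messages from Pub/Sub, decodes each payload into the Avro-generated SpecificRecord class, and registers an `AvroCoder` so that Dataflow can serialize those objects between workers. The following is a simplified, hypothetical sketch of that pattern, not the code in `BeamAvro`; `OrderDetails` stands in for the class generated from `orderdetails.avsc`.

```java
import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder; // lives under org.apache.beam.sdk.extensions.avro.coders in newer Beam releases
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class AvroDecodeSketch {

  // Reads Avro-encoded Pub/Sub messages and decodes them into the generated class.
  // "OrderDetails" is an assumed name for the SpecificRecord generated from orderdetails.avsc.
  static PCollection<OrderDetails> readOrders(Pipeline pipeline, String topic) {
    return pipeline
        .apply("ReadPubsub", PubsubIO.readMessages().fromTopic(topic))
        .apply("DecodeAvro", ParDo.of(new DoFn<PubsubMessage, OrderDetails>() {
          @ProcessElement
          public void process(ProcessContext ctx) throws IOException {
            // Decode the binary Avro payload into the generated SpecificRecord.
            SpecificDatumReader<OrderDetails> reader =
                new SpecificDatumReader<>(OrderDetails.class);
            BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(ctx.element().getPayload(), null);
            ctx.output(reader.read(null, decoder));
          }
        }))
        // AvroCoder tells Dataflow how to materialize and shuffle OrderDetails
        // objects between workers.
        .setCoder(AvroCoder.of(OrderDetails.class));
  }
}
```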
Contents of this repository:

- `BeamAvro`: Java code for the Apache Beam pipeline deployed on Cloud Dataflow.
- `generator`: Python code for the randomized event generator.
To run the example:
- Update the configuration by editing `env.sh` (an illustrative sketch of the variables it exports appears after this list).

- Set the environment variables:

  ```sh
  source env.sh
  ```
- Generate the Java beans from the Avro schema file and build the pipeline:

  ```sh
  mvn clean compile package
  ```

- Run the pipeline (a sketch of how the BigQuery table schema can be derived from the Avro schema passed through `--avroSchema` appears after this list):

  ```sh
  java -cp target/BeamAvro-bundled-1.0-SNAPSHOT.jar \
    com.google.cloud.solutions.beamavro.AvroToBigQuery \
    --project=$GOOGLE_CLOUD_PROJECT \
    --runner=DataflowRunner \
    --stagingLocation=gs://$MY_BUCKET/stage/ \
    --tempLocation=gs://$MY_BUCKET/temp/ \
    --inputPath=projects/$GOOGLE_CLOUD_PROJECT/topics/$MY_TOPIC \
    --workerMachineType=n1-standard-1 \
    --region=$REGION \
    --dataset=$BQ_DATASET \
    --bqTable=$BQ_TABLE \
    --outputPath=$AVRO_OUT \
    --avroSchema="$(<../orderdetails.avsc)"
  ```
- Run the event generation script:

  - Create a Python virtual environment:

    ```sh
    python3 -m venv ~/generator-venv
    source ~/generator-venv/bin/activate
    ```

  - Install the Python dependencies:

    ```sh
    pip install -r generator/requirements.txt
    ```

  - Run the generator:

    ```sh
    python generator/gen.py -p $GOOGLE_CLOUD_PROJECT -t $MY_TOPIC -n 100 -f avro
    ```
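For reference, `env.sh` (the first step above) is expected to export the variables used by the commands in this README. The values below are placeholders for illustration only; the actual configuration is described in the related article.

```sh
# Illustrative placeholder values; substitute your own configuration.
export GOOGLE_CLOUD_PROJECT=my-project-id     # Google Cloud project ID
export MY_BUCKET=my-dataflow-bucket           # Cloud Storage bucket for staging and temp files
export MY_TOPIC=my-avro-topic                 # Pub/Sub topic the generator publishes to
export REGION=us-central1                     # Dataflow region
export BQ_DATASET=my_dataset                  # BigQuery dataset
export BQ_TABLE=my_table                      # BigQuery table
export AVRO_OUT=gs://my-dataflow-bucket/out/  # output path for Avro files
```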
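The pipeline does not rely on a hand-written BigQuery schema; the table schema is generated from the Avro schema passed through `--avroSchema`. The sketch below shows one way such a conversion can look. It is a simplified illustration that only handles a few primitive Avro types, not the conversion code in `BeamAvro`.

```java
import java.util.ArrayList;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.avro.Schema;

public class AvroToTableSchemaSketch {

  // Builds a BigQuery TableSchema from the fields of an Avro record schema.
  static TableSchema toTableSchema(Schema avroSchema) {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (Schema.Field field : avroSchema.getFields()) {
      fields.add(new TableFieldSchema()
          .setName(field.name())
          .setType(toBigQueryType(field.schema().getType())));
    }
    return new TableSchema().setFields(fields);
  }

  // Minimal type mapping; real code must also handle unions, nested records,
  // arrays, and Avro logical types such as timestamps.
  static String toBigQueryType(Schema.Type type) {
    switch (type) {
      case INT:
      case LONG:
        return "INTEGER";
      case FLOAT:
      case DOUBLE:
        return "FLOAT";
      case BOOLEAN:
        return "BOOLEAN";
      default:
        return "STRING";
    }
  }
}
```

A schema built this way can be passed to `BigQueryIO.writeTableRows().withSchema(...)` when the decoded records are written to the `$BQ_DATASET.$BQ_TABLE` table.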