This tutorial shows how to store Avro SpecificRecord objects in BigQuery using Cloud Dataflow by automatically generating the table schema and transforming the input elements. It also shows how to use Avro-generated classes to materialize or transmit intermediate data between workers in your Cloud Dataflow pipeline.

See the related article for the complete set of steps to follow in this tutorial.
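In outline, the pipeline reads Avro-encoded messages from Pub/Sub, decodes each payload into the Avro-generated SpecificRecord class, and registers an `AvroCoder` so that Dataflow can serialize those objects between workers. The following is a simplified, hypothetical sketch of that pattern, not the code in `BeamAvro`; `OrderDetails` stands in for the class generated from `orderdetails.avsc`.

```java
import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder; // lives under org.apache.beam.sdk.extensions.avro.coders in newer Beam releases
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class AvroDecodeSketch {

  // Reads Avro-encoded Pub/Sub messages and decodes them into the generated class.
  // "OrderDetails" is an assumed name for the SpecificRecord generated from orderdetails.avsc.
  static PCollection<OrderDetails> readOrders(Pipeline pipeline, String topic) {
    return pipeline
        .apply("ReadPubsub", PubsubIO.readMessages().fromTopic(topic))
        .apply("DecodeAvro", ParDo.of(new DoFn<PubsubMessage, OrderDetails>() {
          @ProcessElement
          public void process(ProcessContext ctx) throws IOException {
            // Decode the binary Avro payload into the generated SpecificRecord.
            SpecificDatumReader<OrderDetails> reader =
                new SpecificDatumReader<>(OrderDetails.class);
            BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(ctx.element().getPayload(), null);
            ctx.output(reader.read(null, decoder));
          }
        }))
        // AvroCoder tells Dataflow how to materialize and shuffle OrderDetails
        // objects between workers.
        .setCoder(AvroCoder.of(OrderDetails.class));
  }
}
```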
Contents of this repository:

- `BeamAvro`: Java code for the Apache Beam pipeline deployed on Cloud Dataflow.
- `generator`: Python code for the randomized event generator.
To run the example:
- Update the configuration by editing `env.sh` (an illustrative sketch of the variables it exports appears after this list).

- Set the environment variables:

  ```sh
  source env.sh
  ```
- Generate the Java beans from the Avro schema file and build the pipeline:

  ```sh
  mvn clean compile package
  ```

- Run the pipeline (a sketch of how the BigQuery table schema can be derived from the Avro schema passed through `--avroSchema` appears after this list):

  ```sh
  java -cp target/BeamAvro-bundled-1.0-SNAPSHOT.jar \
    com.google.cloud.solutions.beamavro.AvroToBigQuery \
    --project=$GOOGLE_CLOUD_PROJECT \
    --runner=DataflowRunner \
    --stagingLocation=gs://$MY_BUCKET/stage/ \
    --tempLocation=gs://$MY_BUCKET/temp/ \
    --inputPath=projects/$GOOGLE_CLOUD_PROJECT/topics/$MY_TOPIC \
    --workerMachineType=n1-standard-1 \
    --region=$REGION \
    --dataset=$BQ_DATASET \
    --bqTable=$BQ_TABLE \
    --outputPath=$AVRO_OUT \
    --avroSchema="$(<../orderdetails.avsc)"
  ```
- Run the event generation script:

  - Create a Python virtual environment:

    ```sh
    python3 -m venv ~/generator-venv
    source ~/generator-venv/bin/activate
    ```

  - Install the Python dependencies:

    ```sh
    pip install -r generator/requirements.txt
    ```

  - Run the generator:

    ```sh
    python generator/gen.py -p $GOOGLE_CLOUD_PROJECT -t $MY_TOPIC -n 100 -f avro
    ```
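For reference, `env.sh` (the first step above) is expected to export the variables used by the commands in this README. The values below are placeholders for illustration only; the actual configuration is described in the related article.

```sh
# Illustrative placeholder values; substitute your own configuration.
export GOOGLE_CLOUD_PROJECT=my-project-id     # Google Cloud project ID
export MY_BUCKET=my-dataflow-bucket           # Cloud Storage bucket for staging and temp files
export MY_TOPIC=my-avro-topic                 # Pub/Sub topic the generator publishes to
export REGION=us-central1                     # Dataflow region
export BQ_DATASET=my_dataset                  # BigQuery dataset
export BQ_TABLE=my_table                      # BigQuery table
export AVRO_OUT=gs://my-dataflow-bucket/out/  # output path for Avro files
```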
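The pipeline does not rely on a hand-written BigQuery schema; the table schema is generated from the Avro schema passed through `--avroSchema`. The sketch below shows one way such a conversion can look. It is a simplified illustration that only handles a few primitive Avro types, not the conversion code in `BeamAvro`.

```java
import java.util.ArrayList;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.avro.Schema;

public class AvroToTableSchemaSketch {

  // Builds a BigQuery TableSchema from the fields of an Avro record schema.
  static TableSchema toTableSchema(Schema avroSchema) {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (Schema.Field field : avroSchema.getFields()) {
      fields.add(new TableFieldSchema()
          .setName(field.name())
          .setType(toBigQueryType(field.schema().getType())));
    }
    return new TableSchema().setFields(fields);
  }

  // Minimal type mapping; real code must also handle unions, nested records,
  // arrays, and Avro logical types such as timestamps.
  static String toBigQueryType(Schema.Type type) {
    switch (type) {
      case INT:
      case LONG:
        return "INTEGER";
      case FLOAT:
      case DOUBLE:
        return "FLOAT";
      case BOOLEAN:
        return "BOOLEAN";
      default:
        return "STRING";
    }
  }
}
```

A schema built this way can be passed to `BigQueryIO.writeTableRows().withSchema(...)` when the decoded records are written to the `$BQ_DATASET.$BQ_TABLE` table.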