Apache Beam Example Code

An example Apache Beam project.

Description

These examples can be used for conference talks and self-study. They are based on the code in Beam's examples directory, with two changes: they use Beam as a dependency in pom.xml instead of being compiled together with Beam itself, and they write their output to local directories.

How to clone and run

  1. Open a terminal window.
  2. Run git clone git@github.com:eljefe6a/beamexample.git
  3. Run cd beamexample/BeamTutorial
  4. Run mvn compile
  5. Create the local output directory: mkdir output
  6. Run mvn compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Pdirect-runner
  7. Run cat output/user_score to verify the program ran correctly and the output file was created.
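
The check in step 7 can also be done from Java, for example from inside an IDE. The following is a minimal sketch using only the standard library. The class name OutputCheck and the method readOutput are hypothetical and not part of the tutorial code, and the sketch assumes the pipeline wrote a single plain-text file at output/user_score, as the cat command above implies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of the tutorial code.
class OutputCheck {

    // Returns the lines of the output file, failing loudly if the file
    // is missing or empty.
    static List<String> readOutput(Path file) {
        if (!Files.isRegularFile(file)) {
            throw new IllegalStateException("Output file not found: " + file);
        }
        try {
            List<String> lines = Files.readAllLines(file);
            if (lines.isEmpty()) {
                throw new IllegalStateException("Output file is empty: " + file);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default: the path used in step 7, relative to BeamTutorial.
        Path file = Paths.get(args.length > 0 ? args[0] : "output/user_score");
        readOutput(file).forEach(System.out::println);
    }
}
```

Run from the BeamTutorial directory, this prints the same content as cat output/user_score, or throws an exception naming the missing or empty file.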

Using a Java IDE

  1. Follow the IDE Setup instructions in the Apache Beam Contribution Guide.

Other Runners

Apache Flink

  1. Follow the first steps from Flink's Quickstart to download Flink.
  2. Create the output directory.
  3. To run on a JVM-local cluster: mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=[local]' -Pflink-runner
  4. To run on an out-of-process local cluster (note that the steps below should also work on a real cluster if you have one running):
    1. Start a local Flink cluster.
    2. Navigate to the WebUI (typically http://localhost:8081), click JobManager, and note the value of jobmanager.rpc.port. The default is probably 6123.
    3. Run mvn package -Pflink-runner to generate a JAR file, and note the location of the generated JAR (probably ./target/BeamTutorial-bundled-flink.jar).
    4. Run mvn -X -e compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:6123 --filesToStage=./target/BeamTutorial-bundled-flink.jar' -Pflink-runner, replacing the defaults for port and JAR file if they differ.
    5. Check in the WebUI to see the job listed.
  5. Run cat output/user_score to verify the pipeline ran correctly and the output file was created.
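
Before submitting to the out-of-process cluster, it can help to confirm that something is listening on the JobManager RPC port. The sketch below is one way to do that with the standard library. The class name PortCheck and its method are hypothetical, and the defaults localhost and 6123 are taken from the steps above. A successful connection only shows that the port is open, not that the listener is Flink.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of the tutorial code or of Flink.
class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults: the host and port mentioned in the steps above.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        String status = isListening(host, port, 2000)
                ? " is accepting connections"
                : " is not reachable";
        System.out.println(host + ":" + port + status);
    }
}
```

If the port is not reachable, check the jobmanager.rpc.port value in the WebUI again before changing the --flinkMaster argument.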

Apache Spark

  1. Create the output directory.
  2. Allow all users to write to the output directory, because Spark may run as a different user: chmod 1777 output
  3. Change the output file to a fully qualified path. For example, change this("output/user_score"); to this("/home/vmuser/output/user_score");
  4. Run mvn package -Pspark-runner
  5. Run spark-submit --jars ./target/BeamTutorial-bundled-spark.jar --class org.apache.beam.examples.tutorial.game.solution.Exercise2 --master yarn-client ./target/BeamTutorial-bundled-spark.jar --runner=SparkRunner
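
As an alternative to hard-coding a path such as /home/vmuser/output/user_score in step 3, the fully qualified path can be computed from the relative one. This is a minimal sketch under stated assumptions: the class OutputPaths and the method fullyQualified are hypothetical, and the result depends on the working directory of the JVM that runs the code, so it only helps if that JVM is started from the directory containing output.

```java
import java.nio.file.Paths;

// Hypothetical helper, not part of the tutorial code.
class OutputPaths {

    // Resolves a relative path against the JVM's current working directory
    // and returns it as an absolute, normalized path string.
    static String fullyQualified(String relative) {
        return Paths.get(relative).toAbsolutePath().normalize().toString();
    }

    public static void main(String[] args) {
        // Prints the absolute form of the relative path used in the exercises.
        System.out.println(fullyQualified("output/user_score"));
    }
}
```

Because fullyQualified is static, a call to it could stand in for the string literal passed to this(...), though how the exercise classes use that value is not shown here.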

Google Cloud Dataflow

  1. Follow the steps in either of the Java quickstarts for Cloud Dataflow to initialize your Google Cloud setup.
  2. Create a bucket on Google Cloud Storage for staging and output.
  3. Run mvn -X compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Dexec.args='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/' -Pdataflow-runner, after replacing <YOUR-GOOGLE-CLOUD-PROJECT> and <YOUR-BUCKET-NAME> with the appropriate values.
  4. Check the Cloud Dataflow Console to see the job running.
  5. Check the output bucket to see the generated output: https://console.cloud.google.com/storage/browser/<YOUR-BUCKET-NAME>/
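
The placeholder substitution in step 3 is easy to get wrong by hand, since the bucket name appears twice. The sketch below assembles the value for -Dexec.args from a project ID and a bucket name, using exactly the four flags shown in step 3. The class DataflowArgs and the method execArgs are hypothetical, and the sketch does not check that the project or bucket exists.

```java
// Hypothetical helper, not part of the tutorial code.
class DataflowArgs {

    // Builds the pipeline arguments from step 3 for the given project and bucket.
    static String execArgs(String project, String bucket) {
        requireFilledIn("project", project);
        requireFilledIn("bucket", bucket);
        return String.join(" ",
                "--runner=DataflowRunner",
                "--project=" + project,
                "--gcpTempLocation=gs://" + bucket,
                "--outputPrefix=gs://" + bucket + "/output/");
    }

    // Rejects empty values and placeholders that were left unreplaced.
    private static void requireFilledIn(String name, String value) {
        if (value == null || value.isEmpty() || value.contains("<")) {
            throw new IllegalArgumentException("Set a real value for " + name);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: DataflowArgs <project> <bucket>");
            return;
        }
        // Prints the option in the form expected on the mvn command line.
        System.out.println("-Dexec.args='" + execArgs(args[0], args[1]) + "'");
    }
}
```

The printed option can be pasted into the mvn command from step 3 in place of the -Dexec.args value shown there.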

Further Reading