To develop Apache Beam pipelines in Eclipse:
- Install the latest version of the Java Development Kit (the JDK, not just the runtime) from http://www.java.com/
- Install Maven from http://maven.apache.org/
- Install Eclipse for Java Developers from http://www.eclipse.org/
- If you have not already done so, clone the repository:
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
- Open Eclipse and choose File | Import | Maven | Existing Maven Projects.
Browse to, and select
data-science-on-gcp/08_dataflow/chapter8/pom.xml
- You can now click on any Java file that has a main() method (e.g., CreateTrainingDataset1.java) and select Run As | Java Application. Note that you might have to change paths (replace vlakshmanan with your user name and cloud-training-demos-ml with your bucket).
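Before running anything from Eclipse, it can help to list where the author's user name and bucket still appear in the sources. The following is a minimal sketch, not part of the repository: the function name find_hardcoded_names is hypothetical, and it assumes the two names occur literally in .java files under the project directory.

```shell
# Hypothetical helper (not part of the repository): list every line in the
# Java sources that still mentions the author's user name or bucket.
# Usage: find_hardcoded_names [project-dir]
find_hardcoded_names() {
  # Default to the project directory named in the import step above.
  dir="${1:-data-science-on-gcp/08_dataflow/chapter8}"
  # -r: recurse into subdirectories, -n: print line numbers,
  # --include: look only at Java source files.
  grep -rn --include='*.java' \
    -e 'vlakshmanan' -e 'cloud-training-demos-ml' "$dir"
}
```

Calling find_hardcoded_names from the directory that contains data-science-on-gcp prints one file:line:text entry per match; no output means nothing is left to replace.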
To run the Dataflow pipelines from CloudShell:
- If you haven't already done so, git clone the repository.
- Launch a Dataflow pipeline (it will take 20-25 minutes):
cd data-science-on-gcp/08_dataflow
./create_datasets.sh bucket-name max-num-workers
- Visit https://console.cloud.google.com/dataflow to monitor the running pipeline.
- Once the pipeline has finished, go to https://console.cloud.google.com/dataflow to see that you now have new data.
- Also load the augmented data into BigQuery:
bash ./to_bq.sh bucket-name
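The two CloudShell commands can be wrapped in a small helper that checks its arguments and stops at the first failure. This is only a sketch: launch_and_load, run_step, and the DRY_RUN variable are hypothetical names, not part of the repository. It also assumes that create_datasets.sh returns only after the pipeline has finished; if it returns as soon as the job is submitted, run the two steps separately as described above.

```shell
# Hypothetical wrapper (not part of the repository). Run it from
# data-science-on-gcp/08_dataflow, where both scripts live.

# Run a command, or just print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Usage: launch_and_load bucket-name max-num-workers
launch_and_load() {
  if [ "$#" -ne 2 ]; then
    echo "usage: launch_and_load bucket-name max-num-workers" >&2
    return 2
  fi
  # Reject an empty or non-numeric worker count before launching anything.
  case "$2" in
    ''|*[!0-9]*)
      echo "max-num-workers must be a whole number" >&2
      return 2
      ;;
  esac
  # Assumption: this script blocks until the Dataflow pipeline is done.
  run_step ./create_datasets.sh "$1" "$2" || return 1
  run_step bash ./to_bq.sh "$1"
}
```

Running DRY_RUN=1 launch_and_load bucket-name max-num-workers with real values prints the two commands without executing them, which is a cheap way to check the arguments before starting a 20-25 minute job.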