christianramsey/dataflow_export_gcmle

Java

8. Time-windowed aggregate features

[optional] Setup Dataflow Development environment

To develop Apache Beam pipelines in Eclipse:

Install the latest version of the Java SDK (not just the runtime) from http://www.java.com/
Install Maven from http://maven.apache.org/
Install Eclipse for Java Developers from http://www.eclipse.org/

If you have not already done so, git clone the respository:

git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp

Open Eclipse and do File | Import | From existing Maven files. Browse to, and select data-science-on-gcp/08_dataflow/chapter8/pom.xml
You can now click on any Java file with a main() (e.g: CreateTrainingDataset1.java) and select Run As | Java Application. Note that you might have to change paths (replace vlakshmanan by your user name, and cloud-training-demos-ml by your bucket).

Run Dataflow from command-line to create augmented dataset for machine learning

To run the Dataflow pipelines from CloudShell:

If you haven't already done so, git clone the repository.

Launch a Dataflow pipeline (it will take 20-25 minutes):

cd data-science-on-gcp/08_dataflow
./create_datasets.sh bucket-name  max-num-workers

Visit https://console.cloud.google.com/dataflow to monitor the running pipeline.
Once pipeline is finished, go to https://console.cloud.google.com/dataflow to see that you now have new data.
Load the augmented data into BigQuery also:
```
bash ./to_bq.sh bucket-name
```