This project automates the entire flow: a Java program uses the GCP client libraries to move the necessary data to Cloud Storage (uploading the input files, JARs, templates, etc.), and a workflow template then runs a series of Spark jobs on Cloud Dataproc.
Stackdriver is used for email notifications, monitoring, and logs.
The project is split into multiple modules:-
- Java program - creates the bucket and uploads the files
- Conversion - transforms file formats such as CSV and TXT to Parquet for faster big data processing
- Pre Processing - brings every dataset to a specific type (e.g. person-specific or area-specific)
- Merging - after preprocessing, links the datasets together based on a common attribute
- Model - generates the points for each person based on some condition
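The Merging module links datasets on a shared attribute; in the real pipeline this presumably happens as a Spark join on Dataproc, but the linking idea can be illustrated in plain Java (the attribute name and sample data below are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the Merging step: link two pre-processed datasets
// on a shared attribute (here "personId"). The real project would do
// this at scale with Spark; field names here are placeholders.
public class MergeSketch {
    static List<Map<String, String>> linkOn(String key,
            List<Map<String, String>> left, List<Map<String, String>> right) {
        // Index the right-hand dataset by the linking attribute
        Map<String, Map<String, String>> index = new HashMap<>();
        for (Map<String, String> row : right) {
            index.put(row.get(key), row);
        }
        // Keep only left-hand rows that have a match, merging the columns
        List<Map<String, String>> merged = new ArrayList<>();
        for (Map<String, String> row : left) {
            Map<String, String> match = index.get(row.get(key));
            if (match != null) {
                Map<String, String> combined = new HashMap<>(row);
                combined.putAll(match);
                merged.add(combined);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> persons = List.of(
                Map.of("personId", "1", "name", "Ann"),
                Map.of("personId", "2", "name", "Bob"));
        List<Map<String, String>> points = List.of(
                Map.of("personId", "1", "points", "42"));
        List<Map<String, String>> merged = linkOn("personId", persons, points);
        System.out.println(merged.size());
        System.out.println(merged.get(0).get("points"));
    }
}
```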
- Go to the APIs and Services page of the GCP Console and click on Credentials
- Create credentials with service account keys
- Select JSON and a new service account
- Provide the role of Project Owner
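Once the JSON key is downloaded, the GCP client libraries in the Java program can find it through the standard GOOGLE_APPLICATION_CREDENTIALS environment variable; the path below is only a placeholder:

```shell
# Point the GCP client libraries at the downloaded service-account key.
# The path is a placeholder for wherever the JSON key was saved.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/cloud-migration-demo.json"
```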
- Build the Conversion, Linking, Pre Processing, and Model Maven projects locally
- Copy the input files and JARs to the specific folders in the main project under src/main/resources/filesToUpload
- The main project already has all the necessary folders created
- Run the main project with the required args (input, JARs, template location in src/main/resources, etc.)
- Verify the bucket is created and the necessary files are uploaded to it
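The bucket-creation and upload step can be sketched with the google-cloud-storage client library; the bucket name and file path below are placeholders, and credentials are resolved from GOOGLE_APPLICATION_CREDENTIALS (running this requires a real GCP project, so it is illustrative only):

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.BucketInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of the main project's upload step. Bucket name and file
// paths are placeholders, not this project's actual values.
public class UploadSketch {
    public static void main(String[] args) throws Exception {
        // Credentials come from the GOOGLE_APPLICATION_CREDENTIALS key file
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Create the bucket that will hold inputs, JARs, and the template
        storage.create(BucketInfo.of("cloud-migration-demo-bucket"));

        // Upload one file; the real program would walk every folder
        // under src/main/resources/filesToUpload
        byte[] content = Files.readAllBytes(
                Paths.get("src/main/resources/filesToUpload/jars/conversion.jar"));
        BlobInfo blob = BlobInfo.newBuilder(
                "cloud-migration-demo-bucket", "jars/conversion.jar").build();
        storage.create(blob, content);
    }
}
```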
- Open Cloud Shell in GCP
- Create a YAML file and copy the template content into it
- Run the following commands to create the template and import the YAML definition into it (or instantiate a workflow directly from the file), and to delete the template when done:

gcloud dataproc workflow-templates create cloud-migration-demo-template
gcloud dataproc workflow-templates import cloud-migration-demo-template --source cloud-migration-demo.yaml
gcloud dataproc workflow-templates instantiate-from-file --file cloud-migration-demo.yaml
gcloud dataproc workflow-templates delete cloud-migration-demo-template
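The imported YAML describes the cluster to run on and the ordered series of Spark jobs. A minimal sketch of what cloud-migration-demo.yaml might contain (cluster name, zone, bucket, and JAR paths are placeholders, not this project's actual values):

```yaml
# Hypothetical sketch of cloud-migration-demo.yaml
placement:
  managedCluster:
    clusterName: cloud-migration-demo-cluster
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
jobs:
  - stepId: conversion
    sparkJob:
      mainJarFileUri: gs://cloud-migration-demo-bucket/jars/conversion.jar
  - stepId: preprocessing
    prerequisiteStepIds:
      - conversion
    sparkJob:
      mainJarFileUri: gs://cloud-migration-demo-bucket/jars/preprocessing.jar
  - stepId: merging
    prerequisiteStepIds:
      - preprocessing
    sparkJob:
      mainJarFileUri: gs://cloud-migration-demo-bucket/jars/merging.jar
  - stepId: model
    prerequisiteStepIds:
      - merging
    sparkJob:
      mainJarFileUri: gs://cloud-migration-demo-bucket/jars/model.jar
```

The prerequisiteStepIds entries make the jobs run as a chain, so each Spark job starts only after the previous stage has finished.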