Ability for HB to create Dataflow on existing VPC (not default VPC)
gcpjaru opened this issue · 5 comments
Expected Behavior
HB should create the Dataflow job on the customer's existing (non-default) VPC.
Actual Behavior
HB creates the Dataflow job on the default VPC only; if the default VPC does not exist, the Dataflow job fails.
The customer does not use the default VPC due to security best practices.
Steps to Reproduce the Problem
- Create HB job
- Go to GCP --> Dataflow
- Look for Dataflow worker subnet
Specifications
- Version:
- Platform:
Thanks for reporting this. There could be a few scenarios here:
- Single default subnet
- Multiple VPCs exist, including the default
- Multiple VPCs, excluding the default
I think the ideal behaviour here would be to select the VPC and subnet from a list of available options (see the sketch below).
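For context, a minimal sketch of how the available subnetworks could be enumerated with the Compute Engine Go client so the user can pick one; the project and region values are placeholders, not actual Harbourbridge inputs:

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create compute service: %v", err)
	}

	// Placeholder values; in practice these would come from user input/flags.
	project := "my-project"
	region := "asia-southeast1"

	// List every subnetwork in the region so the user can choose one.
	list, err := svc.Subnetworks.List(project, region).Do()
	if err != nil {
		log.Fatalf("failed to list subnetworks: %v", err)
	}
	for _, s := range list.Items {
		fmt.Printf("network=%s subnetwork=%s\n", s.Network, s.Name)
	}
}
```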
Did a POC to launch a Dataflow job inside a VPC.
Next: Working on adding it as a feature.
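For illustration, a rough sketch of launching a Dataflow Flex Template job with a custom network/subnetwork through the Go client library. The project, region, template path, and job name below are placeholders, not Harbourbridge's actual values:

```go
package main

import (
	"context"
	"log"

	dataflow "google.golang.org/api/dataflow/v1b3"
)

func main() {
	ctx := context.Background()
	svc, err := dataflow.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create dataflow service: %v", err)
	}

	req := &dataflow.LaunchFlexTemplateRequest{
		LaunchParameter: &dataflow.LaunchFlexTemplateParameter{
			JobName:              "hb-migration-job",                           // placeholder
			ContainerSpecGcsPath: "gs://my-bucket/templates/my-template.json",  // placeholder
			Environment: &dataflow.FlexTemplateRuntimeEnvironment{
				// Relative paths work for subnets that live in the same project.
				Network:    "my-vpc",
				Subnetwork: "regions/asia-southeast1/subnetworks/my-subnet",
			},
		},
	}

	resp, err := svc.Projects.Locations.FlexTemplates.Launch("my-project", "asia-southeast1", req).Do()
	if err != nil {
		log.Fatalf("failed to launch dataflow job: %v", err)
	}
	log.Printf("launched job %s", resp.Job.Id)
}
```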
Tested the feature by running HB locally (not via gcloud); here are the results:
- Using custom VPCs/subnets in the same project, HB runs successfully.
- Using custom VPCs/subnets from a Shared VPC, it cannot run. This is not a permission issue; rather, it cannot find the network/subnet.
Error when launching Dataflow on Shared VPC subnets:
Failed to start the VM, launcher-2023030801105316106672755625837005, used for launching because of status code: INVALID_ARGUMENT, reason: Invalid Error: Message: Invalid value for field 'resource.networkInterfaces[0].subnetwork': 'regions/asia-southeast1/subnetworks/sg-subnet1'. The referenced subnetwork resource cannot be found. HTTP Code: 400.
Normally, using a Shared VPC requires providing the complete URL path:
https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
https://cloud.google.com/dataflow/docs/guides/specifying-networks
For running Dataflow in a Shared VPC, we will need to:
- Accept the host project ID as an input.
- Use the full URL instead of the relative path to specify `subnetwork`.
Will look into making these changes in Harbourbridge.
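A hedged sketch of the kind of helper this could use (the function and parameter names are illustrative, not the actual Harbourbridge implementation): when a host project ID is supplied (Shared VPC), build the full URL; otherwise keep the relative path.

```go
package dataflowutil

import "fmt"

// buildSubnetworkPath returns the value to pass to Dataflow's subnetwork field.
// When hostProjectID is non-empty (Shared VPC), Dataflow requires the full URL;
// otherwise the relative "regions/REGION/subnetworks/NAME" form is sufficient.
func buildSubnetworkPath(hostProjectID, region, subnetwork string) string {
	if hostProjectID != "" {
		return fmt.Sprintf(
			"https://www.googleapis.com/compute/v1/projects/%s/regions/%s/subnetworks/%s",
			hostProjectID, region, subnetwork)
	}
	return fmt.Sprintf("regions/%s/subnetworks/%s", region, subnetwork)
}
```

For example, `buildSubnetworkPath("host-project", "asia-southeast1", "sg-subnet1")` would produce the full URL form that the error above was missing.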
Tested and confirmed working properly.
Note: make sure you grant the "Compute Network User" role to the default Dataflow service account in the host project.
E.g. service-<service_project_number>@dataflow-service-producer-prod.iam.gserviceaccount.com
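For completeness, one way to add that binding programmatically with the Cloud Resource Manager Go client (the project and service account below are placeholders; granting the role via the console or gcloud works just as well):

```go
package main

import (
	"context"
	"log"

	crm "google.golang.org/api/cloudresourcemanager/v1"
)

func main() {
	ctx := context.Background()
	svc, err := crm.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create resource manager service: %v", err)
	}

	hostProject := "host-project-id" // Shared VPC host project (placeholder).
	member := "serviceAccount:service-123456789@dataflow-service-producer-prod.iam.gserviceaccount.com"

	// Read-modify-write the host project's IAM policy to add the binding.
	policy, err := svc.Projects.GetIamPolicy(hostProject, &crm.GetIamPolicyRequest{}).Do()
	if err != nil {
		log.Fatalf("failed to get IAM policy: %v", err)
	}
	policy.Bindings = append(policy.Bindings, &crm.Binding{
		Role:    "roles/compute.networkUser", // "Compute Network User"
		Members: []string{member},
	})
	if _, err := svc.Projects.SetIamPolicy(hostProject, &crm.SetIamPolicyRequest{Policy: policy}).Do(); err != nil {
		log.Fatalf("failed to set IAM policy: %v", err)
	}
}
```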