aws-solutions/network-orchestration-for-aws-transit-gateway

STNO Portal Shows only 1 CIDR

Feature request?

We have been using STNO for some time now and it's awesome, but only now have we noticed this behaviour.

STNO does all the glue from Spoke to Hub Accounts, most importantly:

  1. Create an Attachment between VPC-Spoke <> TGW-Hub
  2. Associate All Tagged SubNets to this Attachment
  3. HUB-Routing Table - add 1x Association
  4. HUB-Routing Table - add 1+ Propagations
  5. Add 1x CIDR Route on Spoke Account Routing Tables per SubNet, pointing to TGW-Hub

Behind the scenes, all CIDRs configured on the VPC get propagated to the routing tables that the Attachment is set to propagate into. *1

Within the STNO Portal we only see one CIDR to be approved (probably only the first VPC CIDR).
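To illustrate the gap, here is a minimal sketch of collecting every CIDR of a VPC rather than only the primary one. It assumes a VPC record shaped like one entry of the EC2 `DescribeVpcs` response (`CidrBlock`, `CidrBlockAssociationSet`, `CidrBlockState`); the helper name `vpc_cidrs` and the sample data are hypothetical, and this is not how STNO itself is implemented.

```python
def vpc_cidrs(vpc: dict) -> list[str]:
    """Return all currently associated IPv4 CIDRs of a DescribeVpcs-style record."""
    cidrs = []
    for assoc in vpc.get("CidrBlockAssociationSet", []):
        # Skip blocks that are still associating, failed, or already disassociated.
        if assoc.get("CidrBlockState", {}).get("State") == "associated":
            cidrs.append(assoc["CidrBlock"])
    # Fall back to the primary CIDR if no association set is present.
    if not cidrs and "CidrBlock" in vpc:
        cidrs.append(vpc["CidrBlock"])
    return cidrs


# Hypothetical sample: one primary CIDR, one secondary, one removed.
sample_vpc = {
    "VpcId": "vpc-EXAMPLE",
    "CidrBlock": "10.0.0.0/16",
    "CidrBlockAssociationSet": [
        {"CidrBlock": "10.0.0.0/16", "CidrBlockState": {"State": "associated"}},
        {"CidrBlock": "10.1.0.0/16", "CidrBlockState": {"State": "associated"}},
        {"CidrBlock": "10.9.0.0/16", "CidrBlockState": {"State": "disassociated"}},
    ],
}
all_cidrs = vpc_cidrs(sample_vpc)
```

With data like this, an approver would need to see both `10.0.0.0/16` and `10.1.0.0/16`, since both would be propagated.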

In our LandingZone environment we have been experiencing these symptoms:

  1. An Approver on the STNO Portal will not see all the CIDRs that get propagated into the Routing Tables as soon as they approve (it is important to validate the CIDRs in a LandingZone, because they need to be unique within the LZ, and we do have an IPAM to ensure this, as most companies do).
  2. If a Customer manually adds CIDRs to their VPC, a new CIDR gets pushed into the LZ without going through STNO. *1
  3. Given items 1 and 2, it has become obvious that the STNO db is not a single source of truth for all CIDRs.
  4. If TAGs are deleted by mistake on SubNets / VPCs, there is no Approval process controlling this, and all "glue" objects are deleted without Hub-Approval control or rollback.
  5. There is a downtime of 5 min + 5 min to apply a simple change-TAGs procedure. We need to delete the TAGs first, wait 5 min for STNO to finish, then add the TAGs again and wait another 5 min for STNO to finish. We found a way to manually configure the desired state on the Hub and then retag; this reduces the downtime to 5 sec + 5 sec, which is simply the time needed to reassociate the attachment with a new routing table.
  6. We detected issues when approving only a VPC on the STNO Portal without the SubNets tagged. No Attachment is created if we first approve a VPC without tagging the SubNets. Even if the SubNets are tagged afterwards, the attachment is still not created, and we need to restart the process.
  7. In a Spoke-Hub environment, Spoke and Hub are usually different accounts owned by different teams. Our Hub Team is unable to see VPC configurations related to SGs/NACLs and Routing Tables, which limits us and makes troubleshooting difficult.
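As an illustration of the uniqueness check mentioned in item 1, here is a minimal sketch using Python's standard `ipaddress` module. The function name `find_overlaps` and the sample ranges are hypothetical; a real check would read the allocated ranges from the IPAM.

```python
import ipaddress


def find_overlaps(candidate_cidrs, allocated_cidrs):
    """Return (candidate, allocated) pairs whose address ranges overlap."""
    conflicts = []
    for cand in candidate_cidrs:
        cand_net = ipaddress.ip_network(cand)
        for alloc in allocated_cidrs:
            if cand_net.overlaps(ipaddress.ip_network(alloc)):
                conflicts.append((cand, alloc))
    return conflicts


# Hypothetical data: CIDRs a Spoke VPC would propagate vs. ranges already in use.
conflicts = find_overlaps(
    ["10.1.0.0/16", "10.2.0.0/24"],
    ["10.0.0.0/16", "10.2.0.0/16"],
)
```

An approver can only run this kind of validation if every VPC CIDR is visible at approval time, which is the point of item 1.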

*1: This is in line with the public documentation here: https://docs.aws.amazon.com/vpc/latest/tgw/how-transit-gateways-work.html#tgw-route-propagation-overview , which states: “For a VPC attachment, the CIDR blocks of the VPC are propagated to the transit gateway route table.” (Notice the “s” in “CIDR blocks”.)

Suggestions:

  1. To address items 1, 2 and 3, it would probably be best to monitor VPC-CIDRs on Spoke accounts and add/remove CIDRs in the STNO db accordingly.
  2. To address item 4, we wouldn't go as far as preventing a Spoke from deleting the attachment; instead, it would make sense for the attachment to be owned by the HUB.
  3. To address item 5, a change-TAGs procedure could exist, or even a rollback.
  4. To address item 6, group the VPC and its SubNets in the STNO Portal, including all VPC-CIDRs, allowing a single initial approval. Approving more SubNets into this group should be possible after the group is first created.
  5. To address item 7, it would be a good idea to allow the HUB to share ownership of the VPCs, SubNets, SecurityGrps and NACLs, and maybe other related objects like prefix-lists. The HUB Team would probably need to choose between read-only and read-write for each object, and the Spoke Team would need to approve the chosen option.
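To make suggestion 1 more concrete, here is a minimal sketch of the comparison such a monitor could run: the CIDRs observed on a Spoke VPC against the CIDRs recorded as approved. The function name `cidr_drift`, the result keys, and the sample values are all hypothetical, and nothing here reflects the actual STNO db schema.

```python
def cidr_drift(observed, recorded):
    """Compare CIDRs seen on the VPC with CIDRs recorded as approved."""
    observed_set, recorded_set = set(observed), set(recorded)
    return {
        # On the VPC but never approved: would need a new approval request.
        "unapproved": sorted(observed_set - recorded_set),
        # Approved earlier but no longer on the VPC: candidates for cleanup.
        "stale": sorted(recorded_set - observed_set),
    }


# Hypothetical example: the customer added 10.1.0.0/16 and removed 10.8.0.0/16.
drift = cidr_drift(
    observed=["10.0.0.0/16", "10.1.0.0/16"],
    recorded=["10.0.0.0/16", "10.8.0.0/16"],
)
```

Anything in `unapproved` is a CIDR that reached the LZ without going through STNO, as described in item 2.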

Thank you in advance; any thoughts, feedback, or anything else that helps us would be very much appreciated.

Thanks and keep up the good work guys...

@Cupidazul Thanks for diving deep and sharing your feedback.

For Items 1, 2, 3 - We have added this new use case to our backlog and will review in the next release.

For Item 4, the use case is to protect route changes for TGW route tables that have approval required. We have added a backlog item to require approval if the tag is removed. However, we are not sure if we can add deletion protection. Instead, the permissions of the users/roles in the spoke account, or even SCPs, should be used to deny VPC attachment deletion.
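As a rough sketch of the SCP approach, the following builds a policy document that denies the `ec2:DeleteTransitGatewayVpcAttachment` action to every principal except those matching an exempt role pattern. The function name, the `Sid`, and the exempt role ARN are hypothetical placeholders; which role (if any) should be exempt depends on your deployment, so treat this as a starting point rather than a recommended policy.

```python
import json


def deny_attachment_deletion_scp(exempt_role_arn_pattern: str) -> dict:
    """Build an SCP-style document denying TGW VPC attachment deletion."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyTgwVpcAttachmentDeletion",  # hypothetical Sid
                "Effect": "Deny",
                "Action": ["ec2:DeleteTransitGatewayVpcAttachment"],
                "Resource": "*",
                # Assumption: one role pattern is allowed to keep deleting attachments.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": exempt_role_arn_pattern}
                },
            }
        ],
    }


# Hypothetical exempt role; replace with whatever automation role applies.
scp = deny_attachment_deletion_scp("arn:aws:iam::*:role/ExampleNetworkAutomationRole")
scp_json = json.dumps(scp, indent=2)
```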

For Item 5, as per design, you should be able to update the tags, and this will trigger the workflow to update the association or propagation. If the update is not working for you, please let us know and include the steps to reproduce the issue. It would be best to open a support case for this item if necessary.

For item 6, as per design, the attachment API requires at least one subnet ID, so tagging only the VPC can't create an attachment. To start the attachment workflow, you must tag the VPC first and then the subnets. We can consider improving the UI experience by hiding the Approve/Reject button for the VPC change item in the table to avoid confusion. To append new subnets to the attachment, each one should be approved individually.

For item 7, this is outside the scope of the use cases and will not be supported as a feature for STNO.

Thanks again for reaching out to us.