aws-ia/terraform-aws-eks-blueprints

Moving forward with v5 of EKS Blueprints

fcarta29 opened this issue Β· 33 comments

Community Note

  • Please vote on this issue by adding a πŸ‘ reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

This issue is being created to provide community notice and to help track the work and cutover to v5 of EKS Blueprints. More details of the changes can be found here:
Direction for v5 of EKS Blueprints
Motivation for v5 of EKS Blueprints

Do you have any ambitions when it comes to migrating from v4 to v5? It seems it will get complicated given the goals πŸ€”

That said, great initiative! We've been bothered by the same issues you describe and it seems like the correct way forward.

Edit: Did not get any direct answer, but we can see an UPGRADE-5.0.md committed to the repository now, so it would seem it's something they are thinking about quite a lot.

Any idea how long it will take for v5 to be production ready? And how long will you maintain / patch / update v4?

FAQ for v5

When will v5 be GA and ready for production?
EKS Blueprints are community driven examples of how to build on AWS EKS. Only AWS services like AWS EKS have formal release dates, are officially supported, and are certified as production ready.

How long will v4 be maintained and updated?
Once v5 reaches development/refactoring readiness and testing is complete, notice will be made to the community via this issue and on the top level README.md regarding when cross over from v4 to v5 will be merged. From this point v4 will be tagged/branched and left for those who desire to remain on v4 and for historical context. No further development/updates will occur for v4 and all future changes will only occur on the main v5 version.

What examples are being moved and where are they going?

@Hokwang for the moment we will keep using the existing IRSA module in the new addons repo.

This should’ve been communicated more clearly. I have just set up a new cluster using your control plane module.

This should’ve been communicated more clearly. I have just set up a new cluster using your control plane module.

@ManuelMueller1st Do you have suggestions on what that might have looked like - how we could have communicated the changes better?

@bryantbiggs do you have a target date for v5? Month or quarter?

This should’ve been communicated more clearly. I have just set up a new cluster using your control plane module.

@ManuelMueller1st Do you have suggestions on what that might have looked like - how we could have communicated the changes better?

A bullet-point list of major changes to the project on the first README page. "Potential Breaking Changes" could mean anything. I only realized that the cluster module was discontinued after reading the v5 article.

@bryantbiggs do you have a target date for v5? Month or quarter?

@Hokwang - We are at the final stages regarding changes and are now working through the testing and migration guides. Once those are close to completion we will communicate a planned v5 release date.

I'm about to use EKS Blueprints with Karpenter, and I'm aware that version 5 will have breaking changes. How can I start implementing it now to minimize refactoring efforts when upgrading to version 5?

How does this affect the CDK version of EKS Blueprints? It seems like a split-brain decision to redefine Terraform EKS Blueprints as a best-practices repo while keeping CDK/CloudFormation as a monolithic framework.

How does this affect the CDK version of EKS Blueprints? It seems like a split-brain decision to redefine Terraform EKS Blueprints as a best-practices repo while keeping CDK/CloudFormation as a monolithic framework.

@automagic Both projects have different customers and requirements which may diverge.

CDK v2 is a single framework and that is what customers preferred; it was also the main reason for the refactoring of CDK itself from v1 to v2. CDK EKS Blueprints follows the same pattern. The framework was designed to support an internal dev platform experience covering all aspects.

Since CDK EKS Blueprints follow the npm model of deployment, extensions of the blueprints (such as additional add-ons, cluster providers, resource providers, etc.) can be published outside of the main framework as npm modules in a fully decoupled way (e.g. Datadog).

If there is anything that you find inconvenient in the CDK version of EKS Blueprints, please share (ideally here).

FYI - we are tentatively targeting next month (May) to make the transition over to v5. At that time we will provide guidance documentation for users to migrate from v4 to v5

@bryantbiggs Do you know if the recommended migration path will require cluster rebuilds? My company is currently using v4 to build out a new managed K8s platform. We aren't running production workloads yet so if there isn't a way to migrate to v5 without down time we would prefer to bite the bullet and do rebuilds now by calling the terraform-aws-eks modules directly.

@bryantbiggs Do you know if the recommended migration path will require cluster rebuilds? My company is currently using v4 to build out a new managed K8s platform. We aren't running production workloads yet so if there isn't a way to migrate to v5 without down time we would prefer to bite the bullet and do rebuilds now by calling the terraform-aws-eks modules directly.

EKS Blueprints simply wraps terraform-aws-eks, so you can always do a series of terraform state moves to migrate cleanly without downtime. We've done that at my company, although we do nothing as complex as a managed k8s platform.
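
For illustration, a minimal sketch of what such a move might look like, assuming the root configuration previously called the v4 blueprints module as "eks_blueprints" and now calls terraform-aws-eks directly as "eks" (both module names and the internal address are assumptions here, so confirm the real addresses with terraform state list before running anything):

# Hypothetical example only: re-address the wrapped terraform-aws-eks module out of the
# v4 blueprints wrapper so the new direct module call adopts the existing cluster.
terraform state mv 'module.eks_blueprints.module.aws_eks' 'module.eks'

Running terraform plan afterwards should show no replacement of the cluster; if it does, the addresses need adjusting first.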

Issues pop up with the addons though. They are mostly just thin wrappers around existing Helm charts, so they shouldn't be too hard to maintain. I think the idea is that you should use Argo CD/GitOps to manage much of what is handled by addons today instead of using Terraform for them. Migrating them from Terraform to Argo CD is also doable without downtime, and probably good practice so you get comfortable with a bit more advanced transition work. Just make sure Argo CD does not prune πŸ™ˆ

But given your circumstance the argocd stuff might not be relevant :D

@bryantbiggs Do you know if the recommended migration path will require cluster rebuilds? My company is currently using v4 to build out a new managed K8s platform. We aren't running production workloads yet so if there isn't a way to migrate to v5 without down time we would prefer to bite the bullet and do rebuilds now by calling the terraform-aws-eks modules directly.

No - you should be able to modify the Terraform state to move from v4 to v5 and maintain the current control plane without a recreation

The link should be updated:

Direction for v5 of EKS Blueprints

@dmonagha Thanks!

Updated link to Motivation for v5 of EKS Blueprints

@bryantbiggs Do you know if the recommended migration path will require cluster rebuilds? My company is currently using v4 to build out a new managed K8s platform. We aren't running production workloads yet so if there isn't a way to migrate to v5 without down time we would prefer to bite the bullet and do rebuilds now by calling the terraform-aws-eks modules directly.

No - you should be able to modify the Terraform state to move from v4 to v5 and maintain the current control plane without a recreation

@bryantbiggs I think a markdown doc should be created walking through migrating a basic v4 cluster with a couple of addons updated to v5. This is a very popular tool for deploying EKS, and a lot of customers will need this information. I'm thinking these Terraform state modifications will be quite complex.

excellent! thanks!

The team is still working on the guidance for migrating addons and teams which is why those doc pages are currently empty

Starting assumption: most infrastructure people want to abstract the underlying cloud from the resources running inside the cluster. Security people want the scope of Kubernetes permissions to be a subset of cloud permissions and want to reduce access to privileged accounts at run time. This means they want to orchestrate the creation of AWS resources outside of the cluster and not make use of something like ACK (AWS Controllers for Kubernetes). So this generally leads to creating resources in a pipeline using something like Terraform that uses an elevated user at install time.

With these assumptions, I don't think that the current motivation clearly outlines the following:

  • An add-on's value is to create the prerequisite AWS resources required by a Helm chart using Terraform, and then to pass references to those resources and any other AWS-specific configuration further down the chain (see the sketch after this list)
  • Add-ons should probably aggregate most of their configuration and ideally be batched into 2 or 3 waves using app-of-apps umbrella Helm charts that create native Kubernetes CRD resources such as ArgoCD Applications. I think the mistake is that currently most people focus on add-ons as defining which Helm chart to use and how to deploy it; ideally we would take that out of the add-on concept. The only drawback I can think of is not knowing which Helm chart configuration schema (values.yaml file) was used when forwarding the configuration downstream (app-of-apps -> wrapper Helm chart -> Helm chart). Individual interactions with Kubernetes via Terraform resource providers should be avoided as much as possible! Writing your own kube-wait-for-resource / kube-wait-for-resource-to-be-gone Terraform provider doesn't work either. Even inside Kubernetes, multiple waves might be required because of buggy or badly written resources that downstream resources depend on, e.g. CSI drivers, or the order in which CRDs must be deleted.
  • Blueprints should just be an example of how to organise a group of add-ons, not a generic list of all add-ons that is feature-togglable. At the moment the motivation makes this point too subtly for most users to pick up, and we will see it repeated by other users spinning off GitHub projects that offer the same kind of aggregated Terraform module.
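
To make the first bullet concrete, here is a minimal, hypothetical sketch of that pattern: Terraform creates only the AWS-side prerequisite (an IRSA role for external-dns in this example, via the community terraform-aws-modules IAM submodule) and hands the resulting references downstream to whatever installs the chart (app-of-apps, wrapper chart, etc.), instead of managing a helm_release from Terraform. All names, versions and variables below are illustrative assumptions, not part of the blueprints themselves.

module "external_dns_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name                     = "external-dns"
  attach_external_dns_policy    = true
  external_dns_hosted_zone_arns = var.route53_zone_arns

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["external-dns:external-dns"]
    }
  }
}

# Expose only the AWS-specific values; the GitOps layer (app-of-apps -> wrapper chart -> chart)
# consumes them, so Terraform never has to talk to the Kubernetes API for this add-on.
output "external_dns_values" {
  value = {
    serviceAccountRoleArn = module.external_dns_irsa.iam_role_arn
    txtOwnerId            = var.cluster_name
  }
}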


https://aws-ia.github.io/terraform-aws-eks-blueprints/main/v4-to-v5/cluster/

@bryantbiggs Is this the migration guide?

I started migrating the addons from the v4 repo to the new one (the EKS cluster is already set up the new way, using the eks module).
Is there a way to migrate the addons one by one via state migration?
I have to keep the old blueprints module because, e.g., Argo CD won't be supported in v5.

One add-on (aws_load_balancer_controller) I already migrated by first deleting it in the old module and then adding it in a second Terraform apply in the new module:

// deprecated
module "eks_blueprints_kubernetes_addons" {
  source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/kubernetes-addons?ref=v4.32.1"
...

  enable_argocd = var.argocd_enabled
  //argo config
  ...
 
  // some addons I want to migrate to the v5 repo, e.g.
  enable_amazon_eks_aws_ebs_csi_driver = true
  enable_aws_for_fluentbit                 = var.enable_aws_for_fluentbit
  aws_for_fluentbit_cw_log_group_retention = var.aws_for_fluentbit_cw_log_group_retention
  enable_external_dns            = true
  external_dns_route53_zone_arns = var.dns_extra_zones
  external_dns_helm_config = {
    values = [jsonencode(yamldecode(<<-EOT
      txtOwnerId: ${local.name}
      zoneIdFilters: ${local.zoneIdFilters}
      policy: 'sync'
      aws:
        zoneType: 'public'
        zonesCacheDuration: '1h'
      logLevel: 'debug'
      EOT
    ))]
  }
...
  
}

//new addons
module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"
...

  enable_aws_load_balancer_controller      = true
  aws_load_balancer_controller = {
    create_namespace = true
    namespace        = "lb-controller"
    values = [jsonencode(yamldecode(<<-EOT
      clusterName: ${local.name}
      EOT
    ))]
  }
}
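
For anyone who would rather avoid the delete-and-recreate step above, moving the state of each add-on one by one should also be possible. The addresses below are only a guess at the internal module structure and will differ between versions, so treat this as a sketch to adapt, not a verified migration:

# Find the existing Helm release of the add-on in the v4 module's state
terraform state list | grep load_balancer_controller

# Hypothetical move of that single release into the new addons module
terraform state mv \
  'module.eks_blueprints_kubernetes_addons.module.aws_load_balancer_controller[0].module.helm_addon.helm_release.addon[0]' \
  'module.eks_blueprints_addons.module.aws_load_balancer_controller.helm_release.this[0]'

Chart versions and default values differ between v4 and v5, so expect an in-place upgrade of the release on the next apply rather than a clean no-op plan.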
nmindz commented

This should’ve been communicated more clearly. I have just set up a new cluster using your control plane module.

@ManuelMueller1st Do you have suggestions on what that might have looked like - how we could have communicated the changes better?

At least a NOTICE at the top of the README with a link to the issue/etc regarding the new v5.0 should have been added. A nice "things to consider before you go ahead using things as they are".

Adding a notice to the release tag was good, but not enough as one may simply follow the README instructions and use the module without ever visiting the releases page.

I also happened to have just implemented a new EKS model repository that depended on this module.

@nmindz we had a notice on the main README for several months

It was only recently removed when we unveiled the changes for v5

nmindz commented

I also began work with that module back in February, but I didn't realize the README had changed since then, and just today I was trying to figure out the options for the AWS Load Balancer Controller plugin since that was not deployed by default in our internal template.

I've been getting mixed search results (such as the missing aws_load_balancer_controller_helm_config option in the ~> 1.0 version, which I assume is v5) and guides/references that still point to that old option.

When I visited the v4 branch just now I realized the notice was there, but has recently been removed from the main branch, which was the one I checked before commenting, so first and foremost please excuse my lack of attention when replying today.

I know the addons/plugins guides are still WIP, but are there any pointers/common naming/config schemes already defined for v5 or are they still subject to change? (see snippets below)

I got some indications for the external_secrets deployment from another issue if I recall correctly, and made it so I could run it on Fargate:

  enable_external_secrets = true
  external_secrets = {
    namespace = "external-secrets",
    values = [yamlencode({
      "webhook" : { "port" = "9443" },
      "tolerations" : [{ "key" : "eks.amazonaws.com/compute-type", "operator" : "Equal", "value" : "fargate", "effect" : "NoSchedule" }]
    })]
  }

And re-reading @frank-bee's comment I also realized he had a similar config for AWS Load Balancer:

  enable_aws_load_balancer_controller = true
  aws_load_balancer_controller = {
    create_namespace = true
    namespace        = "lb-controller"
    values = [jsonencode(yamldecode(<<-EOT
      clusterName: ${local.name}
      EOT
    ))]
  }

Also, I'm not sure if I understood correctly, but according to @fcarta29's FAQ comment only examples are being moved away from this repository, is that correct?

in v5, the Terraform modules are removed from this repository and only examples/blueprints will remain. You can see the new home of addons here https://github.com/aws-ia/terraform-aws-eks-blueprints-addons which has its own documentation https://aws-ia.github.io/terraform-aws-eks-blueprints-addons/main/

The new addons are 1.x and stable for use today

It would be good to get some examples of migration scripts. While the full paths of the resources and data in the Terraform state will obviously differ a lot between users' individual deployments, overall there will be significant similarity in the paths inside the modules that we need to migrate from v4 to v5 with the new external addons/modules.

So ideally we should get documentation that outlines what we need to run to migrate the Terraform state; otherwise everyone will have to individually work out how to perform dozens of commands like the ones below, and a lot of duplicate work could be saved if this was documented as part of the migration to v5.

terraform state mv "module.${v4_module_name}.module.${example_eks_blueprints_v4_component}.example_resource_one" "module.${v5_module_name}.module.${example_eks_blueprints_v5_separated_component}.example_resource_one"
terraform state mv "module.${v4_module_name}.module.${example_eks_blueprints_v4_component}.example_resource_two" "module.${v5_module_name}.module.${example_eks_blueprints_v5_separated_component}.example_resource_two"
terraform state mv "module.${v4_module_name}.module.${example_eks_blueprints_v4_component}.example_resource_three" "module.${v5_module_name}.module.${example_eks_blueprints_v5_separated_component}.example_resource_three"
terraform state mv "module.${v4_module_name}.module.${example_eks_blueprints_v4_component}.module.${example_submodule}.example_subresource_one" "module.${v5_module_name}.module.${example_eks_blueprints_v5_separated_component}.module.${example_submodule_with_a_new_name}.example_subresource_one"

The v4 and v5 structures are known and it should be possible to provide some level of assistance to the many users who are currently stuck trying to build a huge list of terraform state mv commands and implement migration plans for the many subcomponents.
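
As a rough sketch of how such a migration script could be assembled once the old-to-new address mapping is documented (the file name and mapping format here are made up for illustration only):

# Capture the current addresses so the documented mapping can be checked against reality
terraform state list > state-v4.txt

# address-map.csv: one "old_address,new_address" pair per line, built from the documented
# v4 -> v5 module structures
while IFS=, read -r old new; do
  terraform state mv "$old" "$new"
done < address-map.csv

Running terraform plan between batches of moves would catch any incorrectly mapped address before it turns into a destroy/create.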

Edit: (in case someone just says to replace the module and reapply our own settings for each addon the new way, following the documentation...)
As nice as it is having Terraform-managed, idempotent infrastructure that we can easily tear down and replace, it shouldn't be considered acceptable to tear down a whole production Kubernetes cluster just to clean up this v4 to v5 process, when the individual components we used v4 to install may have only changed by a patch version or not at all. Tearing down a cluster can involve a lot of annoying downtime that has to be scheduled and planned for, and is generally speaking a lot of non-productive work that can be avoided if we have documentation of the kind I mentioned.

the changes for v5 are now complete - please see our docs section on v4 to v5 for details on the motivation, context, and migration paths https://aws-ia.github.io/terraform-aws-eks-blueprints/main/v4-to-v5/motivation/

@bryantbiggs Does this mean v5 work is 100% complete and ready to use?

yes, the v5 approach has been available for some time now - we left this open for questions/concerns while updating docs/messaging/etc.