/aks-baseline-regulated

This is the Azure Kubernetes Service (AKS) Baseline Cluster for Regulated Workloads reference implementation as produced by the Microsoft Azure Architecture Center.

MIT LicenseMIT

Azure Kubernetes Service (AKS) Baseline Cluster for Regulated Workloads

This reference implementation demonstrates the recommended starting (baseline) infrastructure architecture for an AKS cluster that is under regulatory compliance requirements (such as PCI). This implementation builds directly upon the AKS Baseline Cluster reference implementation and adds to it additional implementation points that are more commonly seen in regulated environments vs typical "public cloud" consumption patterns.

🎓 Foundational Understanding
If you haven't familiarized yourself with the general-purpose AKS baseline cluster architecture, you should start there before continuing here. This architecture is constructed from the AKS baseline, which is the foundation for this body of work. This reference implementation avoids rearticulating points that are already addressed in the AKS baseline cluster.

Compliance

⚠️ These artifacts have not been certified in any official capacity; regulatory compliance is a shared responsibility between you and your hosting provider. This implementation is designed to aide you on your journey to achieving your compliance, but by itself does not ensure any level of compliance. To understand Azure compliance and shared responsibility models, visit the Microsoft Trust Center.

Azure and AKS are well positioned to give you the tools and allow you to build processes necessary to help you achieve a compliant hosting infrastructure. The implementation details can be complex, as is the overall process of compliance. We walk through the deployment here in a rather verbose method to help you understand each component of this architecture, teaching you about each layer and providing you with the knowledge necessary to apply it to your unique compliance scoped workload.

Even if you are not in a regulated environment, this infrastructure demonstrates an AKS cluster with a more heightened security posture over the general-purpose cluster presented in the AKS baseline. You might find it useful to take select concepts from here and apply it to your non-regulated workloads (at the tradeoff of added complexity and hosting costs).

Azure Architecture Center guidance

This project has a companion set of articles that describe challenges, design patterns, and best practices for a AKS cluster designed to host workloads that fall in PCI-DSS 3.2.1 scope. You can find this article on the Azure Architecture Center at Azure Kubernetes Service (AKS) regulated cluster for PCI-DSS 3.2.1. If you haven't reviewed it, we suggest you read it; as it will give added context to the considerations applied in this implementation. This repo primarly focuses on deployment concerns, while compliance concerns are mostly addressed in the linked article series above.

Architecture

This reference implementation is infrastructure focused, more so than workload. It concentrates on dealing with the AKS cluster itself. This implementation will touch on workload concerns, but does not contain end-to-end guidance on in-scope workload architecture, container security, or isolation. There are some good practices demonstrated and others talked about, but it is not exhaustive.

The implementation presented here is the minimum starting point for most AKS clusters falling into a compliance scope. This implementation integrates with Azure services that will deliver observability, provide a network topology that will support public traffic isolation, and keep the in-cluster traffic secure as well. This architecture should be considered your architectural starting point for pre-production and production stages of clusters hosting regulated workloads.

The material here is relatively dense. We strongly encourage you to start by reading the Azure Architecture Center guidance linked above and then dedicate at least four hours to walk through these instructions, with a mind to learning. You will not find any "one click" deployment here. However, once you've understood the components involved and identified the shared responsibilities between your team and your greater IT organization, it is encouraged that you build auditable deployment processes around your final infrastructure.

Finally, this implementation uses a small, custom application as an example workload. This workload is minimally interesting, as it is here exclusively to help you experience the infrastructure and illustrate network and security controls in place. The workload, and its deployment, does not represent any sort of "best practices" for regulated workloads.

Core architecture components

Azure platform

  • AKS v1.21
  • Azure Virtual Networks (hub-spoke)
    • Azure Firewall managed egress
    • Hub-proxied DNS
    • BYO Private DNS Zone for AKS with no public DNS representation
  • Azure Application Gateway (WAF - OWASP 3.2)
  • AKS-managed Internal Load Balancers
  • Azure Bastion for maintenance access
  • Private Link enabled Key Vault and Azure Container Registry
  • Private Azure Container Registry Task Runners

In-cluster Open-Source Software components

Network topology

Network diagram depicting a hub-spoke network with two peered VNets. The cluster spoke contains subnets for a jump box, the cluster, and other related services.

Workload HTTPS request flow

Network flow showing Internet traffic passing through Azure Application Gateway then into the Ingress Controller and then through the workload pods. All connections are TLS.

Deploy the reference implementation

A deployment of AKS-hosted workloads typically experiences a separation of duties and lifecycle management in the area of identity & security group management, the host network, the cluster infrastructure, and finally the workload itself. This reference implementation will have you be working across these various roles. Regulated environments require strong, documented separation of concerns; but ultimately you'll decide where each boundary should be.

Also, please remember the primary purpose of this body of work is to illustrate the topology and decisions made in this cluster. A guided, "step-by-step" flow will help you learn the pieces of the solution and give you insight into the relationship between them. A bedrock understanding of your infrastructure, its supply chain, and its "Day-2" workflows are critical for compliance concerns. If you cannot explain each decision point and rationalization, audit conversations can quickly turn uncomfortable.

Ultimately, lifecycle/SDLC management of your cluster, its dependencies, and your workloads will depend on your specific situation. You'll need to account for team roles, centralized & decentralized IT roles, organizational standards, industry expectations, and specific mandates by your compliance auditor.

Please start this learning journey in the Prepare the subscription section. If you follow this through the end, you'll have our recommended baseline cluster for regulated industries installed, with a sample workload running for you to reference in your own Azure subscription.

1. 🚀 Prepare the subscription

There are considerations that must be addressed before you start deploying your cluster. Do I have enough permissions in my subscription and AD tenant to do a deployment of this size? How much of this will be handled by my team directly vs having another team be responsible?

2. Build regional networking hub

This reference implementation is built on a traditional hub-spoke model, typical found in your organization's Connectivity subscription. The hub will contain Azure Firewall, DNS forwarder, and Azure Bastion services.

3. Plan Kubernetes API server access

Because the AKS server is a "private cluster" the control plane is not exposed to the internet. Management now can only be performed with network line of sight to the private endpoint exposed by AKS. In this case, you'll build a Azure Bastion-fronted jump box.

4. Deploy the cluster

Deploy the Azure resources that make up the primary runtime components of this architecture; the AKS cluster itself, jump box, Azure Container Registry, Azure Application Gateway, and Azure Key Vault.

and then bootstrap it

Bootstrapping your cluster should be seen as a direct immediate follow of deploying any cluster. This takes the raw AKS cluster and enrolls it in GitOps which will adds workload-agnostic baseline functionality (such as security agents).

5. Deploy your workload

A simple workload made up of four interconnected services is manually deployed across two namespaces to illustrate concepts such as nodepool placement, zero-trust network policies, and external infrastructure protections offered by the applied NSGs and Azure Firewall rules.

6. 🏁 Validation

Now that the cluster and the sample workload is deployed; now it's time to look at how the cluster is functioning.

7. 🧹 Clean up resources

Most of the Azure resources deployed in the prior steps will have ongoing billing impact unless removed.

Separation of duties

All workloads that find themselves in compliance scope usually require a documented separation of duties/concern implementation plan. Kubernetes poses an interesting challenge in that it involves a significant number of roles typically found across an IT organization. Networking, identity, SecOps, governance, workload teams, cluster operations, deployment pipelines, any many more. If you're looking for a starting point on how you might consider breaking up the roles that are adjacent to the AKS cluster, consider reviewing our Azure AD role guide shipped as part of this reference implementation.

📓 See Azure Architecture Center guidance for PCI-DSS 3.2.1 Requirement 7, 8, and 9 in AKS.

Is that all, what about … !?

Yes, there are concerns that do extend beyond what this implementation could reasonably demonstrate for a general audience. This reference implementation strived to be accessible for most people without putting undo burdens on the subscription brought to this walkthrough. This means SKU choices with relatively large default quotas, not using features that have very limited regional availability, not asking for learners to be overwhelmed with "Bring your own encryption key" options for services, and similar. All in hopes that more people can complete this walkthrough without disruption or excessive coordination with subscription or management group owners.

For your implementation, take this starting point and please add on additional security measures talked about throughout the walkthrough and the Azure Architecture Center guidance that were not directly implemented. For example, enable JIT and Conditional Access Policies, leverage Encryption-at-Host features if applicable to your workload, etc.

For a list of additional considerations for your architecture, please see our Additional Considerations document.

Cost

This reference implementation runs idle around $95 (US Dollars) per day within the first 30 days; and you can expect it to increase over time as some Security Center tooling has free-trial period and logs will continue to accrue. The largest contributors to the starting cost are Azure Firewall, the AKS nodepools (VM Scale Sets), and Log Analytics. While some costs are usually cluster operator costs, such as nodepool VMSS, log analytics, incremental Azure Defender costs; others will likely be amortized across multiple business units and/or applications, such as Azure Firewall.

Some customers will opt to amortize cluster costs across workloads by hosting a multi-tenant cluster within their organization, maximizing density with workload diversity. Doing so with regulated workloads is not advised. Regulated environments will generally prioritize compliance and security (isolation) over cost (diverse density).

Final thoughts

Kubernetes is a very flexible platform, giving infrastructure and application operators many choices to achieve their business and technology objectives. At points along your journey, you will need to consider when to take dependencies on Azure platform features, CNCF OSS solutions, ISV solutions, support channels, and what operational processes need to be in place. We encourage this reference implementation to be the place you start architectural conversations within your own team; adapting to your specific requirements, and ultimately delivering a solution that delights your customers and your auditors.

Related documentation

Contributions

Please see our contributor guide.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

With ❤️ from Microsoft Patterns & Practices, Azure Architecture Center.