This reference implementation revisits some design decisions from the baseline in more detail and incorporates new recommended infrastructure options for a multi-cluster architecture. The implementation and this document are meant to guide the distinct teams introduced in the AKS Baseline through the process of expanding from a single cluster to a multi-cluster solution, with one fundamental driver in mind: reliability via the Geode cloud design pattern.
Throughout the reference implementation, you will see references to Contoso Bicycle. They are a fictional, small, and fast-growing startup that provides online web services to its clientele on the east coast of the United States. This narrative provides grounding for some implementation details, naming conventions, etc. You should adapt it as you see fit.
🎓 Foundational Understanding |
---|
If you haven't familiarized yourself with the general-purpose AKS baseline cluster architecture, you should start there before continuing here. The architecture rationalized and constructed in that implementation is the direct foundation of this body of work. This reference implementation avoids rearticulating points that are already addressed in the AKS baseline cluster. |
The Contoso Bicycle app team that owns the `a0042` workload app is planning to deploy an AKS cluster strategically located in the `East US 2` region, as this is where most of their customer base can be found. They will operate this single AKS cluster following Microsoft's recommended baseline architecture.
AKS Baseline clusters are meant to be available from different zones within the same region. But the team now realizes that if `East US 2` went fully down, zone coverage would not be sufficient. Even though the SLAs are acceptable for their business continuity plan, they are starting to consider their options and how their stateless application (Application ID: a0042) could increase its availability in case of a complete regional outage. They started conversations with the business unit (BU0001) to increase the number of clusters by one. In other words, they are proposing to move to a multi-cluster infrastructure solution in which multiple instances of the same application could live.
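As a rough illustration of why a second region helps, the following sketch computes the composite availability of several regional deployments of the same stateless application. The per-region availability of 99.9% is a hypothetical figure, not an SLA quoted in this document, and the calculation assumes regional failures are independent and ignores the availability of the global routing layer itself.

```python
def composite_availability(region_availability: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    if not 0.0 <= region_availability <= 1.0:
        raise ValueError("region_availability must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # The application is down only when every region is down at the same time.
    return 1.0 - (1.0 - region_availability) ** regions


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    assumed_region_availability = 0.999  # hypothetical value for illustration
    for count in (1, 2):
        total = composite_availability(assumed_region_availability, count)
        minutes = downtime_minutes_per_year(total)
        print(f"{count} region(s): {total:.6%} available, ~{minutes:.1f} min/year down")
```

Real-world figures will differ, because regional failures are not perfectly independent and the global load balancer has its own availability, but the shape of the improvement is what motivates the move to two clusters.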
This architectural decision will have multiple implications for the Contoso Bicycle organization. It is not just about following the baseline twice and changing the region to get a twin infrastructure. They also need to determine how they can efficiently share Azure resources and identify those that need to be added; how they are going to deploy and operate more than one cluster; which specific regions to deploy to; and many more considerations in striving for higher availability.
This project has a companion set of articles that describe challenges, design patterns, and best practices for an AKS multi-cluster solution designed to be deployed in multiple regions for high availability. You can find these articles on the Azure Architecture Center at Azure Kubernetes Service (AKS) Baseline Cluster for Multi-Region deployments. If you haven't reviewed them, we suggest you read them, as they will give added context to the considerations applied in this implementation. Ultimately, this is the direct implementation of that specific architectural guidance.
🚧 | The article series mentioned above has not yet been published. |
---|
This architecture focuses on infrastructure more than on the workload. It concentrates on two AKS clusters, including concerns like multi-region deployments, the desired state of the clusters, geo-replication, network topologies, and more.
The implementation presented here, like the baseline, is the minimum recommended starting point (baseline) for a multiple AKS cluster solution. It integrates with Azure services that deliver geo-replication, a centralized observability approach, a network topology that supports multi-regional growth, and the added benefit of traffic balancing.
Finally, this implementation uses the ASP.NET Docker samples as an example workload. This workload is purposefully uninteresting, as it is here exclusively to help you experience the baseline infrastructure.
- Azure Kubernetes Service (AKS) v1.20
- Azure Virtual Networks (hub-spoke)
- Azure Front Door
- Azure Application Gateway (WAF)
- Azure Container Registry
- Azure Monitor Log Analytics
- Flux GitOps Operator
- Traefik Ingress Controller
- Azure AD Pod Identity
- Azure Key Vault Secrets Store CSI Provider
- Kured
🚧 | Diagram below does NOT accurately reflect this architecture. Update Pending. |
---|
- Begin by ensuring you install and meet the prerequisites
- Plan your Azure Active Directory integration
- Build the hub-spoke network
- Procure client-facing and AKS Ingress Controller TLS certificates
- Deploy the shared services for your clusters
- Deploy the two AKS clusters and supporting services
- Just like the cluster, there are workload prerequisites to address
- Configure AKS Ingress Controller with Azure Key Vault integration
- Deploy the workload
- Perform end-to-end deployment validation
- Clean up all resources
The main costs of the current reference implementation are, in order:
- Azure Firewall, dedicated to controlling outbound traffic: ~35%
- Node pool virtual machines used inside the cluster: ~30%
- Application Gateway, which controls the ingress traffic to the private virtual network: ~15%
- Log Analytics: ~10%
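To make these proportions concrete, here is a minimal sketch that applies the approximate shares listed above to a monthly total. The total used in the example is a hypothetical number, not a measured cost of this implementation, and the "Other" bucket is simply the remainder not covered by the four listed items.

```python
from typing import Dict

# Approximate shares taken from the list above; they do not sum to 100%.
COST_SHARES: Dict[str, float] = {
    "Azure Firewall": 0.35,
    "Node pool virtual machines": 0.30,
    "Application Gateway": 0.15,
    "Log Analytics": 0.10,
}


def estimate_breakdown(total_monthly_cost: float) -> Dict[str, float]:
    """Split a monthly total across the main cost drivers."""
    if total_monthly_cost < 0:
        raise ValueError("total_monthly_cost must not be negative")
    breakdown = {
        name: round(total_monthly_cost * share, 2)
        for name, share in COST_SHARES.items()
    }
    # Whatever the four listed items do not account for.
    breakdown["Other"] = round(total_monthly_cost - sum(breakdown.values()), 2)
    return breakdown


if __name__ == "__main__":
    hypothetical_total = 1000.0  # assumed figure, for illustration only
    for name, cost in estimate_breakdown(hypothetical_total).items():
        print(f"{name:<28} {cost:>10.2f}")
```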
Azure Firewall can be a shared resource; your company may already have one that you can reuse. Removing Azure Firewall is not recommended, but if you need to reduce cost you can delete it and accept the risk of uncontrolled outbound traffic.
The virtual machines in the AKS cluster are required, but the cluster can be shared by several applications. You can still analyze the size and number of nodes. The reference implementation uses the minimum recommended number of nodes for production environments, but in a multi-cluster environment with at least two clusters you may choose different numbers based on your traffic analysis, failover strategy, and autoscaling configuration.
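One way to reason about node counts under a failover strategy is sketched below. It assumes an active-active setup in which the surviving clusters must absorb the full peak load when some clusters are lost. The function name, the default floor of three nodes, and the example inputs are all hypothetical; substitute the values from your own traffic analysis and autoscaling limits.

```python
import math


def nodes_per_cluster(
    peak_nodes_needed: int,
    clusters: int,
    tolerated_failures: int = 1,
    min_nodes: int = 3,  # assumed floor, not a value taken from this implementation
) -> int:
    """Nodes each cluster needs so the survivors can carry the peak load."""
    surviving = clusters - tolerated_failures
    if surviving < 1:
        raise ValueError("at least one cluster must survive the tolerated failures")
    # Spread the peak capacity across the clusters that remain available.
    return max(min_nodes, math.ceil(peak_nodes_needed / surviving))


if __name__ == "__main__":
    # Two clusters, one regional outage tolerated: each cluster must be able
    # to scale to the full peak capacity on its own.
    print(nodes_per_cluster(peak_nodes_needed=6, clusters=2))
    # Three clusters, one outage tolerated: the load is split across two survivors.
    print(nodes_per_cluster(peak_nodes_needed=6, clusters=3))
```

The result is better read as an upper bound for the cluster autoscaler than as a fixed node count, since each cluster only needs that capacity during a failover.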
Keep an eye on Log Analytics as time goes by and manage the information that is collected. The main cost is data ingestion into the Log Analytics workspace, which you can fine-tune.
WAF protection is enabled on both Application Gateway and Azure Front Door. The WAF rules on Azure Front Door incur extra cost, and you can disable them. The consequence is that invalid traffic will reach Application Gateway and consume resources there instead of being dropped as early as possible.
While this reference implementation tends to avoid preview features of AKS to ensure you have the best customer support experience, there are some features you may wish to evaluate in pre-production clusters that augment your posture around security, manageability, etc. Consider trying out and providing feedback on the following. As these features come out of preview, this reference implementation may be updated to incorporate them.
- Preview features coming from the AKS Secure Baseline
- Currently, the Azure Kubernetes Service (AKS) for Multi-Region Deployment does not directly implement any preview features
This reference implementation intentionally does not cover all scenarios. If you are looking for other topics that are not addressed here, please visit AKS Secure Baseline for the complete list of covered scenarios around AKS.
- Azure Kubernetes Service Documentation
- Microsoft Azure Well-Architected Framework
- Microservices architecture on AKS
Please see our contributor guide.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
With ❤️ from Microsoft Patterns & Practices, Azure Architecture Center.