Every so often, Microsoft quietly updates an existing docs page and sneaks in a feature that solves a problem you didn’t even realise was fixable. The MaxUnavailable fallback is one of those. I picked it up through the AKS Docs Tracker I run here on pixelrobots.co.uk, which flagged a change to the rolling upgrade documentation. When I dug into what had changed, it immediately made me think of all the times I’ve seen node pool upgrades stall or fail—not because of anything you did wrong, but because Azure just couldn’t give you the extra surge capacity you asked for.
If you’ve ever sat watching an upgrade hang, wondering if you have enough VM quota, regional capacity, or subnet IPs, you’ll know the pain. Surge upgrades are great—until they aren’t. And when they aren’t, you’re left with a binary choice. Either the upgrade works with your surge value, or it doesn’t happen at all. That’s not a fun place to be, especially in production.
The MaxUnavailable fallback changes that story. It gives AKS a smarter, more flexible upgrade path that can adapt to what Azure can actually provide at the time. In this post, I’ll break down what it is, why it matters, and how you can try it out for yourself.
Why this matters
Most production AKS clusters I see are set up to use surge upgrades. Here’s a typical command, setting the surge value to 33%, which is a good balance between speed and resource usage.
    az aks nodepool update \
      --resource-group rg-prod \
      --cluster-name aks-prod \
      --name userpool \
      --max-surge 33%
This works well until it doesn’t. If you don’t have enough quota, regional VM capacity, or free IPs in your subnet, the upgrade stalls. And when that happens you’re stuck. Either Azure can provision the surge nodes or the upgrade doesn’t proceed.
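To put a number on that, here’s a rough sketch of how many extra nodes a percentage surge asks Azure for. AKS rounds a percentage up to the next whole node. The helper name surge_nodes and the pool sizes are made up for illustration.

```shell
#!/bin/sh
# Rough estimate of the extra nodes a percentage max-surge requests.
# surge_nodes is an illustrative helper, not part of any CLI.
surge_nodes() {
  node_count=$1   # nodes currently in the pool
  surge_pct=$2    # max-surge as a whole-number percentage, e.g. 33
  # Integer ceiling of node_count * surge_pct / 100
  echo $(( (node_count * surge_pct + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
surge_nodes 3 33    # prints 1
```

So a 10-node pool at 33% needs quota, capacity, and IPs for 4 extra VMs before anything gets drained.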
The MaxUnavailable fallback changes that. Instead of treating surge as an all-or-nothing requirement, AKS can now try your preferred surge value, fall back to a single surge node if that’s not possible, and if even that fails, fall back to an in-place upgrade using maxUnavailable. You get a more resilient upgrade path without having to decide the strategy in advance.
How the MaxUnavailable fallback works
You enable it by setting both maxSurge and maxUnavailable on your node pool. When both values are greater than zero, AKS follows this fallback strategy during upgrades:
- Try your configured maxSurge value first.
- If that’s not possible due to quota or capacity constraints, try a surge of just one node. This step only applies to agent pools running Kubernetes 1.35 or later.
- If that also fails, fall back to an in-place upgrade using maxUnavailable, cordoning and draining existing nodes without adding new ones.
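The order above can be sketched as a small decision function. This is only a toy model of the behaviour described here, not how AKS implements it. The function name, its arguments, and the idea of knowing in advance how many extra nodes Azure can provide are all assumptions for illustration.

```shell
#!/bin/sh
# Toy model of the fallback order described in the post. Not AKS code.
# pick_upgrade_path <surge nodes wanted> <extra nodes Azure can give>
#                   <Kubernetes minor version> <maxUnavailable>
pick_upgrade_path() {
  wanted=$1; available=$2; minor=$3; max_unavailable=$4
  if [ "$available" -ge "$wanted" ]; then
    echo "surge:$wanted"                 # preferred path: full surge
  elif [ "$minor" -ge 35 ] && [ "$available" -ge 1 ]; then
    echo "surge:1"                       # single surge node (1.35+ only)
  elif [ "$max_unavailable" -gt 0 ]; then
    echo "in-place:$max_unavailable"     # cordon and drain, no new nodes
  else
    echo "blocked"                       # no fallback configured
  fi
}

pick_upgrade_path 4 4 35 1   # prints surge:4
pick_upgrade_path 4 2 35 1   # prints surge:1
pick_upgrade_path 4 0 35 1   # prints in-place:1
```

The last branch is the old all-or-nothing behaviour: with no maxUnavailable set, a failed surge has nowhere to go.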
You’ll need the aks-preview Azure CLI extension and Azure CLI 2.34.1 or later. If you haven’t got the extension already, install and update it before you start:
    az extension add --name aks-preview
    az extension update --name aks-preview
Make sure your control plane is already on the target Kubernetes version before touching node pools. You can’t upgrade a node pool to a version higher than the control plane.
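If you want to script that check, a version comparison is enough. Here’s a minimal sketch, assuming a sort that supports -V (GNU coreutils does). The helper name version_le and the hard-coded versions are just for illustration.

```shell
#!/bin/sh
# version_le A B succeeds when version A is lower than or equal to B.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example values; in practice the control plane version could come from:
#   az aks show --resource-group rg-prod --name aks-prod \
#     --query kubernetesVersion --output tsv
control_plane="1.31.5"
target="1.31.5"

if version_le "$target" "$control_plane"; then
  echo "ok: node pool can move to $target"
else
  echo "upgrade the control plane to $target first"
fi
```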
Here’s the command to configure the MaxUnavailable fallback on an existing node pool:
    az aks nodepool update \
      --resource-group rg-prod \
      --cluster-name aks-prod \
      --name userpool \
      --max-surge 33% \
      --max-unavailable 1
Once that’s done, verify the settings applied:
    az aks nodepool show \
      --resource-group rg-prod \
      --cluster-name aks-prod \
      --name userpool \
      --query upgradeSettings
You should see both maxSurge and maxUnavailable returned. If not, double-check your Azure CLI version and that the aks-preview extension is installed and up to date.
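If you’d rather have a script tell you, here’s a small sketch that applies the "both values greater than zero" rule to the two settings. The helper names are made up, and the values are hard-coded so the snippet runs on its own. The az query in the comment is one way you might fetch them, not something I’ve verified against the preview output.

```shell
#!/bin/sh
# is_positive succeeds for "1", "33%" and similar; fails for "", "0", "None".
is_positive() {
  value=$(printf '%s' "$1" | tr -d '%')   # drop the % sign if present
  case $value in
    ''|*[!0-9]*) return 1 ;;              # empty or not a whole number
  esac
  [ "$value" -gt 0 ]
}

# fallback_enabled <maxSurge> <maxUnavailable>
fallback_enabled() {
  is_positive "$1" && is_positive "$2"
}

# Example values; real ones could be read with, for instance:
#   az aks nodepool show --resource-group rg-prod --cluster-name aks-prod \
#     --name userpool --query upgradeSettings.maxSurge --output tsv
if fallback_enabled "33%" "1"; then
  echo "fallback is configured"
else
  echo "fallback is NOT configured"
fi
```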
Running an upgrade
You can also pass both values directly in the upgrade command. The settings will be saved on the node pool and used for future upgrades too, so this isn’t just a one-off override. The following command upgrades to a specific Kubernetes version with the fallback strategy in place:
    az aks nodepool upgrade \
      --resource-group rg-prod \
      --cluster-name aks-prod \
      --name userpool \
      --kubernetes-version 1.31.5 \
      --max-surge 33% \
      --max-unavailable 1
For a node image upgrade only, add --node-image-only and drop the --kubernetes-version flag. Everything else stays the same.
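To make the difference between the two modes concrete, here’s a dry-run sketch that only prints the command it would run. The upgrade_cmd helper is hypothetical. The resource names and flag values are the same ones used throughout this post.

```shell
#!/bin/sh
# Print (do not run) the upgrade command for either mode.
# upgrade_cmd <version>  -> Kubernetes version upgrade
# upgrade_cmd            -> node image only upgrade
upgrade_cmd() {
  target=${1:-}
  if [ -n "$target" ]; then
    mode="--kubernetes-version $target"
  else
    mode="--node-image-only"
  fi
  echo "az aks nodepool upgrade --resource-group rg-prod --cluster-name aks-prod" \
       "--name userpool $mode --max-surge 33% --max-unavailable 1"
}

upgrade_cmd 1.31.5   # version upgrade
upgrade_cmd          # node image only
```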
Watching what happens
While the upgrade runs, filter for relevant events to see whether AKS is surging or falling back:
    # Comma-separated field selectors are ANDed, so filter the reasons with grep
    kubectl get events | grep -E 'Drain|Surge|Upgrade'
Keep an eye on node status in another terminal at the same time:
    kubectl get nodes -w
These two together give you a real-time view of whether AKS is using surge nodes or falling back to your maxUnavailable setting.
Things to watch out for
maxUnavailable behaves differently from surge and it’s worth being clear on this before you rely on the fallback. With surge, AKS provisions extra nodes before draining anything, so pods have somewhere to land. With maxUnavailable, AKS cordons existing nodes and evicts pods into a pool that’s already under pressure. That means your Pod Disruption Budgets are more likely to block the drain, because there’s simply less room for pods to be rescheduled. If you’ve got tight PDBs and a busy node pool, the fallback path can stall just as easily as a failed surge.
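A quick back-of-the-envelope calculation shows why. This sketch estimates average utilisation on the remaining nodes once some are cordoned, assuming load spreads evenly. That is a simplification: real scheduling works on pod requests, not measured usage. The function name and numbers are made up.

```shell
#!/bin/sh
# projected_util <nodes> <avg utilisation %> <nodes unavailable>
# Rough estimate: assumes load spreads evenly over the remaining nodes.
projected_util() {
  nodes=$1; util_pct=$2; unavailable=$3
  remaining=$(( nodes - unavailable ))
  if [ "$remaining" -le 0 ]; then
    echo "no nodes left to take the load" >&2
    return 1
  fi
  # Integer ceiling of (nodes * util_pct) / remaining
  echo $(( (nodes * util_pct + remaining - 1) / remaining ))
}

projected_util 5 70 1   # prints 88
projected_util 3 70 1   # prints 105, so the pods will not all fit
```

On a small pool, even one unavailable node can push the rest past 100%, which is exactly when evicted pods sit in Pending and PDBs start blocking the drain.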
It’s also worth knowing that you cannot use maxUnavailable on system node pools. This only applies to user node pools.
Before you test this, check whether your cluster has room to handle nodes being taken out of service. These three commands give you a quick picture:
    kubectl get pdb -A
    kubectl top nodes
    kubectl get nodes
- kubectl get pdb -A shows all Pod Disruption Budgets across namespaces. If any have ALLOWED DISRUPTIONS at zero, the drain will block.
- kubectl top nodes shows current CPU and memory usage per node. If nodes are already running hot, evicting pods from one will put pressure on the rest.
- kubectl get nodes confirms how many nodes you have and their current state, so you know whether there’s realistic capacity to absorb the disruption.
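On a cluster with lots of PDBs, scanning that output by eye gets old quickly. Here’s a minimal sketch that picks out the budgets currently allowing zero disruptions. The blocked_pdbs helper and the sample rows are made up. The kubectl command in the comment is one way to produce the three-column input it expects.

```shell
#!/bin/sh
# blocked_pdbs reads "namespace name allowed-disruptions" rows on stdin
# and prints the PDBs that currently allow zero disruptions.
blocked_pdbs() {
  awk 'NF >= 3 && $3 == "0" { print $1 "/" $2 }'
}

# Made-up sample rows. Against a live cluster the input could come from:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
printf '%s\n' \
  'shop api-pdb 0' \
  'shop web-pdb 2' \
  'data redis-pdb 0' | blocked_pdbs
# prints shop/api-pdb and data/redis-pdb, one per line
```

Anything this prints is a workload that will hold up a drain until more replicas are healthy or the budget is relaxed.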
Start with --max-surge 33% --max-unavailable 1 to keep surge as the preferred path and limit the fallback to one unavailable node at a time. Only increase maxUnavailable if you’ve tested it first. This is still a preview feature, so I’d recommend trying it in a non-production cluster before relying on it anywhere critical.
Wrapping up
I’ve seen enough AKS upgrades fail or drag on because of capacity issues to know how frustrating it can be. The MaxUnavailable fallback is a small but useful addition to the AKS upgrade story. It lets AKS try the normal surge upgrade path first, but if Azure can’t allocate the extra capacity, it falls back to a more controlled in-place upgrade using maxUnavailable.
If you’re running production clusters, or even just testing out new upgrade strategies, this is definitely worth a look. It gives you a more flexible upgrade path, and you don’t have to choose between surge and in-place upgrades up front. I’ll be testing it in a few of my own clusters and watching closely to see how it behaves in the real world. If you try it, let me know how it goes. I am always interested to hear real-world stories!