
Every so often, Microsoft quietly updates an existing docs page and sneaks in a feature that solves a problem you didn’t even realise was fixable. The MaxUnavailable fallback is one of those. I picked it up through the AKS Docs Tracker I run here on pixelrobots.co.uk, which flagged a change to the rolling upgrade documentation. When I dug into what had changed, it immediately made me think of all the times I’ve seen node pool upgrades stall or fail, not because of anything you did wrong, but because Azure just couldn’t give you the extra surge capacity you asked for.

If you’ve ever sat watching an upgrade hang, wondering if you have enough VM quota, regional capacity, or subnet IPs, you’ll know the pain. Surge upgrades are great, until they aren’t. And when they aren’t, you’re left with a binary outcome: either the upgrade works with your surge value, or it doesn’t happen at all. That’s not a fun place to be, especially in production.

The MaxUnavailable fallback changes that story. It gives AKS a smarter, more flexible upgrade path that can adapt to what Azure can actually provide at the time. In this post, I’ll break down what it is, why it matters, and how you can try it out for yourself.

Why this matters

Most production AKS clusters I see are set up to use surge upgrades. Here’s a typical command, setting the surge value to 33%, which is a good balance between speed and resource usage for most workloads.
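A minimal sketch of such a command. The resource group, cluster, and node pool names (myResourceGroup, myAKSCluster, userpool) are placeholders I have assumed for illustration, so swap in your own:

```shell
# Placeholder names: replace myResourceGroup, myAKSCluster, and userpool.
# Sets the surge value used for upgrades of this node pool to 33%.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --max-surge 33%
```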

This works well until it doesn’t. If you don’t have enough quota, regional VM capacity, or free IPs in your subnet, the upgrade stalls. And when that happens you’re stuck. Either Azure can provision the surge nodes or the upgrade doesn’t proceed.
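To get a feel for how much extra capacity a percentage surge asks for, here is a small helper. It assumes fractional surge nodes are rounded up, which is how the AKS maxSurge documentation describes percentages. The function name surge_nodes is my own, not part of any tool:

```shell
# surge_nodes NODE_COUNT PERCENT
# Prints how many extra nodes a percentage-based surge needs.
# Assumption: fractional nodes are rounded up.
surge_nodes() {
  local count="$1" percent="$2"
  # Integer ceiling of count * percent / 100.
  echo $(( (count * percent + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
surge_nodes 3 33    # prints 1
```

So a 10-node pool at 33% needs quota, regional capacity, and subnet IPs for four extra nodes before the upgrade can start.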

The MaxUnavailable fallback changes that. Instead of treating surge as an all-or-nothing requirement, AKS can now try your preferred surge value, fall back to a single surge node if that’s not possible, and if even that fails, fall back to an in-place upgrade using maxUnavailable. You get a more resilient upgrade path without having to decide the strategy in advance.

How the MaxUnavailable fallback works

You enable it by setting both maxSurge and maxUnavailable on your node pool. When both values are greater than zero, AKS follows this fallback strategy during upgrades:

  1. Try your configured maxSurge value first.
  2. If that’s not possible due to quota or capacity constraints, try a surge of just one node. This step only applies to agent pools running Kubernetes 1.35 or later.
  3. If that also fails, fall back to an in-place upgrade using maxUnavailable, cordoning and draining existing nodes without adding new ones.
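The order of those three steps can be sketched as a toy model. This is not how AKS is implemented: the function name pick_upgrade_path and the idea of a single "available nodes" number are assumptions for illustration, since AKS does not expose how much capacity it can actually get.

```shell
# pick_upgrade_path REQUESTED_SURGE AVAILABLE_NODES K8S_MINOR
# Toy model only. AVAILABLE_NODES stands in for whatever quota, regional
# capacity, and subnet IPs allow; K8S_MINOR is the minor version (35 for 1.35).
pick_upgrade_path() {
  local requested="$1" available="$2" minor="$3"
  if [ "$available" -ge "$requested" ]; then
    echo "surge:$requested"   # step 1: the configured maxSurge fits
  elif [ "$minor" -ge 35 ] && [ "$available" -ge 1 ]; then
    echo "surge:1"            # step 2: single surge node (1.35 or later)
  else
    echo "in-place"           # step 3: fall back to maxUnavailable
  fi
}

pick_upgrade_path 4 4 35   # prints surge:4
pick_upgrade_path 4 2 35   # prints surge:1
pick_upgrade_path 4 2 34   # prints in-place
pick_upgrade_path 4 0 35   # prints in-place
```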

You’ll need the aks-preview Azure CLI extension and Azure CLI 2.34.1 or later. If you haven’t got the extension already, install and update it before you start:
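A minimal sketch of the install and update steps, using the standard az extension commands:

```shell
# Install the preview extension, then make sure it is on the latest version.
az extension add --name aks-preview
az extension update --name aks-preview

# Show the installed Azure CLI and extension versions.
az version
```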

Make sure your control plane is already on the target Kubernetes version before touching node pools. You can’t upgrade a node pool to a version higher than the control plane.
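One way to compare the two versions, again with placeholder names that I have assumed (myResourceGroup, myAKSCluster, userpool). The query paths kubernetesVersion and orchestratorVersion are the property names I would expect in the JSON output; if they come back empty, check the full output without --query.

```shell
# Control plane version of the cluster.
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query kubernetesVersion \
  --output tsv

# Version the node pool is currently set to.
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --query orchestratorVersion \
  --output tsv
```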

Here’s the command to configure the MaxUnavailable fallback on an existing node pool:
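A sketch of that command, assuming the same placeholder names (myResourceGroup, myAKSCluster, userpool) and the 33% surge with one unavailable node recommended later in this post:

```shell
# Placeholder names: replace with your own resource group, cluster, and pool.
# Both values greater than zero turns on the fallback behaviour.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --max-surge 33% \
  --max-unavailable 1
```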

Once that’s done, verify the settings applied:
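One way to check, assuming the placeholder names from earlier and that the values appear under the upgradeSettings property of the node pool:

```shell
# Print only the upgrade settings of the node pool.
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --query upgradeSettings \
  --output json
```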

You should see both maxSurge and maxUnavailable returned. If not, double-check your CLI version and extension.

Running an upgrade

You can also pass both values directly in the upgrade command. The settings will be saved on the node pool and used for future upgrades too, so this isn’t just a one-off override. This upgrades to a specific Kubernetes version with the fallback strategy in place:
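A sketch of that upgrade command. The names are placeholders, and <target-version> is deliberately left unfilled: use a version your control plane is already running.

```shell
# Placeholder names and version: replace before running.
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --kubernetes-version "<target-version>" \
  --max-surge 33% \
  --max-unavailable 1
```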

For a node image upgrade only, add --node-image-only and drop the --kubernetes-version flag. Everything else stays the same.
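The node image variant would then look something like this, with the same assumed placeholder names:

```shell
# Node image upgrade only: no --kubernetes-version flag.
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --node-image-only \
  --max-surge 33% \
  --max-unavailable 1
```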

Watching what happens

While the upgrade runs, filter for relevant events to see whether AKS is surging or falling back:
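A rough sketch of such a filter. The search terms are my assumption about which words show up in the event reasons and messages, so widen or narrow them to match what your cluster actually emits:

```shell
# Stream events from all namespaces and keep only upgrade-related lines.
# The pattern is a guess at useful keywords, not an official list.
kubectl get events --all-namespaces --watch \
  | grep -i -E 'surge|upgrade|drain|cordon'
```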

Keep an eye on node status in another terminal at the same time:
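For example, a simple watch on the node list:

```shell
# Stream node status changes (Ready, SchedulingDisabled, new or removed nodes).
kubectl get nodes --watch
```

Extra nodes appearing suggests surge is in play. Existing nodes going to SchedulingDisabled with no new nodes joining suggests the in-place path.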

These two together give you a real-time view of whether AKS is using surge nodes or falling back to your maxUnavailable setting.

Things to watch out for

maxUnavailable behaves differently from surge and it’s worth being clear on this before you rely on the fallback. With surge, AKS provisions extra nodes before draining anything, so pods have somewhere to land. With maxUnavailable, AKS cordons existing nodes and evicts pods into a pool that’s already under pressure. That means your Pod Disruption Budgets are more likely to block the drain, because there’s simply less room for pods to be rescheduled. If you’ve got tight PDBs and a busy node pool, the fallback path can stall just as easily as a failed surge.

It’s also worth knowing that you cannot use maxUnavailable on system node pools. The fallback only applies to user node pools.

Before you test this, check whether your cluster has room to handle nodes being taken out of service. These three commands give you a quick picture:
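Run together, the three checks look like this (kubectl top nodes needs the metrics API to be available in the cluster):

```shell
# Pod Disruption Budgets in every namespace.
kubectl get pdb -A

# Current CPU and memory usage per node.
kubectl top nodes

# Node count and state.
kubectl get nodes
```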

kubectl get pdb -A shows all Pod Disruption Budgets across namespaces. If any have ALLOWED DISRUPTIONS at zero, the drain will block.

kubectl top nodes shows current CPU and memory usage per node. If nodes are already running hot, evicting pods from one will put pressure on the rest.

kubectl get nodes confirms how many nodes you have and their current state, so you know whether there’s realistic capacity to absorb the disruption.
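If you have a lot of budgets, a small filter can pick out the ones that would block a drain. This is a sketch: blocked_pdbs is a helper name I made up, and it assumes the default column order of kubectl get pdb -A (namespace, name, min available, max unavailable, allowed disruptions, age).

```shell
# blocked_pdbs: reads `kubectl get pdb -A --no-headers` output on stdin and
# prints namespace/name for every budget with zero allowed disruptions.
# Assumption: ALLOWED DISRUPTIONS is the fifth column.
blocked_pdbs() {
  awk '$5 == 0 { print $1 "/" $2 }'
}

# Demo with two made-up rows; only the first has zero allowed disruptions.
printf '%s\n' \
  'default  web  2    N/A  0  12d' \
  'default  api  N/A  1    1  12d' | blocked_pdbs   # prints default/web
```

Against a live cluster you would run kubectl get pdb -A --no-headers | blocked_pdbs instead of the demo rows.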

Start with --max-surge 33% --max-unavailable 1 to keep surge as the preferred path and limit the fallback to one unavailable node at a time. Only increase maxUnavailable if you’ve tested it first. This is still a preview feature, so I’d recommend trying it in a non-production cluster before relying on it anywhere critical.

Wrapping up

I’ve seen enough AKS upgrades fail or drag on because of capacity issues to know how frustrating it can be. The MaxUnavailable fallback is a small but useful addition to the AKS upgrade story. It lets AKS try the normal surge upgrade path first, but if Azure can’t allocate the extra capacity, it falls back to a more controlled in-place upgrade using maxUnavailable.

If you’re running production clusters, or even just testing out new upgrade strategies, this is definitely worth a look. It gives you a more flexible upgrade path, and you don’t have to choose between surge and in-place upgrades up front. I’ll be testing it in a few of my own clusters and watching closely to see how it behaves in the real world. If you try it, let me know how it goes. I am always interested to hear real-world stories!


Pixel Robots.

I’m Richard Hooper aka Pixel Robots. I started this blog in 2016 for a couple of reasons. The first was to have a place to store my step-by-step guides, troubleshooting guides and just plain ideas about being a sysadmin. The second was to share what I have learned and found out with other people like me. Hopefully, you can find something useful on the site.
