Azure Kubernetes Service (AKS) has just released a new preview feature called Optimize for undrainable node behavior, designed to make the upgrade process smoother. This feature lets you control how upgrades handle nodes that can’t be drained, reducing disruptions and keeping your cluster running more predictably during upgrades.
If you’ve read my previous post on Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions, you already know the importance of effective upgrade management in AKS. This new feature builds on those practices, introducing two configurable behaviors for handling undrainable nodes.
What is the Optimize for Undrainable Node Behavior Feature?
During upgrades, AKS needs to drain nodes to perform updates. When nodes fail to drain, the default behavior, Schedule, stops the upgrade process and leaves the node in a schedulable state. The new Cordon behavior changes this by allowing AKS to quarantine undrainable nodes and continue with the rest of the upgrade.
Here’s a quick comparison of these two behaviors:
- Schedule (Default): If a node fails to drain, the upgrade operation stops, and the node remains schedulable.
- Cordon: If a node fails to drain, AKS marks it with the label kubernetes.azure.com/upgrade-status: Quarantined, cordons it so that no new pods are scheduled on it, and isolates it from the upgrade. The upgrade then proceeds with other nodes, letting you troubleshoot quarantined nodes without affecting the upgrade for the entire node pool.
Why Use the Cordon Behavior?
If you’re running applications with strict PodDisruptionBudgets or managing large, complex clusters, there’s a good chance you’ve faced issues with undrainable nodes during upgrades. The Cordon option lets you handle these nodes separately, reducing upgrade interruptions and giving you more time to investigate issues.
How to Enable and Use the Cordon Behavior
To start using this feature, you’ll need the latest aks-preview extension. Follow these steps:
Step 1: Update or Install aks-preview Extension (9.0.0b3 or later)
Open your Azure CLI and run:
```shell
# Update the extension if it is already installed
az extension update --name aks-preview
# Or install it if it is not
az extension add --name aks-preview
```
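If you would rather not work out which of the two commands applies, the choice can be scripted. The following is a minimal sketch, not an official helper: the function name ensure_aks_preview is hypothetical, and it assumes that az extension show exits with a non-zero status when the extension is not installed. It finishes by printing the installed version so you can compare it with 9.0.0b3 yourself.

```shell
# Sketch: install aks-preview if it is missing, otherwise update it,
# then print the installed version for a manual check against 9.0.0b3.
# ensure_aks_preview is a hypothetical helper name, not part of the Azure CLI.
ensure_aks_preview() {
  if az extension show --name aks-preview --output none 2>/dev/null; then
    # Extension already present: bring it up to date
    az extension update --name aks-preview
  else
    # Extension missing: install it
    az extension add --name aks-preview
  fi
  # Print only the version string
  az extension show --name aks-preview --query version --output tsv
}
```

Load the function into your shell session and run ensure_aks_preview once before moving on to Step 2.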
Step 2: Set Undrainable Node Behavior to Cordon
Update the node pool behavior with the following command. Be sure to replace pixelrobots, pixelpool, and pixelrg with your own cluster name, node pool name, and resource group name:
```shell
az aks nodepool update --cluster-name pixelrobots --name pixelpool --resource-group pixelrg --max-surge 1 --undrainable-node-behavior Cordon
```
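To confirm the setting was applied, you can read it back from the node pool. This is a sketch under an assumption: the query path upgradeSettings.undrainableNodeBehavior is my reading of how the preview API exposes the value, so check the full output of az aks nodepool show if the query returns nothing. The function name show_undrainable_behavior is hypothetical.

```shell
# Sketch: print the undrainable node behavior currently set on a node pool.
# Arguments: cluster name, node pool name, resource group.
# The --query path is an assumption about the preview API's property name.
show_undrainable_behavior() {
  az aks nodepool show \
    --cluster-name "$1" \
    --name "$2" \
    --resource-group "$3" \
    --query upgradeSettings.undrainableNodeBehavior \
    --output tsv
}
```

With the example names from this post, the call would be show_undrainable_behavior pixelrobots pixelpool pixelrg.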
Step 3: Verify Quarantined Nodes
After enabling Cordon, any nodes that fail to drain will be labeled as quarantined. Run this command to check:
```shell
kubectl get nodes --show-labels=true
```
AKS marks quarantined nodes with the label kubernetes.azure.com/upgrade-status: Quarantined, cordoning them so that no new pods are scheduled on them and isolating them from the upgrade.
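On a cluster with many nodes, reading through every label is tedious. A label selector narrows the output to just the quarantined nodes. This is a small sketch: the label key and value come from the description above, while the variable and function names are hypothetical.

```shell
# Label that AKS applies to quarantined nodes, written as a kubectl selector
QUARANTINE_SELECTOR='kubernetes.azure.com/upgrade-status=Quarantined'

# Sketch: print only the nodes carrying the quarantine label, one per line,
# in the form node/<name>. list_quarantined_nodes is a hypothetical helper name.
list_quarantined_nodes() {
  kubectl get nodes --selector "$QUARANTINE_SELECTOR" --output name
}
```

An empty result means no node in the cluster currently carries the quarantine label.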
Troubleshooting Quarantined Nodes
If you have nodes quarantined after an upgrade, follow these steps to troubleshoot and restore your node pool:
- Identify and Address Blocked Nodes
- Check for any policies or configurations that may be blocking the drain. For instance, a strict PodDisruptionBudget (PDB) could prevent node draining. If a PDB is the cause and you can safely do without it, remove it using:
```shell
kubectl delete pdb <your-pdb-name>
```
- Delete Blocked Nodes (If Necessary)
- If a quarantined node is no longer needed, you can remove it entirely from the node pool. Replace pixelrobots, pixelpool-vmss000001, pixelpool, and pixelrg with the appropriate names for your setup:
```shell
az aks nodepool delete-machines --cluster-name pixelrobots --machine-names pixelpool-vmss000001 --name pixelpool --resource-group pixelrg
```
- Reconcile Node Pool Size
- After troubleshooting or removing blocked nodes, scale the node pool back to its intended size to ensure all nodes are upgraded and the node pool status is restored:
```shell
az aks nodepool scale --resource-group pixelrg --cluster-name pixelrobots --name pixelpool --node-count 2
```
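For the first troubleshooting step, it helps to know which PDBs are candidates before deleting anything. The sketch below lists every PDB that currently allows zero disruptions, which is one common reason a drain is refused, though not the only one. The function name list_blocking_pdbs is hypothetical; the approach relies on the disruptionsAllowed field in each PDB's status.

```shell
# Sketch: list PodDisruptionBudgets that currently allow zero disruptions,
# printed as namespace/name. list_blocking_pdbs is a hypothetical helper name.
list_blocking_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    --output custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    # Third column is the number of disruptions the PDB allows right now
    awk '$3 == 0 { print $1 "/" $2 }'
}
```

A PDB showing up here is not necessarily misconfigured. It may simply be protecting pods that are not ready yet, so check the workload before removing the budget.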
When Should You Use Cordon vs. Schedule?
The Cordon behavior is useful for complex environments where pod availability is sensitive and upgrades can’t afford interruptions. Here are two situations where it might help:
- Applications with Strict PodDisruptionBudgets: Keeps a node blocked by a PDB from halting the upgrade of the whole node pool.
- Multi-zone Clusters: Useful when you want the flexibility to troubleshoot specific nodes without affecting the rest.
If your cluster generally upgrades smoothly without issues, the default Schedule behavior may be sufficient. But if you’ve faced issues with node drain failures in the past, it’s worth testing the Cordon option.
Final Thoughts
This new feature adds another layer of control to AKS upgrades, letting you manage undrainable nodes without halting the entire upgrade process. For a deeper dive into optimizing AKS upgrades, check out my previous post, Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions.
Give the Cordon behavior a try if you’re running into upgrade issues. While it may not be necessary for every AKS setup, it’s a valuable tool for handling complex upgrade scenarios with minimal disruption.