AKS Preview Feature: Optimizing Upgrades with Undrainable Node Behavior

Azure Kubernetes Service (AKS) has just released a new preview feature called Optimize for undrainable node behavior, designed to make the upgrade process smoother. This feature lets you control how upgrades handle nodes that can’t be drained, reducing disruptions and keeping your cluster running more predictably during upgrades.

If you’ve read my previous post on Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions, you already know the importance of effective upgrade management in AKS. This new feature builds on those practices, introducing two configurable behaviors for handling undrainable nodes.

What is the Optimize for Undrainable Node Behavior Feature?

During upgrades, AKS needs to drain nodes to perform updates. When nodes fail to drain, the default behavior, Schedule, stops the upgrade process and leaves the node in a schedulable state. The new Cordon behavior changes this by allowing AKS to quarantine undrainable nodes and continue with the rest of the upgrade.

Here’s a quick comparison of these two behaviors:

Schedule (Default): If a node fails to drain, the upgrade operation stops, and the node remains schedulable.
Cordon: If a node fails to drain, AKS labels it kubernetes.azure.com/upgrade-status: Quarantined, marks it unschedulable so no new pods land on it, and sets it aside from the upgrade. The upgrade then proceeds with the other nodes, so you can troubleshoot quarantined nodes without holding up the rest of the node pool.

Why Use the Cordon Behavior?

If you’re running applications with strict PodDisruptionBudgets or managing large, complex clusters, there’s a good chance you’ve faced issues with undrainable nodes during upgrades. The Cordon option lets you handle these nodes separately, reducing upgrade interruptions and giving you more time to investigate issues.

How to Enable and Use the Cordon Behavior

To start using this feature, you’ll need the latest aks-preview extension. Follow these steps:

Step 1: Update or Install `aks-preview` Extension (9.0.0b3 or later)

Open your Azure CLI. If the extension is already installed, run the update command; if it isn't, run the add command:

az extension update --name aks-preview
az extension add --name aks-preview
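
Because only one of the two commands applies on a given machine, a small wrapper can pick the right one. This is a minimal sketch: the function names ensure_aks_preview and aks_preview_version are hypothetical, and it assumes az extension show exits with a non-zero status when the extension is not installed.

```shell
# Install aks-preview if it is missing, otherwise update it.
ensure_aks_preview() {
  if az extension show --name aks-preview >/dev/null 2>&1; then
    az extension update --name aks-preview
  else
    az extension add --name aks-preview
  fi
}

# Print the installed aks-preview version so it can be compared
# by eye against the 9.0.0b3 minimum.
aks_preview_version() {
  az extension show --name aks-preview --query version --output tsv
}
```

Call ensure_aks_preview once, then aks_preview_version to confirm the installed version is 9.0.0b3 or later.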

Step 2: Set Undrainable Node Behavior to Cordon

Update the node pool behavior with the following command. Be sure to replace pixelrobots, pixelpool, and pixelrg with your own cluster name, node pool name, and resource group name:

az aks nodepool update --cluster-name pixelrobots --name pixelpool --resource-group pixelrg --max-surge 1 --undrainable-node-behavior Cordon
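
To confirm the change took effect, you can read the setting back from the node pool. This is a sketch under assumptions: the function name show_undrainable_behavior is hypothetical, and the query path upgradeSettings.undrainableNodeBehavior is my assumption about where the value appears in the command's JSON output. If it prints nothing, inspect the full output of az aks nodepool show instead.

```shell
# Print the undrainable node behavior currently set on a node pool.
# Usage: show_undrainable_behavior <resource-group> <cluster> <nodepool>
# The --query path below is an assumed location in the JSON output.
show_undrainable_behavior() {
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query "upgradeSettings.undrainableNodeBehavior" \
    --output tsv
}
```

For the example names used in this post, that would be show_undrainable_behavior pixelrg pixelrobots pixelpool.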

Step 3: Verify Quarantined Nodes

After enabling Cordon, any nodes that fail to drain will be labeled as quarantined. Run this command to check:

kubectl get nodes --show-labels=true

AKS marks quarantined nodes with the label kubernetes.azure.com/upgrade-status: Quarantined, makes them unschedulable for new pods, and isolates them from the upgrade.
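
On a larger cluster, scanning every label by eye gets tedious. A label selector can narrow the output to quarantined nodes only. This is a minimal sketch: the function name quarantined_nodes is hypothetical, and the label key and value are the ones quoted above.

```shell
# List only the nodes carrying the quarantine label, one per line,
# in kubectl's "node/<name>" form.
quarantined_nodes() {
  kubectl get nodes \
    --selector 'kubernetes.azure.com/upgrade-status=Quarantined' \
    --output name
}
```

An empty result means no node currently carries the quarantine label.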

Troubleshooting Quarantined Nodes

If you have nodes quarantined after an upgrade, follow these steps to troubleshoot and restore your node pool:

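Identify and Address Blocked Nodes

Find out what is blocking the drain and resolve it. If the cause is a PodDisruptionBudget that allows no disruptions, one option is to delete it: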

kubectl delete pdb <your-pdb-name>
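
Before deleting anything, it can help to see which PodDisruptionBudgets currently allow zero disruptions, since those are the likely candidates for blocking a drain. This is a rough sketch: the function name blocking_pdbs is hypothetical, and it relies on the status.disruptionsAllowed field of each PDB.

```shell
# Print namespace/name for every PDB whose status currently
# allows zero voluntary disruptions.
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    --output jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    awk -F '\t' '$3 == 0 { print $1 "/" $2 }'
}
```

A PDB on this list is not necessarily misconfigured. It may simply reflect too few healthy replicas at the moment, so scaling the workload up can be a gentler fix than deleting the budget.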

Delete Blocked Nodes (If Necessary)

If a quarantined node still can't be drained, delete its machine from the node pool:

az aks nodepool delete-machines --cluster-name pixelrobots --machine-names pixelpool-vmss000001 --name pixelpool --resource-group pixelrg

Reconcile Node Pool Size

Once the blocked nodes are dealt with, scale the node pool to the node count you want:

az aks nodepool scale --resource-group pixelrg --cluster-name pixelrobots --name pixelpool --node-count 2

When Should You Use Cordon vs. Schedule?

The Cordon behavior is useful for complex environments where pod availability is sensitive and upgrades can’t afford interruptions. Here are two situations where it might help:

Applications with Strict PodDisruptionBudgets: Keeps a drain blocked by a PDB on one node from stopping the upgrade of the whole node pool.
Multi-zone Clusters: Useful when you want the flexibility to troubleshoot specific nodes without affecting the rest.

If your cluster generally upgrades smoothly without issues, the default Schedule behavior may be sufficient. But if you’ve faced issues with node drain failures in the past, it’s worth testing the Cordon option.

Final Thoughts

This new feature adds another layer of control to AKS upgrades, letting you manage undrainable nodes without halting the entire upgrade process. For a deeper dive into optimizing AKS upgrades, check out my previous post, Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions.

Give the Cordon behavior a try if you’re running into upgrade issues. While it may not be necessary for every AKS setup, it’s a valuable tool for handling complex upgrade scenarios with minimal disruption.

Published by Pixel Robots. on November 6, 2024