
Azure Kubernetes Service (AKS) has just released a new preview feature called Optimize for undrainable node behavior, designed to make the upgrade process smoother. This feature lets you control how upgrades handle nodes that can’t be drained, reducing disruptions and keeping your cluster running more predictably during upgrades.

If you’ve read my previous post on Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions, you already know the importance of effective upgrade management in AKS. This new feature builds on those practices, introducing two configurable behaviors for handling undrainable nodes.

What is the Optimize for Undrainable Node Behavior Feature?

During upgrades, AKS needs to drain nodes to perform updates. When nodes fail to drain, the default behavior, Schedule, stops the upgrade process and leaves the node in a schedulable state. The new Cordon behavior changes this by allowing AKS to quarantine undrainable nodes and continue with the rest of the upgrade.

Here’s a quick comparison of these two behaviors:

  • Schedule (Default): If a node fails to drain, the upgrade operation stops, and the node remains schedulable.
  • Cordon: If a node fails to drain, AKS labels it kubernetes.azure.com/upgrade-status: Quarantined, marks it unschedulable so no new pods land on it, and skips it for the rest of the upgrade. The upgrade then proceeds with the other nodes, so you can troubleshoot quarantined nodes without holding up the entire node pool.

Why Use the Cordon Behavior?

If you’re running applications with strict PodDisruptionBudgets or managing large, complex clusters, there’s a good chance you’ve faced issues with undrainable nodes during upgrades. The Cordon option lets you handle these nodes separately, reducing upgrade interruptions and giving you more time to investigate issues.

How to Enable and Use the Cordon Behavior

To start using this feature, you’ll need the latest aks-preview extension. Follow these steps:

Step 1: Update or Install aks-preview Extension (9.0.0b3 or later)

Open your Azure CLI and install the aks-preview extension, or update it if you already have it installed.
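
A minimal sketch of this step, assuming the standard az extension show, az extension add, and az extension update commands. The wrapper function name install_or_update_aks_preview is hypothetical and only there so the step can be rerun safely:

```shell
# Install aks-preview if it is missing, otherwise update it.
# The function name is illustrative; the az subcommands are the
# standard Azure CLI extension commands.
install_or_update_aks_preview() {
  if az extension show --name aks-preview >/dev/null 2>&1; then
    # Extension already present: pull the latest version.
    az extension update --name aks-preview
  else
    # Extension not found: install it.
    az extension add --name aks-preview
  fi
}
```

Call install_or_update_aks_preview after az login, then check the version reported by az extension show --name aks-preview to confirm you are on 9.0.0b3 or later.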

Step 2: Set Undrainable Node Behavior to Cordon

Update the node pool with az aks nodepool update, setting --undrainable-node-behavior to Cordon. The example names used here are pixelrobots (cluster), pixelpool (node pool), and pixelrg (resource group); be sure to replace them with your own.
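
Here is a sketch of that update, assuming the aks-preview extension from Step 1 is installed. The variable names and the set_cordon_behavior wrapper are hypothetical; the values are the example names from this post:

```shell
# Example names from the post; replace with your own.
CLUSTER_NAME="pixelrobots"
NODE_POOL_NAME="pixelpool"
RESOURCE_GROUP="pixelrg"

# Switch the node pool's undrainable node behavior to Cordon.
set_cordon_behavior() {
  az aks nodepool update \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODE_POOL_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --undrainable-node-behavior Cordon
}
```

Run set_cordon_behavior once per node pool you want to change. To go back to the default, the same command with Schedule in place of Cordon should do it.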

Step 3: Verify Quarantined Nodes

After enabling Cordon, any nodes that fail to drain during an upgrade will be labeled as quarantined. You can check for them by listing the nodes together with their labels.
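
A minimal sketch of that check, assuming kubectl is already pointed at the cluster. The function names are hypothetical; the label key and value are the ones AKS applies to quarantined nodes:

```shell
# Label that AKS applies to nodes it could not drain (from the post).
QUARANTINE_LABEL="kubernetes.azure.com/upgrade-status=Quarantined"

# List only the quarantined nodes by using a label selector.
list_quarantined_nodes() {
  kubectl get nodes --selector "$QUARANTINE_LABEL"
}

# Show every node together with all of its labels.
list_nodes_with_labels() {
  kubectl get nodes --show-labels
}
```

list_quarantined_nodes is the quicker check on a large cluster. If nothing is quarantined, it returns no nodes.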

Quarantined nodes carry the label kubernetes.azure.com/upgrade-status: Quarantined, are marked unschedulable, and are left out of the rest of the upgrade.

Troubleshooting Quarantined Nodes

If you have nodes quarantined after an upgrade, follow these steps to troubleshoot and restore your node pool:

  1. Identify and Address Blocked Nodes
    • Check for any policies or configurations that may be blocking the drain. For instance, a strict PodDisruptionBudget (PDB) could prevent node draining. If a PDB is the cause, remove it with kubectl delete pdb so the node can drain.
  2. Delete Blocked Nodes (If Necessary)
    • If a quarantined node is no longer needed, you can remove it entirely from the node pool with az aks nodepool delete-machines. Substitute your own cluster, machine, node pool, and resource group names for the examples pixelrobots, pixelpool-vmss000001, pixelpool, and pixelrg.
  3. Reconcile Node Pool Size
    • After troubleshooting or removing blocked nodes, scale the node pool back to its intended size with az aks nodepool scale so that all nodes are upgraded and the node pool status is restored.
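
The three steps above can be sketched as follows. The function names, arguments, and variables are hypothetical helpers; the underlying commands are kubectl delete pdb, az aks nodepool delete-machines, and az aks nodepool scale, and the values are the example names from this post:

```shell
# Example names from the post; replace with your own.
CLUSTER_NAME="pixelrobots"
NODE_POOL_NAME="pixelpool"
RESOURCE_GROUP="pixelrg"

# Step 1: delete the PDB that is blocking the drain.
# Arguments: PDB name, namespace.
delete_pdb() {
  kubectl delete pdb "$1" --namespace "$2"
}

# Step 2: remove a quarantined machine from the node pool.
# Argument: machine name, for example pixelpool-vmss000001.
delete_quarantined_machine() {
  az aks nodepool delete-machines \
    --cluster-name "$CLUSTER_NAME" \
    --machine-names "$1" \
    --name "$NODE_POOL_NAME" \
    --resource-group "$RESOURCE_GROUP"
}

# Step 3: scale the node pool back to its intended size.
# Argument: desired node count.
scale_node_pool() {
  az aks nodepool scale \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODE_POOL_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --node-count "$1"
}
```

For example, delete_quarantined_machine pixelpool-vmss000001 followed by scale_node_pool 3 removes the example node and brings the pool back to three nodes (the count here is made up for illustration). Before calling delete_pdb, consider saving the PDB with kubectl get pdb and -o yaml so you can reapply it once the upgrade is done.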

When Should You Use Cordon vs. Schedule?

The Cordon behavior is useful for complex environments where pod availability is sensitive and upgrades can’t afford interruptions. Here are two situations where it might help:

  • Applications with Strict PodDisruptionBudgets: A PDB can still block a node from draining, but with Cordon the blocked node is quarantined instead of failing the whole upgrade.
  • Multi-zone Clusters: Useful when you want the flexibility to troubleshoot specific nodes without affecting the rest.

If your cluster generally upgrades smoothly without issues, the default Schedule behavior may be sufficient. But if you’ve faced issues with node drain failures in the past, it’s worth testing the Cordon option.
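
If you aren't sure whether PDBs are likely to block drains in your cluster, one quick way to check is to look for PDBs that currently allow zero disruptions. This is a rough sketch, assuming kubectl and awk are available; the function name list_blocking_pdbs and the column aliases are made up for illustration:

```shell
# Print namespace/name for every PDB that currently allows 0 disruptions.
# A PDB in this state will block eviction of the pods it covers.
list_blocking_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
    | awk '$3 == 0 { print $1 "/" $2 }'
}
```

If list_blocking_pdbs prints anything, those workloads are good candidates for a closer look before your next upgrade, and a sign that Cordon may be worth testing.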

Final Thoughts

This new feature adds another layer of control to AKS upgrades, letting you manage undrainable nodes without halting the entire upgrade process. For a deeper dive into optimizing AKS upgrades, check out my previous post, Optimizing AKS Upgrades to Improve Performance and Minimize Disruptions.

Give the Cordon behavior a try if you’re running into upgrade issues. While it may not be necessary for every AKS setup, it’s a valuable tool for handling complex upgrade scenarios with minimal disruption.


Pixel Robots.

I’m Richard Hooper, aka Pixel Robots. I started this blog in 2016 for a couple of reasons. The first was to have a place to store my step-by-step guides, troubleshooting guides, and plain ideas about being a sysadmin. The second was to share what I have learned with other people like me. Hopefully, you can find something useful on the site.
