If you’re running an Azure Kubernetes Service (AKS) cluster, you know how critical it is to monitor your node pools, especially when they hit their maximum node count and can’t scale further.
Recently, I faced the challenge of building an Azure alert that dynamically checks every node pool in an AKS cluster and fires if any of them hit their maxCount. Along the way, I ran into some tricky issues, from missing data and parsing problems to permission errors. In this guide, I’ll show you how to set up an AKS node pool maxCount alert to monitor when your cluster reaches its autoscaling limit.
Why Monitor Node Pool Maximums
In AKS, node pools are groups of VMs that run your Kubernetes workloads. With auto-scaling enabled, AKS automatically adds or removes nodes based on demand, but only up to a maximum (maxCount). If a node pool hits that limit, it won’t scale any further, leading to performance bottlenecks or even failed deployments.
Setting up alerts based on hardcoded thresholds is not ideal. The better approach is to create a dynamic alert that:
- Checks all node pools in your cluster
- Looks up each pool’s actual maxCount from Azure
- Compares it to the current number of Ready nodes
- Alerts you when a pool hits or exceeds that maximum
The Plan: Combine ARG with Log Analytics
The approach was straightforward in theory:
- Use Azure Resource Graph to retrieve each pool’s maxCount
- Use Log Analytics (KQL) to count how many nodes are actually running
- Join those results and alert when any pool is full
The idea was to join ARG data (node pool configs) with Log Analytics data (node counts) in a KQL query, then set up an alert to notify me when any node pool hits its limit. Sounds simple, right? Well, Azure had other plans!
The Journey: Hitting Roadblocks
Here are the key issues I ran into, and how I solved them.
Roadblock 1: ARG Permissions
The first issue came up immediately. When I tried to run the resource graph query in Log Analytics:
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters' and name == 'aks-pixelrobots-dev'
| extend agentPools = properties.agentPoolProfiles
| mv-expand agentPool = agentPools
| where agentPool.enableAutoScaling == true

I was greeted with this error:

Errors occurred while resolving remote entities. Cluster='https://ade.loganalytics.io/…': not authorized to access resource name: AzureResourceGraph…
Turns out, the arg("") function only works if:
- You (or the alert identity) have Reader access on the subscription
- The workspace supports ARG queries
The fix:
- Assign Reader to my user and to the managed identity of the alert
- Also assign Log Analytics Reader for workspace access
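Once those roles are in place, a quick sanity check confirms that arg("") is reachable from the workspace. This is just a sketch scoped to AKS clusters; adjust the projection to whatever you want to see:

// If this returns rows, the identity running the query can reach Azure Resource Graph.
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters'
| project name, resourceGroup, subscriptionId
| take 5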
Roadblock 2: mv-expand Limits
To extract each agent pool from the cluster properties, I used mv-expand. But then I hit this error:
Azure Resource Graph: At least one mvexpand operator has a row limit of 2147483647, which exceeds the max limit of 2000.
Even though my cluster only had a few pools, mv-expand tried to over-fetch. The fix was simple: add a limit to the expansion.
Here’s the pattern I used:
| mv-expand agentPool = agentPools limit 50
This tells the engine to expand only up to 50 pools, which is more than enough for a typical AKS cluster.
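If you want to double-check that 50 is comfortably above your real pool count, a small sketch like this (reusing the example cluster name from this post) lists every pool and its scaling settings:

// List all agent pools on the example cluster, with their autoscaling settings,
// so you can confirm the mv-expand limit is more than enough.
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters' and name == 'aks-pixelrobots-dev'
| extend agentPools = properties.agentPoolProfiles
| mv-expand agentPool = agentPools limit 50
| project
    poolName = tostring(agentPool["name"]),
    autoScaling = tobool(agentPool["enableAutoScaling"]),
    maxCount = toint(agentPool["maxCount"])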
Roadblock 3: Empty nodePoolName Values
Next, I had trouble joining the ARG data with the live node data. Specifically, I couldn’t get a valid node pool name from KubeNodeInventory.
Originally, I tried:
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| extend parsedLabels = parse_json(Labels)
| extend nodePoolName = tostring(parsedLabels["agentpool"])
But this returned blank values. The problem? Labels was an array, not a flat object.
Here’s the fix that worked:
| extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
This accesses the first object in the array, where the agentpool label lives. Adding tolower() ensures consistent casing when joining with ARG data.
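If you hit the same blank values, it’s worth looking at the raw Labels column first. A quick inspection query (same workspace and cluster assumed) makes the array shape obvious:

// Show the raw Labels payload for a few nodes. It arrives as an array wrapping a single
// object, which is why parsedLabels["agentpool"] is empty but parsedLabels[0].agentpool works.
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| project Computer, Labels
| take 3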
Roadblock 4: Wildly Inaccurate Node Counts
Even after everything looked good, the counts were way off. Some node pools showed 200%+ usage, which is clearly wrong.
The mistake? I used count() to tally the nodes:
| summarize currentNodeCount = count() by ClusterName, nodePoolName
count() tallied all rows, including duplicates (e.g., multiple status updates per node). Plus, I wasn’t filtering for Ready nodes. I switched to:
| where Status == 'Ready'
| summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName
dcount(Computer) counts unique nodes, and Status == 'Ready' ensures only healthy nodes are included. This brought counts back to reality (e.g., 3 nodes for systempool).
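If you’re curious how far off count() was in your own workspace, a quick comparison sketch (using dcountif, which isn’t part of the final query) puts raw rows and distinct Ready nodes side by side:

// Compare raw inventory rows with distinct Ready nodes per pool to see the duplication.
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
| summarize
    rawRows = count(),
    readyNodes = dcountif(Computer, Status == 'Ready')
    by ClusterName, nodePoolName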
The Final Solution
Once all the pieces were working, I ended up with two queries: one for monitoring and one for alerting.
Monitoring Query
This query shows all auto-scaling node pools in your cluster and their current usage, so you can track everything in one view.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = name,
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by isAtMax desc, usagePercent desc
This gives a full overview of your cluster’s node pool usage.
clusterName          nodePoolName  currentNodeCount  maxCount  usagePercent  isAtMax
aks-pixelrobots-dev  systempool    3                 3         100.0         true
aks-pixelrobots-dev  devops        10                12        83.33         false
aks-pixelrobots-dev  pronodes      5                 30        16.67         false
Alert Query
This version filters the result to only include node pools that are at or over their maxCount. Use this one to power your alert rule.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = name,
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| where coalesce(currentNodeCount, 0) >= maxCount
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by usagePercent desc
Setting Up the Alert
Once you have the alert query working, setting up the actual alert in Azure Monitor is straightforward.
- Go to your Log Analytics workspace
- Paste in the alert query and run it
- Click “New alert rule”
- For the condition, use a custom log search
- Set it to trigger when results are greater than 0
- Set frequency and evaluation to 5 minutes
- Link to an action group with your preferred notifications
- Name your rule and save it
Don’t forget to assign the correct permissions (Reader + Log Analytics Reader) to the alert’s managed identity.
Why This Setup Is Worth It
With this solution in place, you’re getting:
- Real-time alerts based on live config and current usage
- Accurate node counting by filtering only Ready nodes
- Dynamic monitoring across all node pools, not just hardcoded ones
- A clean, reusable query setup you can tweak anytime
Tips and Ideas
- Want to monitor multiple clusters? Remove the name == clusterName filter (see the sketch after this list)
- Want to track trends? Use the monitoring query in a workbook or dashboard
- Want to route alerts to Teams or Slack? Add a webhook to your action group
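For the multi-cluster idea, only the ARG side really changes, plus the join needs the cluster name as an extra key so identically named pools in different clusters don’t collide. Here’s a rough sketch based on the monitoring query above, not something I’ve hardened for production:

// Multi-cluster variant: drop the cluster name filter on the ARG side and
// join on both cluster and pool name.
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters'
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = tolower(name),
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where Status == 'Ready'
    | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by clusterName = tolower(ClusterName), nodePoolName;
maxNodes
| join kind=leftouter currentNodes on clusterName, nodePoolName
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount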
Wrapping Up
This was a fun and educational deep dive into the power of combining Azure Resource Graph with KQL. If you’re running AKS and care about autoscaling, this alert setup gives you visibility and protection when things hit their limits.
Let me know if you give it a try or if you run into similar roadblocks. I’d love to hear how others are solving this challenge in their own clusters.
Happy scaling!