If you’re running an Azure Kubernetes Service (AKS) cluster, you know how critical it is to monitor your node pools, especially when they hit their maximum node count and can’t scale further.
Recently, I faced the challenge of building an Azure alert that dynamically checks every node pool in an AKS cluster and fires if any of them hit their maxCount. Along the way, I ran into some tricky issues, from missing data and parsing problems to permission errors. In this guide, I’ll show you how to set up an AKS node pool maxCount alert to monitor when your cluster reaches its autoscaling limit.
Why Monitor Node Pool Maximums
In AKS, node pools are groups of VMs that run your Kubernetes workloads. With auto-scaling enabled, AKS automatically adds or removes nodes based on demand, but only up to a maximum (maxCount). If a node pool hits that limit, it won’t scale any further, leading to performance bottlenecks or even failed deployments.
Setting up alerts based on hardcoded thresholds is not ideal. The better approach is to create a dynamic alert that:
- Checks all node pools in your cluster
- Looks up each pool’s actual maxCount from Azure
- Compares it to the current number of Ready nodes
- Alerts you when a pool hits or exceeds that maximum
The Plan: Combine ARG with Log Analytics
The approach was straightforward in theory:
- Use Azure Resource Graph to retrieve each pool’s maxCount
- Use Log Analytics (KQL) to count how many nodes are actually running
- Join those results and alert when any pool is full
The idea was to join ARG data (node pool configs) with Log Analytics data (node counts) in a KQL query, then set up an alert to notify me when any node pool hits its limit. Sounds simple, right? Well, Azure had other plans!
The Journey: Hitting Roadblocks
Here are the key issues I ran into, and how I solved them.
Roadblock 1: ARG Permissions
The first issue came up immediately. When I tried to run the resource graph query in Log Analytics:
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters' and name == 'aks-pixelrobots-dev'
| extend agentPools = properties.agentPoolProfiles
| mv-expand agentPool = agentPools
| where agentPool.enableAutoScaling == true

I was greeted with this error:

Errors occurred while resolving remote entities. Cluster='https://ade.loganalytics.io/…': not authorized to access resource name: AzureResourceGraph…
Turns out, the arg("") function only works if:
- You (or the alert identity) have Reader access on the subscription
- The workspace supports ARG queries
The fix:
- Assign Reader to my user and to the managed identity of the alert
- Also assign Log Analytics Reader for workspace access
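Once those roles are in place, a quick sanity check confirms that arg("") is reachable from the workspace. This is just a sketch scoped to AKS clusters; adjust the projection to whatever you want to see:

// If this returns rows, the identity running the query can reach Azure Resource Graph.
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters'
| project name, resourceGroup, subscriptionId
| take 5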
Roadblock 2: mv-expand Limits
To extract each agent pool from the cluster properties, I used mv-expand. But then I hit this error:
Azure Resource Graph: At least one mvexpand operator has a row limit of 2147483647, which exceeds the max limit of 2000.
Even though my cluster only had a few pools, mv-expand tried to over-fetch. The fix was simple: add a limit to the expansion.
Here’s the pattern I used:
| mv-expand agentPool = agentPools limit 50
This tells the engine to expand only up to 50 pools, which is more than enough for a typical AKS cluster.
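If you want to double-check that 50 is comfortably above your real pool count, a small sketch like this (reusing the example cluster name from this post) lists every pool and its scaling settings:

// List all agent pools on the example cluster, with their autoscaling settings,
// so you can confirm the mv-expand limit is more than enough.
arg("").Resources
| where type == 'microsoft.containerservice/managedclusters' and name == 'aks-pixelrobots-dev'
| extend agentPools = properties.agentPoolProfiles
| mv-expand agentPool = agentPools limit 50
| project
    poolName = tostring(agentPool["name"]),
    autoScaling = tobool(agentPool["enableAutoScaling"]),
    maxCount = toint(agentPool["maxCount"])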
Roadblock 3: Empty nodePoolName Values
Next, I had trouble joining the ARG data with the live node data. Specifically, I couldn’t get a valid node pool name from KubeNodeInventory.
Originally, I tried:
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| extend parsedLabels = parse_json(Labels)
| extend nodePoolName = tostring(parsedLabels["agentpool"])
But this returned blank values. The problem? Labels was an array, not a flat object.
Here’s the fix that worked:
| extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
This accesses the first object in the array, where the agentpool label lives. Adding tolower() ensures consistent casing when joining with ARG data.
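If you hit the same blank values, it’s worth looking at the raw Labels column first. A quick inspection query (same workspace and cluster assumed) makes the array shape obvious:

// Show the raw Labels payload for a few nodes. It arrives as an array wrapping a single
// object, which is why parsedLabels["agentpool"] is empty but parsedLabels[0].agentpool works.
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| project Computer, Labels
| take 3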
Roadblock 4: Wildly Inaccurate Node Counts
Even after everything looked good, the counts were way off. Some node pools showed 200%+ usage, which is clearly wrong.
The mistake? I used count() to tally the nodes:
| summarize currentNodeCount = count() by ClusterName, nodePoolName
count() tallied all rows, including duplicates (e.g., multiple status updates per node). Plus, I wasn’t filtering for Ready nodes. I switched to:
| where Status == 'Ready'
| summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName
dcount(Computer) counts unique nodes, and Status == 'Ready' ensures only healthy nodes are included. This brought counts back to reality (e.g., 3 nodes for systempool).
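If you’re curious how far off count() was in your own workspace, a quick comparison sketch (using dcountif, which isn’t part of the final query) puts raw rows and distinct Ready nodes side by side:

// Compare raw inventory rows with distinct Ready nodes per pool to see the duplication.
KubeNodeInventory
| where TimeGenerated > ago(5m)
| where ClusterName == 'aks-pixelrobots-dev'
| extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
| summarize
    rawRows = count(),
    readyNodes = dcountif(Computer, Status == 'Ready')
    by ClusterName, nodePoolName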
The Final Solution
Once all the pieces were working, I ended up with two queries: one for monitoring and one for alerting.
Monitoring Query
This query shows all auto-scaling node pools in your cluster and their current usage, so you can track everything in one view.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = name,
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by isAtMax desc, usagePercent desc
This gives a full overview of your cluster’s node pool usage.
clusterName          nodePoolName  currentNodeCount  maxCount  usagePercent  isAtMax
aks-pixelrobots-dev  systempool    3                 3         100.0         true
aks-pixelrobots-dev  devops        10                12        83.33         false
aks-pixelrobots-dev  pronodes      5                 30        16.67         false
Alert Query
This version filters the result to only include node pools that are at or over their maxCount. Use this one to power your alert rule.
let clusterName = "aks-pixelrobots-dev";
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters' and name == clusterName
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = name,
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where ClusterName == clusterName
    | where Status == 'Ready'
    | extend parsedLabels = parse_json(Labels)
    | extend nodePoolName = tolower(tostring(parsedLabels[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by ClusterName, nodePoolName;
maxNodes
| join kind=leftouter currentNodes on $left.nodePoolName == $right.nodePoolName
| where coalesce(currentNodeCount, 0) >= maxCount
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    usagePercent = iff(maxCount > 0, 100.0 * coalesce(currentNodeCount, 0) / maxCount, 0.0),
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount
| order by usagePercent desc
Setting Up the Alert
Once you have the alert query working, setting up the actual alert in Azure Monitor is straightforward.
- Go to your Log Analytics workspace
- Paste in the alert query and run it
- Click “New alert rule”
- For the condition, use a custom log search
- Set it to trigger when results are greater than 0
- Set frequency and evaluation to 5 minutes
- Link to an action group with your preferred notifications
- Name your rule and save it
Don’t forget to assign the correct permissions (Reader + Log Analytics Reader) to the alert’s managed identity.
Why This Setup Is Worth It
With this solution in place, you’re getting:
- Real-time alerts based on live config and current usage
- Accurate node counting by filtering only Ready nodes
- Dynamic monitoring across all node pools, not just hardcoded ones
- A clean, reusable query setup you can tweak anytime
Tips and Ideas
- Want to monitor multiple clusters? Remove the name == clusterName filter (see the sketch after this list)
- Want to track trends? Use the monitoring query in a workbook or dashboard
- Want to route alerts to Teams or Slack? Add a webhook to your action group
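For the multi-cluster idea, only the ARG side really changes, plus the join needs the cluster name as an extra key so identically named pools in different clusters don’t collide. Here’s a rough sketch based on the monitoring query above, not something I’ve hardened for production:

// Multi-cluster variant: drop the cluster name filter on the ARG side and
// join on both cluster and pool name.
let maxNodes = arg("").Resources
    | where type == 'microsoft.containerservice/managedclusters'
    | extend agentPools = properties.agentPoolProfiles
    | mv-expand agentPool = agentPools limit 50
    | where agentPool.enableAutoScaling == true
    | project
        clusterName = tolower(name),
        nodePoolName = tolower(tostring(agentPool["name"])),
        maxCount = toint(agentPool["maxCount"]);
let currentNodes = KubeNodeInventory
    | where TimeGenerated > ago(5m)
    | where Status == 'Ready'
    | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
    | where nodePoolName != ''
    | summarize currentNodeCount = dcount(Computer) by clusterName = tolower(ClusterName), nodePoolName;
maxNodes
| join kind=leftouter currentNodes on clusterName, nodePoolName
| project
    clusterName,
    nodePoolName,
    currentNodeCount = coalesce(currentNodeCount, 0),
    maxCount,
    isAtMax = coalesce(currentNodeCount, 0) >= maxCount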
Wrapping Up
This was a fun and educational deep dive into the power of combining Azure Resource Graph with KQL. If you’re running AKS and care about autoscaling, this alert setup gives you visibility and protection when things hit their limits.
Let me know if you give it a try or if you run into similar roadblocks. I’d love to hear how others are solving this challenge in their own clusters.
Happy scaling!