If you’re running an Azure Kubernetes Service (AKS) cluster, you know how critical it is to monitor your node pools, especially when they hit their maximum node count and can’t scale further.

Recently, I faced the challenge of building an Azure alert that dynamically checks every node pool in an AKS cluster and fires if any of them hit their maxCount. Along the way, I ran into some tricky issues, from missing data and parsing problems to permission errors. In this guide, I’ll show you how to set up an AKS node pool maxCount alert to monitor when your cluster reaches its autoscaling limit.

Why Monitor Node Pool Maximums

In AKS, node pools are groups of VMs that run your Kubernetes workloads. With auto-scaling enabled, AKS automatically adds or removes nodes based on demand, but only up to a maximum (maxCount). If a node pool hits that limit, it won’t scale any further—leading to performance bottlenecks or even failed deployments.

Setting up alerts based on hardcoded thresholds is not ideal. The better approach is to create a dynamic alert that:

  • Checks all node pools in your cluster
  • Looks up each pool’s actual maxCount from Azure
  • Compares it to the current number of Ready nodes
  • Alerts you when a pool hits or exceeds that maximum

The Plan: Combine ARG with Log Analytics

The approach was straightforward in theory:

  1. Use Azure Resource Graph to retrieve each pool’s maxCount
  2. Use Log Analytics (KQL) to count how many nodes are actually running
  3. Join those results and alert when any pool is full

The idea was to join ARG data (node pool configs) with Log Analytics data (node counts) in a KQL query, then set up an alert to notify me when any node pool hits its limit. Sounds simple, right? Well, Azure had other plans!

The Journey: Hitting Roadblocks

Here are the key issues I ran into, and how I solved them.

Roadblock 1: ARG Permissions

The first issue came up as soon as I tried to run the resource graph query in Log Analytics:

Errors occurred while resolving remote entities. Cluster='https://ade.loganalytics.io/…': not authorized to access resource name: AzureResourceGraph…

Turns out, the arg("") function only works if:

  • You (or the alert identity) have Reader access on the subscription
  • The workspace supports ARG queries

The fix:

  • Assign Reader to my user and to the managed identity of the alert
  • Also assign Log Analytics Reader for workspace access
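
Once both roles are assigned, a quick sanity check is a minimal cross-service query run from the workspace (a sketch; it only proves the arg() call is authorized):

  // If this returns a row, the arg() cross-service call is working
  arg("").Resources
  | take 1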

Roadblock 2: mv-expand Limits

To extract each agent pool from the cluster properties, I used mv-expand. But then I hit this error:

Azure Resource Graph: At least one mvexpand operator has a row limit of 2147483647, which exceeds the max limit of 2000.

Even though my cluster only had a few pools, the query still failed: when you don't specify a limit, mv-expand defaults to an effectively unbounded row cap (2147483647), and Azure Resource Graph rejects anything above 2000. The fix was simple: add an explicit limit to the expansion.

Here’s the pattern I used:
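
Roughly (a sketch; agentPoolProfiles is the property path on the managed cluster resource, and 50 is just a generous cap):

  arg("").Resources
  | where type =~ 'microsoft.containerservice/managedclusters'
  // Cap the expansion explicitly so ARG accepts the query
  | mv-expand pool = properties.agentPoolProfiles limit 50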

This tells the engine to expand only up to 50 pools, which is more than enough for a typical AKS cluster.

Roadblock 3: Empty nodePoolName Values

Next, I had trouble joining the ARG data with the live node data. Specifically, I couldn’t get a valid node pool name from KubeNodeInventory.

Originally, I tried:
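
Something along these lines (a sketch of the shape that fails):

  KubeNodeInventory
  // Treats Labels as a flat object; returns empty because Labels is actually an array
  | extend nodePoolName = tostring(parse_json(Labels).agentpool)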

But this returned blank values. The problem? Labels was an array, not a flat object.

Here’s the fix that worked:
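
Along these lines (a sketch; agentpool is the standard AKS node label key):

  KubeNodeInventory
  // Labels is a JSON array with a single object holding the node labels
  | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))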

This accesses the first object in the array, where the agentpool label lives. Adding tolower() ensures consistent casing when joining with ARG data.

Roadblock 4: Wildly Inaccurate Node Counts

Even after everything looked good, the counts were way off. Some node pools showed 200%+ usage, which is clearly wrong.

The mistake? I used count() to tally the nodes:
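
Something like this (a sketch; the 10-minute lookback is an assumption):

  KubeNodeInventory
  | where TimeGenerated > ago(10m)
  | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
  // count() tallies every inventory row, not every node
  | summarize nodeCount = count() by nodePoolName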

count() tallied all rows, including duplicates (e.g., multiple status updates per node). Plus, I wasn’t filtering for Ready nodes. I switched to:
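
Roughly (same sketch, corrected):

  KubeNodeInventory
  | where TimeGenerated > ago(10m)
  | where Status == 'Ready'
  | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
  // dcount(Computer) counts each node once, no matter how many rows it logged
  | summarize readyNodes = dcount(Computer) by nodePoolName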

dcount(Computer) counts unique nodes, and Status == 'Ready' ensures only healthy nodes are included. This brought counts back to reality (e.g., 3 nodes for systempool).

The Final Solution

Once all the pieces were working, I ended up with two queries: one for monitoring and one for alerting.

Monitoring Query

This query shows all auto-scaling node pools in your cluster and their current usage, so you can track everything in one view.
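
A sketch of what that query can look like (the cluster name and the 10-minute lookback are placeholders to adjust for your environment, and it assumes Container insights is populating KubeNodeInventory):

  let clusterName = 'my-aks-cluster';   // placeholder: your AKS cluster name
  KubeNodeInventory
  | where TimeGenerated > ago(10m)
  | where Status == 'Ready'
  | extend nodePoolName = tolower(tostring(parse_json(Labels)[0].agentpool))
  | summarize readyNodes = dcount(Computer) by nodePoolName
  | join kind=inner (
      arg("").Resources
      | where type =~ 'microsoft.containerservice/managedclusters'
      | where name == clusterName
      | mv-expand pool = properties.agentPoolProfiles limit 50
      | where tobool(pool.enableAutoScaling) == true
      | project nodePoolName = tolower(tostring(pool.name)), maxCount = toint(pool.maxCount)
    ) on nodePoolName
  | extend usagePercent = round(100.0 * readyNodes / maxCount, 1)
  | project nodePoolName, readyNodes, maxCount, usagePercent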

This gives a full overview of your cluster’s node pool usage.

Alert Query

This version filters the result to only include node pools that are at or over their maxCount. Use this one to power your alert rule.
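
Relative to the monitoring sketch above, it only needs one extra filter after the join, so only full (or over-full) pools come back:

  | where readyNodes >= maxCount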

Setting Up the Alert

Once you have the alert query working, setting up the actual alert in Azure Monitor is straightforward.

  1. Go to your Log Analytics workspace
  2. Paste in the alert query and run it
  3. Click “New alert rule”
  4. For the condition, use a custom log search
  5. Set it to trigger when results are greater than 0
  6. Set frequency and evaluation to 5 minutes
  7. Link to an action group with your preferred notifications
  8. Name your rule and save it

Don’t forget to assign the correct permissions (Reader + Log Analytics Reader) to the alert’s managed identity.

Why This Setup Is Worth It

With this solution in place, you’re getting:

  • Real-time alerts based on live config and current usage
  • Accurate node counting by filtering only Ready nodes
  • Dynamic monitoring across all node pools, not just hardcoded ones
  • A clean, reusable query setup you can tweak anytime

Tips and Ideas

  • Want to monitor multiple clusters? Remove the name == clusterName filter
  • Want to track trends? Use the monitoring query in a workbook or dashboard
  • Want to route alerts to Teams or Slack? Add a webhook to your action group

Wrapping Up

This was a fun and educational deep dive into the power of combining Azure Resource Graph with KQL. If you’re running AKS and care about autoscaling, this alert setup gives you visibility and protection when things hit their limits.

Let me know if you give it a try or if you run into similar roadblocks. I’d love to hear how others are solving this challenge in their own clusters.

Happy scaling!


Pixel Robots.

I’m Richard Hooper, aka Pixel Robots. I started this blog in 2016 for a couple of reasons. The first was basically just to have a place to store my step-by-step guides, troubleshooting guides, and plain ideas about being a sysadmin. The second was to share what I have learned and found out with other people like me. Hopefully, you can find something useful on the site.
