
I picked this up from my AKS Docs Tracker when the docs page for Cluster Health Monitor appeared. It is currently in preview, but it is one of the more operationally interesting things Microsoft has quietly added to AKS in a while.

This is not a dashboard feature. It is an AKS-managed add-on that runs inside your cluster, actively tests whether your core components are working, and can automatically remediate a stuck CoreDNS pod before it becomes a full DNS outage. That last bit especially caught my attention.

Why this matters

DNS failures are one of the most common ways an AKS cluster starts quietly falling apart. A single stuck CoreDNS pod drags down resolution rates, workloads start timing out, and depending on your alerting setup you might not catch it until it becomes obvious. The same applies to API server connectivity issues and a degraded metrics server. These are things that are broken but not immediately visible in your standard dashboards.

Cluster Health Monitor addresses this by running continuous in-cluster checks against these exact failure modes and exposing the results as Prometheus metrics you can alert on directly. It is not a replacement for Azure Monitor, Container Insights, or managed Prometheus. It is a targeted health-checking layer for the parts of AKS that matter most to day-two operations.

What Cluster Health Monitor actually does

Cluster Health Monitor is an AKS-managed add-on deployed in the kube-system namespace. It runs periodic checks and evaluates three signals:

  • DNS resolution success rate across CoreDNS, per-pod DNS, and LocalDNS paths
  • API server connectivity using synthetic create, get, and delete operations from within the cluster
  • Metrics server availability by querying the metrics-server API

Results are exposed as Prometheus metrics on port 9800. Notably, this does not require Azure Monitor managed Prometheus: you can scrape the metrics with a self-managed stack, Grafana Agent, or managed Prometheus if you already have it. A lot of monitoring features get skipped because they bundle in a dependency you are not ready for. This one does not have that problem.
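If you want to eyeball the output before wiring up a scraper, you can port-forward to the add-on pod and fetch the endpoint directly. A minimal sketch, where the pod name is a placeholder and the conventional /metrics path is an assumption rather than something confirmed by the docs:

```shell
# Find the add-on pod first (the name pattern will vary):
#   kubectl get pods -n kube-system

# Forward the metrics port locally (placeholder pod name)
kubectl port-forward -n kube-system pod/<cluster-health-monitor-pod> 9800:9800 &

# Fetch the raw Prometheus metrics; /metrics is the standard
# exposition path, assumed here
curl -s http://localhost:9800/metrics
```

This requires a live cluster and kubectl access, so treat it as a manual spot check rather than something to automate.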

The API server check is worth calling out. It does not just ping the endpoint. It runs synthetic create, get, and delete operations, which is a real end-to-end test of the in-cluster control plane path. You can have a technically healthy API server that still fails requests from inside the cluster due to a networking or RBAC issue. This catches that.

Each check returns one of three statuses:

  • Healthy: component is operating normally
  • Unhealthy: check failed; the metric label includes an error code
  • Unknown: inconclusive, for example during pod startup or if a dependency is unavailable

When a check reports Unhealthy, the error code tells you exactly what failed: APIServerCreateTimeout, CoreDNSServiceTimeout, MetricsServerUnavailable, and so on. That makes triage faster than a generic alert with no context.
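Since the error codes land in metric labels, a quick way to see whether anything is currently failing is to grep the scraped output for them. A sketch, assuming the metrics endpoint is reachable locally (for example via kubectl port-forward) and using the codes named above:

```shell
# Surface only the failure codes called out in the documentation;
# assumes the add-on's metrics port is forwarded to localhost:9800
curl -s http://localhost:9800/metrics \
  | grep -E 'APIServerCreateTimeout|CoreDNSServiceTimeout|MetricsServerUnavailable'
```

An empty result here is the good outcome: none of those failure codes are present in the current scrape.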

It is also worth being clear that this is not the same as the AKS control plane metrics (preview) feature in Azure Monitor. That feature scrapes control plane telemetry like etcd and kube-scheduler into managed Prometheus. Cluster Health Monitor is an in-cluster component testing whether core services are actually responding. They solve different problems and work well together.

CoreDNS auto-remediation

This is the bit that makes it different from a standard health exporter.

If a CoreDNS pod is continuously unhealthy for five minutes, Cluster Health Monitor will delete it automatically. Kubernetes recreates it. The conditions are deliberately tight. Exactly one pod must be unhealthy, at least one other CoreDNS pod must be healthy and serving traffic, and no remediation can have fired in the last hour.

That is a sensible design. It will not cascade during a genuine incident. The behaviour I want to validate in a dev cluster is whether the per-pod health checks are reliable enough that it never incorrectly flags a healthy pod. The conditions look tight enough, but that is exactly the kind of edge case worth testing before you depend on it.

How to enable it

You need Azure CLI 2.73.0 or later and aks-preview extension 19.0.0b25 or later. Install or update the extension:
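The commands themselves are the standard Azure CLI extension workflow:

```shell
# Install the aks-preview extension, or update it if already installed
az extension add --name aks-preview
az extension update --name aks-preview
```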

You can check the installed version with az extension show --name aks-preview --query version.

New cluster

Pass the flag at cluster creation time:
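Only the disable flag is named explicitly later in this post, so the enable flag below is inferred as its mirror image; resource group and cluster name are placeholders:

```shell
# Sketch: create a cluster with Cluster Health Monitor enabled.
# The flag name mirrors the documented disable flag and should be
# verified against the current AKS docs.
az aks create \
  --resource-group myResourceGroup \
  --name myCluster \
  --enable-continuous-control-plane-and-addon-monitor
```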

Existing cluster

If you already have a cluster running, an update is all you need:
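Same caveat on the flag name, inferred from the disable command shown later:

```shell
# Sketch: enable the add-on on an existing cluster, no recreate needed
az aks update \
  --resource-group myResourceGroup \
  --name myCluster \
  --enable-continuous-control-plane-and-addon-monitor
```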

AKS will deploy the add-on to kube-system without any other changes to your cluster. Verify it is running:
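A quick check, with the pod name filter as an assumption since the docs page lists the exact name:

```shell
# Look for the add-on pods in kube-system; the name filter is a guess,
# so drop the grep and inspect the full list if nothing matches
kubectl get pods -n kube-system | grep -i 'cluster-health'
```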

Once it is up, metrics are available on port 9800 and ready to scrape with any Prometheus-compatible setup.

To remove it later: az aks update -g myResourceGroup -n myCluster --disable-continuous-control-plane-and-addon-monitor.

Would I use this in production?

Not yet as a primary dependency, but I would be running it in dev and staging today.

It is in preview, which means best-effort support and no SLA. Do not enable it on your most sensitive clusters until you have spent some time watching what it produces in a lower environment.

Enable it, connect the metrics to your Prometheus setup, and spend some time with the error codes. The ones worth building alerts around first are DNS timeouts and API server connectivity failures. Those are the failure modes that sneak up on you.

If you are still getting your monitoring foundations in place, do that first. Managed Prometheus, Grafana, sensible alerting on node and pod health. Once that is solid, Cluster Health Monitor slots in easily and adds signal that your existing setup probably does not give you.

The teams that will get the most from this are platform engineering and SRE teams running multiple clusters, especially anyone who has dealt with a slow-burn CoreDNS problem that took too long to pin down. If that has never happened to you yet, it will.

Cluster Health Monitor will not get the same attention as new node pool options or shiny networking releases. But for day-two AKS operations this is exactly the kind of quiet improvement that prevents bad days. The docs are at learn.microsoft.com/azure/aks/cluster-health-monitor.


Pixel Robots.

I’m Richard Hooper aka Pixel Robots. I started this blog in 2016 for a couple of reasons. The first was simply to have a place to store my step-by-step guides, troubleshooting guides and plain ideas about being a sysadmin. The second was to share what I have learned with other people like me. Hopefully, you can find something useful on the site.
