
I spotted this one via the AKS Docs Tracker I built and run here on pixelrobots.co.uk. It picked up new docs appearing under the AKS section and this one caught my eye immediately. Microsoft has shipped a public preview of something called Container Network Insight Agent, and it is genuinely interesting enough to stop and write about.

Put simply, it is an AI-powered network diagnostics assistant that runs as a pod inside your AKS cluster. You describe a networking problem in plain English, and it collects evidence from your cluster using kubectl, cilium, and hubble, and returns a structured report with root cause analysis and actual remediation commands. Not suggestions. Commands you can copy and run.

AKS networking issues are some of the most painful to debug. They can span from Kubernetes network policies all the way down to kernel ring buffers and NIC-level drops, across multiple tools and layers. Anything that brings that into a single interface is worth paying attention to.

What it actually does

Container Network Insight Agent covers three main diagnostic categories.

DNS troubleshooting checks CoreDNS pod health, endpoint registration, configuration (including custom ConfigMaps), NodeLocal DNS status, and network policies that might be silently blocking DNS traffic. It also evaluates Cilium FQDN egress restrictions, which can cause hard-to-trace external resolution failures.

Packet drop analysis is the one that stood out to me. It deploys a lightweight debug DaemonSet to each node to collect host-level network statistics and uses delta measurements to separate active drops from historical counters. It checks NIC ring buffer utilisation via ethtool, kernel softnet stats from /proc/net/softnet_stat, per-CPU SoftIRQ distribution, socket buffer saturation, and CNI-specific interface data. This is the kind of thing that normally requires SSH-ing into nodes and running commands manually one at a time.

Kubernetes networking diagnostics covers pod-to-service connectivity, service endpoint registration, network policy conflicts (both Kubernetes NetworkPolicy and CiliumNetworkPolicy), and Hubble flow analysis. A common catch here is a targetPort mismatch between the service and pod. This produces connection timeouts even when endpoints look healthy, and the agent flags it directly.

Each diagnostic returns a structured report with an evidence table, root cause identification, and specific remediation commands. The agent itself does not make any changes to your cluster. It is advisory only, with read-only cluster access.

Worth flagging upfront is that this is a cloud-only AKS feature. It does not work on AKS hybrid, AKS on Azure Stack HCI, or Arc-enabled Kubernetes clusters.

What you need before you start

This setup has more moving parts than the average AKS preview feature, so it is worth laying out the requirements clearly before jumping in.

Tools required: Azure CLI 2.77.0 or later, kubectl, jq, git, and the k8s-extension Azure CLI extension. The jq requirement matters here because the install scripts are Bash-based. This is not a PowerShell-friendly setup. Everything in this walkthrough runs in Bash.

Azure permissions: Contributor and User Access Administrator on the target resource group, plus access to create Azure OpenAI resources and Entra ID App Registrations.

Cluster requirements: workload identity and OIDC issuer enabled, a supported Kubernetes version, and minimum node size of Standard_D4_v3 with three nodes recommended. Your cluster also needs outbound HTTPS access to your Azure OpenAI endpoint on port 443, and must be able to pull images from acnpublic.azurecr.io.

Supported regions: centralus, eastus, eastus2, uksouth, westus2. If your cluster is not in one of these, you are blocked for now.

Azure OpenAI: you need a deployed model. GPT-4o or later is recommended. If you do not already have one, the walkthrough below covers creating it.

The agent works on clusters without Cilium or ACNS, but with reduced capabilities. Without Advanced Container Networking Services (ACNS) you lose Hubble flow analysis and Cilium policy diagnostics, but DNS, packet drop, and standard Kubernetes networking diagnostics all still work. If you are running Azure CNI powered by Cilium with ACNS enabled, you get the full feature set including Hubble flow observation.

Setting it up

The setup is eight steps. I will walk through each one. All commands are Bash.

Step 1: Set variables and create the resource group

Set your environment variables first. Keeping them all in one place makes the subsequent steps much cleaner.
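Something like the following works as a starting point. Every name here is a placeholder of my own (the cluster name matches the one used later in this walkthrough); swap them for whatever you use, and keep the region to one of the supported five.

```shell
# Placeholder names - adjust to your environment.
export RESOURCE_GROUP="rg-pixelrobots-cna"
export LOCATION="uksouth"                       # must be a supported region
export CLUSTER_NAME="aks-pixelrobots-cna"
export OPENAI_SERVICE_NAME="oai-pixelrobots-cna"
export OPENAI_DEPLOYMENT_NAME="gpt-4o"
export IDENTITY_NAME="id-cna-agent"

# Create the resource group.
az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
```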

Once the resource group is created you should see a JSON response confirming the provisioningState is Succeeded.

Step 2: Create the AKS cluster

If you already have a cluster, the key requirements are workload identity and OIDC issuer enabled. If you need to create one, the following gives you Cilium dataplane and ACNS for full diagnostic capabilities.
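A sketch of the create command, using the standard az aks flags for the Cilium dataplane, ACNS, workload identity, and OIDC issuer. Node count and VM size follow the recommendations from the prerequisites.

```shell
az aks create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --location "$LOCATION" \
  --node-count 3 \
  --node-vm-size Standard_D4_v3 \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --network-dataplane cilium \
  --enable-acns \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --generate-ssh-keys
```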

This takes a few minutes. Once it finishes, you will see the cluster details returned as JSON.

If you have an existing cluster without workload identity and OIDC issuer, enable them with:
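```shell
az aks update \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --output none
```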

The update will complete with no output on success.

Then pull down the credentials so kubectl can talk to the cluster:
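```shell
az aks get-credentials \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME"
```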

You should see Merged "aks-pixelrobots-cna" as current context confirming the kubeconfig is ready.

Step 3: Create an Azure OpenAI resource and deploy a model

Skip this step if you already have an Azure OpenAI resource with a deployed model. Just export OPENAI_SERVICE_NAME, OPENAI_DEPLOYMENT_NAME, and retrieve the endpoint:
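Along these lines, assuming your existing resource lives in the same resource group (adjust if not):

```shell
export OPENAI_SERVICE_NAME="<your-openai-resource>"
export OPENAI_DEPLOYMENT_NAME="<your-model-deployment>"

# Retrieve the endpoint from the existing resource.
export AZURE_OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name "$OPENAI_SERVICE_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --query "properties.endpoint" -o tsv)

echo "$AZURE_OPENAI_ENDPOINT"
```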

Once AZURE_OPENAI_ENDPOINT is set, skip ahead to Step 4.

If you need to create one from scratch:
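Create the account first. The custom domain is required for Entra ID authentication to the endpoint; here it just reuses the service name.

```shell
az cognitiveservices account create \
  --name "$OPENAI_SERVICE_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION" \
  --kind OpenAI \
  --sku S0 \
  --custom-domain "$OPENAI_SERVICE_NAME" \
  --no-wait
```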

The command returns immediately but the resource is not yet ready. Wait for provisioning, then deploy the model. The loop below polls every 10 seconds rather than using a fixed sleep.
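Roughly like this. The model version and capacity are assumptions; pick whatever gpt-4o version and quota are available in your subscription.

```shell
# Poll every 10 seconds until the account reports Succeeded.
while [ "$(az cognitiveservices account show \
    --name "$OPENAI_SERVICE_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --query "properties.provisioningState" -o tsv)" != "Succeeded" ]; do
  echo "Waiting for Azure OpenAI provisioning..."
  sleep 10
done

# Deploy the model (version and capacity are assumptions).
az cognitiveservices account deployment create \
  --name "$OPENAI_SERVICE_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --deployment-name "$OPENAI_DEPLOYMENT_NAME" \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-capacity 10 \
  --sku-name Standard

# Grab the endpoint for the extension install later.
export AZURE_OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name "$OPENAI_SERVICE_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --query "properties.endpoint" -o tsv)
```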

With AZURE_OPENAI_ENDPOINT now set, you have everything you need from the OpenAI side. Move on to the managed identity.

Steps 4 and 5: Create a managed identity and assign roles

The agent authenticates to Azure services using a user-assigned managed identity with workload identity federation.
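Create the identity and capture both IDs:

```shell
az identity create \
  --name "$IDENTITY_NAME" \
  --resource-group "$RESOURCE_GROUP"

export IDENTITY_CLIENT_ID=$(az identity show \
  --name "$IDENTITY_NAME" --resource-group "$RESOURCE_GROUP" \
  --query clientId -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show \
  --name "$IDENTITY_NAME" --resource-group "$RESOURCE_GROUP" \
  --query principalId -o tsv)

echo "$IDENTITY_CLIENT_ID"
echo "$IDENTITY_PRINCIPAL_ID"
```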

You should see both IDENTITY_CLIENT_ID and IDENTITY_PRINCIPAL_ID printed. You will need both in the next commands.

The identity needs four roles: Cognitive Services OpenAI User on the OpenAI resource, Azure Kubernetes Service Cluster User Role and Azure Kubernetes Service Contributor Role on the cluster, and Reader on the resource group.
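One assignment per role, each scoped as described above:

```shell
# Resolve the scopes for each assignment.
OPENAI_ID=$(az cognitiveservices account show \
  --name "$OPENAI_SERVICE_NAME" --resource-group "$RESOURCE_GROUP" --query id -o tsv)
CLUSTER_ID=$(az aks show \
  --name "$CLUSTER_NAME" --resource-group "$RESOURCE_GROUP" --query id -o tsv)
RG_ID=$(az group show --name "$RESOURCE_GROUP" --query id -o tsv)

az role assignment create --assignee "$IDENTITY_PRINCIPAL_ID" \
  --role "Cognitive Services OpenAI User" --scope "$OPENAI_ID"
az role assignment create --assignee "$IDENTITY_PRINCIPAL_ID" \
  --role "Azure Kubernetes Service Cluster User Role" --scope "$CLUSTER_ID"
az role assignment create --assignee "$IDENTITY_PRINCIPAL_ID" \
  --role "Azure Kubernetes Service Contributor Role" --scope "$CLUSTER_ID"
az role assignment create --assignee "$IDENTITY_PRINCIPAL_ID" \
  --role "Reader" --scope "$RG_ID"
```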

Role assignments can take up to 10 minutes to propagate. If you see 401 or 403 errors from the agent pod shortly after install, wait a few minutes and restart the pod before debugging further.

Step 6: Configure federated credentials

Link the managed identity to the Kubernetes service account the agent uses:
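The service account namespace and name below are assumptions; use the exact values from the official docs for your extension version.

```shell
export AKS_OIDC_ISSUER=$(az aks show \
  --name "$CLUSTER_NAME" --resource-group "$RESOURCE_GROUP" \
  --query oidcIssuerProfile.issuerUrl -o tsv)

# Subject format is system:serviceaccount:<namespace>:<name>.
# The namespace/name here are assumptions - check the docs.
az identity federated-credential create \
  --name "cna-agent-federated" \
  --identity-name "$IDENTITY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject "system:serviceaccount:kube-system:container-network-insight-agent" \
  --audience "api://AzureADTokenExchange"
```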

This returns the federated credential resource as JSON with a name and issuer field. Confirm the issuer URL matches your cluster’s OIDC endpoint before moving on.

Step 7: Create an App Registration for Entra ID authentication

This is required for production deployments. For development and testing only you can use simple username login, but for anything resembling a real environment you want Entra ID.
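The display name is a placeholder; the redirect URI matches the localhost-only requirement covered below.

```shell
export APP_CLIENT_ID=$(az ad app create \
  --display-name "cna-agent-auth" \
  --web-redirect-uris "http://localhost:8080/auth/callback" \
  --query appId -o tsv)
export APP_TENANT_ID=$(az account show --query tenantId -o tsv)

echo "$APP_CLIENT_ID"
echo "$APP_TENANT_ID"
```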

Both values should be printed. Keep them handy as you will need them in the extension install step.

If your tenant requires a SERVICE_MANAGEMENT_REFERENCE, add --service-management-reference $SERVICE_MANAGEMENT_REFERENCE to the az ad app create command above.

Add the required Microsoft Graph delegated permissions (openid, profile, User.Read, offline_access):
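These are the well-known Microsoft Graph delegated permission IDs, in the same order as listed above:

```shell
GRAPH_APP_ID="00000003-0000-0000-c000-000000000000"   # Microsoft Graph

# openid, profile, User.Read, offline_access (delegated scopes).
az ad app permission add --id "$APP_CLIENT_ID" --api "$GRAPH_APP_ID" \
  --api-permissions \
    37f7f235-527c-4136-accd-4a02d197296e=Scope \
    14dad69e-099b-42c9-810b-d002981feec1=Scope \
    e1fe6dd8-ba31-4d61-89e7-88639da4683d=Scope \
    7427e0e9-2fba-42fe-b0c0-848c9e6a8182=Scope
```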

This produces no output on success. If your tenant requires admin consent, run az ad app permission admin-consent --id $APP_CLIENT_ID before continuing.

Then add a federated credential on the App Registration itself. This is separate from the managed identity federated credential in step 6, and links the app registration to the same Kubernetes service account:
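As before, the service account namespace and name in the subject are assumptions; they must match the ones the extension actually uses.

```shell
AKS_OIDC_ISSUER=$(az aks show \
  --name "$CLUSTER_NAME" --resource-group "$RESOURCE_GROUP" \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az ad app federated-credential create --id "$APP_CLIENT_ID" --parameters '{
  "name": "cna-agent-app-federated",
  "issuer": "'"$AKS_OIDC_ISSUER"'",
  "subject": "system:serviceaccount:kube-system:container-network-insight-agent",
  "audiences": ["api://AzureADTokenExchange"]
}'
```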

Note that only localhost redirect URIs are currently supported. Public LoadBalancer URLs are not supported for redirect URIs, which means you access the interface via port-forward even in a more permanent setup.

Step 8: Install the extension

This is where the ACNS and non-ACNS paths diverge.

For clusters with ACNS and Cilium dataplane (full capabilities):
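The shape is roughly as follows. The extension type and configuration keys are illustrative placeholders (hubble.enabled is the one key confirmed in the troubleshooting table later); take the exact names from the official docs.

```shell
# Extension type and configuration keys are placeholders - check the docs.
az k8s-extension create \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --cluster-type managedClusters \
  --name container-network-insight \
  --extension-type Microsoft.ContainerNetworkInsight \
  --configuration-settings \
    azureOpenAI.endpoint="$AZURE_OPENAI_ENDPOINT" \
    azureOpenAI.deploymentName="$OPENAI_DEPLOYMENT_NAME" \
    workloadIdentity.clientId="$IDENTITY_CLIENT_ID" \
    auth.tenantId="$APP_TENANT_ID" \
    auth.clientId="$APP_CLIENT_ID" \
    hubble.enabled=true
```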

The command returns immediately and the extension provisions in the background. Skip the next command and go straight to verifying the extension state.

For clusters without ACNS (Hubble disabled):
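Same shape, with Hubble switched off (again, only hubble.enabled is confirmed by the docs; the other keys are placeholders):

```shell
az k8s-extension create \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --cluster-type managedClusters \
  --name container-network-insight \
  --extension-type Microsoft.ContainerNetworkInsight \
  --configuration-settings \
    azureOpenAI.endpoint="$AZURE_OPENAI_ENDPOINT" \
    azureOpenAI.deploymentName="$OPENAI_DEPLOYMENT_NAME" \
    workloadIdentity.clientId="$IDENTITY_CLIENT_ID" \
    auth.tenantId="$APP_TENANT_ID" \
    auth.clientId="$APP_CLIENT_ID" \
    hubble.enabled=false
```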

As with the ACNS variant, the extension provisions in the background.

Once it has had a few minutes, verify the extension installed correctly:
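```shell
az k8s-extension show \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --cluster-type managedClusters \
  --name container-network-insight \
  --query provisioningState -o tsv
```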

The provisioningState should return Succeeded. You can also verify the Kubernetes service account was created with the right annotation:
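The service account name and namespace here are assumptions; match whatever the extension actually created.

```shell
# Prints the workload identity annotation on the agent's service account.
kubectl get serviceaccount container-network-insight-agent \
  -n kube-system \
  -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}'
```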

Check that azure.workload.identity/client-id matches your managed identity client ID.

Accessing it

Once installed, port-forward to reach the interface:
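The service name and namespace below are assumptions; point the port-forward at whatever service the extension deployed in your cluster.

```shell
kubectl port-forward -n kube-system \
  svc/container-network-insight-agent 8080:8080
```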

Open http://localhost:8080 in your browser. You’ll be prompted to sign in with either your simple username (if you configured development mode) or Microsoft Entra ID credentials.

Once in, type your networking questions in plain English. The agent classifies the issue, collects evidence from the cluster, and returns a structured report. A few things worth knowing about the interface. Sessions time out after 30 minutes of inactivity and have an absolute 8-hour limit. Chat history lives in memory only and is lost if the pod restarts. At around 15 exchanges, the agent starts summarising older messages, so for very long sessions you may notice it losing context on earlier findings.

Things to watch for

The packet drop diagnostic deploys a debug DaemonSet called rx-troubleshooting-debug to the kube-system namespace. It needs hostNetwork, hostPID, hostIPC, and NET_ADMIN capabilities to collect host-level network stats. The DaemonSet is shared across sessions and cleaned up automatically, but if the agent pod crashes during a diagnostic it can be left behind. Clean it up manually if needed:
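```shell
kubectl delete daemonset rx-troubleshooting-debug -n kube-system
```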

On larger clusters, packet drop diagnostics become slower and more resource-intensive. The docs recommend limiting concurrent users to three on a 25-node cluster and to one on a 50-node cluster. For most teams this is fine, but something to be aware of if you are planning on making this available to multiple engineers simultaneously.

If your extension install fails, the most common causes are an unsupported region, workload identity or OIDC issuer not enabled, or insufficient permissions. Running az k8s-extension create a second time on an already-installed extension will error. Use az k8s-extension update to change configuration settings.

When debugging startup issues, pod logs tell you most of what you need to know:
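The label selector here is an assumption; match the labels on the agent pod in your cluster.

```shell
kubectl logs -n kube-system \
  -l app=container-network-insight-agent --tail=100
```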

Look for any lines referencing Missing required, 401, 403, or bootstrap.validation_agent_failed as these point to the most common setup problems.

Quick health check once the pod is running:
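Assuming a port-forward to the agent service on 8080 is running in another terminal:

```shell
# Prints the HTTP status code from the readiness endpoint.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/ready
```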

The /ready endpoint returning HTTP 200 means at least one pre-warmed agent is initialised and ready to handle requests. The first query after a pod restart typically takes 10 to 30 seconds as the warmup pool initialises. Subsequent queries are faster.

For common error patterns, the table below covers the main ones:

| Error | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized / 403 Forbidden | Missing RBAC role or workload identity misconfiguration | Check roles with az role assignment list --assignee <principal-id> --all -o table |
| RuntimeError: Missing required Azure OpenAI environment variable(s) | ConfigMap has placeholder values | Run az k8s-extension update with the correct settings |
| FailedMount / volume mount error | Missing Hubble certificate secrets | Deploy with hubble.enabled=false or ensure ACNS is enabled |
| Pod stuck in CrashLoopBackOff | Azure OpenAI unreachable or misconfigured | Check the endpoint URL, deployment name, and outbound HTTPS on port 443 |
| redirect_uri mismatch on Entra login | Redirect URI not set to http://localhost:8080/auth/callback | Update the App Registration in the Azure portal |

My thoughts

Honestly, I think the concept here is really strong. AKS networking issues are some of the most time-consuming problems to debug, precisely because they do not live in one place. You bounce between kubectl, cilium, node SSH sessions, Azure Monitor, and your own institutional knowledge hoping something connects. The idea of describing the problem in plain English and getting a structured, evidence-backed report with actual commands to run is the right direction.

What impresses me most is the packet drop diagnostic. Getting host-level network stats like NIC ring buffer state and kernel softnet counters out of an AI-driven chat interface, with delta measurements to distinguish active drops from historical noise, is genuinely non-trivial. That is work that used to take an experienced engineer a fair chunk of an afternoon.

The current limitations are real but predictable for a public preview. Port-forward only access, in-memory session state, localhost redirect URIs, and a short supported region list are the kind of rough edges that tend to get addressed before GA. The Azure OpenAI dependency adds cost and some infra overhead, which is worth factoring in if you are thinking about this for a team.

This feature is also a really compelling reason to look seriously at Advanced Container Networking Services if you have not already. ACNS brings you Hubble flow analysis, Cilium policy diagnostics, FQDN-based egress filtering, Layer 7 network policy, deep DNS observability, and now Container Network Insight Agent on top of all of that. The cost is around $0.025 per node per hour. On a three-node cluster that is around $54 a month. On a ten-node cluster you are looking at roughly $180 a month. For most teams running AKS in production that is a reasonable price for the observability and security capabilities you get, and Container Network Insight Agent makes the value easier to demonstrate by turning that data into actionable diagnostics you can actually use when something breaks at 2am.

If you are running ACNS today, I would set this up in a non-production cluster now and get familiar with it. The full diagnostic set including Hubble flow analysis and Cilium policy diagnostics makes it significantly more capable than the baseline. If you are not running ACNS, it is still worth a look for the DNS and packet drop coverage alone, and it might just be the nudge you needed to enable ACNS properly.

Give it a try and let me know what you think in the comments. And if you want to stay on top of new AKS features like this as they land in the docs, the AKS Docs Tracker is how I caught this one.


Pixel Robots.

I’m Richard Hooper aka Pixel Robots. I started this blog in 2016 for a couple of reasons. The first was simply to have a place to store my step-by-step guides, troubleshooting guides, and plain ideas about being a sysadmin. The second was to share what I have learned with other people like me. Hopefully, you can find something useful on the site.
