As someone who manages an AKS cluster, you really need to know how to troubleshoot it. Luckily, Azure has your back with AKS Diagnostics. This tool helps you identify and resolve issues with your cluster. It comes built into your AKS cluster, is accessed via the Azure portal, and needs no extra configuration or cost. How cool is that!
How to access AKS Diagnostics
In the Azure portal, navigate to the AKS cluster that you would like to troubleshoot. Then click on Diagnose and solve problems in the left-hand menu (near the top).
Here you will see one option, Cluster insights. Click it.
In this new blade, you will see some checks running. Once they have finished, you should see the state of your cluster at a glance. If there are no issues, everything will be green.
Oh no, I see some red
If any of the checks have failed, they will be shown under the Observations section of the report.
You can click on the More info button to drill down into the issue and, sometimes, find a recommended action to resolve it.
In this example, one of the nodes in the AKS cluster is powered down. The Recommended Action tells me to restart the node so it can rejoin the AKS cluster. It would have been nice to have a button on this blade to restart the VM, rather than having to go and find it in the portal and restart it manually.
Once you have restarted the node and waited around 15 minutes, rerun the diagnostics and you should see that the issue has been resolved.
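If you would rather confirm from the command line that the node has rejoined, one option is to inspect the output of `kubectl get nodes -o json`. The following is a minimal Python sketch, not something AKS Diagnostics provides: the function name `not_ready_nodes` and the sample node names are hypothetical, and it assumes the standard Kubernetes node list shape, where each node carries a `Ready` condition under `status.conditions`.

```python
import json
from typing import Any


def not_ready_nodes(node_list: dict[str, Any]) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Assumption of this sketch: a node with no Ready condition counts as not ready.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample, trimmed to the fields the function reads.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "aks-nodepool1-example-0"},
      "status": {"conditions": [{"type": "Ready", "status": "True"}]}
    },
    {
      "metadata": {"name": "aks-nodepool1-example-1"},
      "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}
    }
  ]
}
""")

print(not_ready_nodes(SAMPLE))
```

To try it against a real cluster, you could save the `kubectl` output to a file and load it with `json.load` in place of the sample. An empty list suggests every node is reporting Ready again.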
So what checks are performed?
Cluster insights groups its checks into three categories.
Cluster Node Issues
This checks for node-related issues that can cause your cluster to misbehave.
- Node readiness issues
- Node failures
- Insufficient resources
- Node missing IP configuration
- Node CNI failures
- Node not found
- Node power off
- Node authentication failure
- Node kube-proxy stale
Create, read, update & delete operations
This checks for any CRUD operations that may cause your cluster to have issues.
- In-use subnet delete operation error
- Network security group delete operation error
- In-use route table delete operation error
- Referenced resource provisioning error
- Public IP address delete operation error
- Deployment failure due to deployment quota
- Operation error due to organization policy
- Missing subscription registration
- VM extension provisioning error
- Subnet capacity
- Quota exceeded error
Identity and security management
This checks for any authentication or authorisation errors that may cause communication issues with your cluster.
- Node authorization failures
- 401 errors
- 403 errors
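The two status codes in this list point at different kinds of problem: a 401 generally means the caller's credentials were missing or not accepted (authentication), while a 403 means the caller was identified but is not allowed to perform the request (authorisation). Here is a minimal Python sketch of that distinction. The function name and the returned labels are hypothetical and are not part of AKS Diagnostics.

```python
from http import HTTPStatus


def classify_auth_error(status_code: int) -> str:
    """Map an HTTP status code to the kind of access problem it indicates."""
    if status_code == HTTPStatus.UNAUTHORIZED:  # 401: credentials missing or rejected
        return "authentication"
    if status_code == HTTPStatus.FORBIDDEN:  # 403: identity known, action not permitted
        return "authorization"
    return "other"


for code in (401, 403, 500):
    print(code, classify_auth_error(code))
```

Knowing which of the two you are looking at narrows the search: the first points towards credentials or tokens, the second towards role assignments and permissions.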
This cool free feature from Microsoft is a good starting point for troubleshooting your AKS cluster. Hopefully we will see it grow with more checks and maybe even a built-in alert feature.
If you need to go deeper into your AKS troubleshooting or collect logs, Microsoft has created a tool called AKS Periscope, which I will cover in another blog post. So keep an eye out for that.
I hope you found this article helpful. If you have any questions, please reach out in the usual ways.