Sometimes you have an application that needs to scale fast, so fast that you can't wait for a new Kubernetes node to spin up before your pods can be scheduled. Azure Kubernetes Service has a cluster autoscaler which can be enabled when the cluster is built or added afterwards. This is awesome: it automatically adds a new node when pods can't be scheduled because there isn't enough CPU or memory. The catch is that new nodes take time to spin up, sometimes up to 15 minutes, which is no good when you have customers waiting.
In this article I am going to show you how to always have a spare node ready for when you need to scale your application. In fact, as soon as your app starts to use any resources on this node, a new one will start to spin up, so you always have one ready. This is called overprovisioning.
If you did not know, you can actually set a priority on each pod. Couple this with preemption, another Kubernetes feature, and you can mark some pods as lower priority. These pods get evicted to make space for the higher priority pods waiting to be scheduled, aka your application. The evicted low priority pods go Pending, which causes the cluster autoscaler to create a new node and schedule them onto it. Hence giving you a new node ready for your application to scale to.
To make sure we are only reserving capacity rather than actually putting load on the nodes, our lower priority pods run the Kubernetes pause container. This ships with Kubernetes for a different purpose (it normally acts as the parent container that holds a pod's namespaces), but all it does is sleep until it receives a signal, which is perfect for this.
So basically, we run enough of the lower priority pause pods to keep one extra node in our node pool. When your application needs to scale, Kubernetes evicts a lower priority pod from the node and your application pod is scheduled in its place. The lower priority pod then goes into Pending, and the cluster autoscaler goes ahead and scales up a new node for it.
Below you will find the manifest needed to set this up. All you have to do is configure the replica count, the CPU requests, and the node pool name to match your needs. It really is that simple.
If you want this for the full cluster rather than a single node pool, you can remove the node affinity section from the overprovisioning deployment.
Let’s explain a little first though as you will be deploying a few resources.
First, we are creating a PriorityClass called overprovisioning with a value of -1. Super low priority, since pods without a priority class default to 0, so everything else outranks it.
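As a sketch, the PriorityClass looks something like this (the name matches what we use in the rest of the article):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                 # lower than the default of 0, so these pods lose every time
globalDefault: false
description: "Placeholder priority for pause pods; any real workload preempts them."
```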
Next, we are creating a new ServiceAccount, ClusterRole, and ClusterRoleBinding. You will see that the new service account has only limited access on the cluster. This is good for security reasons: if someone execs into the autoscaler pod they will not have full access to the cluster.
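A minimal sketch of that RBAC wiring, assuming the autoscaler only needs to watch nodes and resize the overprovisioning deployment (the names and namespace here are examples, match them to your manifest):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: overprovisioning-autoscaler
  namespace: overprovisioning
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: overprovisioning-autoscaler
rules:
  # Watch cluster size so the pause deployment can be scaled with it.
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  # Resize the overprovisioning deployment, and nothing else.
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: overprovisioning-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: overprovisioning-autoscaler
subjects:
  - kind: ServiceAccount
    name: overprovisioning-autoscaler
    namespace: overprovisioning
```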
Then we create the lower priority deployment. This is where you set the number of replicas and the CPU requests needed to ensure you always have one node free. You will see the container image is pause.
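Here is a sketch of that deployment, including the node affinity section mentioned earlier (the pool name, image tag, and sizing are examples; AKS labels its nodes with an agentpool key you can match on):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: overprovisioning
spec:
  replicas: 1                      # enough pods to fill one spare node
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9   # does nothing, just holds the reservation
          resources:
            requests:
              cpu: "1"             # size this so replicas x cpu fills a whole node
      affinity:                    # pin to one node pool; remove for the full cluster
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: agentpool
                    operator: In
                    values:
                      - nodepoolname
```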
And finally, we create another deployment. This is the new autoscaler itself. The last line of its manifest is where we bind the autoscaler to the new service account.
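A sketch of what that can look like, assuming you use the cluster-proportional-autoscaler to keep the pause deployment sized in line with the cluster (the image tag, scaling parameters, and names are assumptions to adapt):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-autoscaler
  namespace: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-autoscaler
  template:
    metadata:
      labels:
        app: overprovisioning-autoscaler
    spec:
      containers:
        - name: autoscaler
          image: registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.9
          command:
            - /cluster-proportional-autoscaler
            - --namespace=overprovisioning
            - --configmap=overprovisioning-autoscaler
            - --target=deployment/overprovisioning        # the pause deployment above
            - --default-params={"linear":{"coresPerReplica":1}}
            - --logtostderr=true
            - --v=2
      serviceAccountName: overprovisioning-autoscaler     # bind to the limited service account
```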
So go ahead and apply the YAML to your cluster. You should then see the pause pods being created and then a new node spin up. As mentioned above you may need to tweak the replica count and CPU requests, but once you have that dialled in you are good to go.
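Applying and watching it happen looks something like this (the filename is just an example):

```shell
kubectl apply -f overprovisioning.yaml

# Watch the pause pods schedule, then watch the new node join
kubectl get pods -n overprovisioning -w
kubectl get nodes -w
```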
To run this on multiple node pools you will need to change the name of the overprovisioning deployment for each pool. Just add the node pool name to the beginning or end. An example for a Windows node pool can be found at the bottom of this article. You will also notice it uses a specific image tag; that image works with Windows.
Before you deploy the manifest, you will need a new namespace to keep things tidy. Use the following command to do that.
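Assuming you call the namespace overprovisioning to match the manifests:

```shell
kubectl create namespace overprovisioning
```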
Windows node pool manifest
Just change the node pool name and set your replica count and CPU requests to your needs.
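A sketch of the Windows variant, assuming a Windows-capable pause image from Microsoft's registry (the image tag and pool name are examples):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-win       # node pool name added to the deployment name
  namespace: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      run: overprovisioning-win
  template:
    metadata:
      labels:
        run: overprovisioning-win
    spec:
      priorityClassName: overprovisioning
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - name: reserve-resources
          image: mcr.microsoft.com/oss/kubernetes/pause:3.6   # tag with Windows support; example tag
          resources:
            requests:
              cpu: "1"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: agentpool
                    operator: In
                    values:
                      - winpool
```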
Thanks for reading and I hope you found this helpful. If you have any comments or suggestions on how to improve this let me know.