The other day someone reached out to me via LinkedIn and asked whether it is possible to have a Kubernetes deployment schedule onto an Azure Kubernetes Service (AKS) node pool first and, when the node pool is full, burst to Azure Container Instances (ACI), known as virtual nodes in AKS.
Looking at the Microsoft docs regarding bursting to ACI, all of the examples show you how to make a Kubernetes deployment use the virtual node, but not how to use it only when there is no space left in the node pool.
So, I decided to have a little play with node affinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity), in particular the requiredDuringSchedulingIgnoredDuringExecution type with more than one node selector term. Because the terms are ORed together, this basically means "run this set of pods on nodes that match this, or if not, on nodes that match that". I will go through this in more detail below with the example.
Below is a simple deployment manifest, but with an extra section.
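A minimal sketch of such a manifest, laid out so that the line numbers used in the walkthrough line up with it. The deployment name, labels, image and resource values are assumptions for illustration; the agentpool label, the type label and the virtual-kubelet.io/provider toleration come from the description that follows.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aci-helloworld            # assumed name, for illustration
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aci-helloworld         # assumed label
  template:
    metadata:
      labels:
        app: aci-helloworld
    spec:
      containers:
      - name: aci-helloworld
        image: mcr.microsoft.com/azuredocs/aci-helloworld   # assumed sample image
        resources:
          requests:
            memory: "64Mi"        # assumed values; size these for your workload
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:    # terms are ORed: either one may match
            - matchExpressions:
              - key: agentpool    # AKS node pool label
                operator: In
                values:
                - agentpool       # name of the node pool
            - matchExpressions:
              - key: type
                operator: In
                values:
                - virtual-kubelet # label on the virtual node
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
```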
Lines 1 to 17 are what you would normally see in a deployment. Hopefully lines 18 to 24, the resource requests and limits, are also in your deployment manifest files, but if not, make sure you add them to ensure pods don't use all your node resources. They matter here too, because the scheduler uses the requests to decide whether a node is full.
Lines 25 and 26 are the start of the affinity section. On line 27 I have requiredDuringSchedulingIgnoredDuringExecution. This tells the Kubernetes scheduler that the pod can only be scheduled on nodes that match at least one of the node selector terms that follow. In this example we have two. The first one says to only schedule the pod on a node that has the label agentpool with the value agentpool, the value being the name of the node pool. The second match expression says the label type needs to equal virtual-kubelet, which is what allows the pod to be scheduled on the virtual node.
Lines 39 to 41 tell the Kubernetes scheduler that the pods in this deployment are OK with being scheduled on nodes that have the virtual-kubelet.io/provider taint. The virtual node has this taint. Keep in mind that this toleration does not stop the pod from being scheduled on nodes without the taint.
So, this deployment manifest is saying that the pod can be scheduled on any node in the node pool called agentpool or on the virtual node, but how does it know to use the node pool first? Basically, the Kubernetes scheduler will prefer the node pool over the virtual node. This is good, as I am unable to edit the default scheduler in AKS.
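If you would rather not rely on the scheduler's default scoring, node affinity also has a preferredDuringSchedulingIgnoredDuringExecution type that can sit alongside the required rule. The fragment below is only a sketch of that option, not something used in this post; the weight is an assumption and the behaviour on a virtual node is untested here.

```yaml
# Fragment only: an optional extra key under affinity.nodeAffinity,
# next to the required rule. Not part of the manifest described in this post.
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100              # assumed weight; valid range is 1 to 100
  preference:
    matchExpressions:
    - key: agentpool       # AKS node pool label used in this post
      operator: In
      values:
      - agentpool          # name of the node pool to favour
```

The required rule still decides where the pod is allowed to run; the preferred rule only adds to the score of matching nodes.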
If you have a system node pool, a user node pool and a virtual node, the preferred scheduling order for this deployment will be the user node pool defined in the node affinity and then the virtual node. If you are using only the system node pool and a virtual node, then the system node pool will be preferred. One thing to watch: if you have all three but point the node affinity at the system node pool, then once some pods are running on the virtual node and you scale down, the pods on the system node pool are removed first. This is due to the system node pool being important to the running of the AKS cluster.
See it in Action
For this blog post I have a simple AKS cluster with one system node pool and two user node pools. The user node pools are called agentpool and test, and the system node pool is called system. I also have the virtual node add-on installed. To start with, I have nothing running in the cluster's default namespace.
Now I will deploy the manifest from above. This will deploy one pod as I have defined one replica.
If I use the command kubectl get pods -o wide I can see which node the pod is running on.
In this case it is the aks-agentpool… node. OK, so now let's add some more pods to see where they are scheduled. For that I will scale the deployment up to 17 replicas using kubectl scale with the --replicas flag.
I now have 16 pods running on the node pool and 1 on the virtual node. This is due to the agentpool node being full and the node pool already being at the maximum it can scale to, in this case just one node. So, to note, the virtual node will only be used when the node pool has scaled up as much as is allowed and is fully utilised.
One thing to note is that this works nicely for scaling up but does not work well for scaling down. When I scale back down to 1 replica, the only pod left will be the one running on the virtual node. This is due to the way the ReplicaSet controller chooses which pods to delete. It uses the following algorithm to decide:
- Pending (and unschedulable) pods are scaled down first
- If controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value will come first. (This is coming in beta with version 1.22)
- Pods on nodes with more replicas come before pods on nodes with fewer replicas. (This is the rule that applies in this case: the agentpool node holds 16 replicas and the virtual node only one.)
- If the pods’ creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale when the LogarithmicScaleDown feature gate is enabled)
When all the above match, then selection is random.
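As a sketch of the second rule in the list above, this is what the pod-deletion-cost annotation looks like on an individual pod. It would need to be set on the specific pod you want removed first (for example the one on the virtual node) rather than in the deployment template, otherwise every replica gets the same cost. The value -100 is an arbitrary assumption, and this is untested here against virtual nodes, so treat it as a starting point only.

```yaml
# Fragment of the metadata of the single pod you want removed first,
# for example the one on the virtual node. Not part of the Deployment template.
metadata:
  annotations:
    # Assumed value for illustration; pods without the annotation count as 0,
    # and the pod with the lower cost is deleted first.
    controller.kubernetes.io/pod-deletion-cost: "-100"
```

Because this rule is evaluated before the replicas-per-node rule, a lower cost on the virtual node pod should make it the first candidate when scaling down.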
If anyone does have any ideas on how to handle this better, then please reach out.
Thanks for reading. I hope you found this article helpful. If you have any questions or comments, please reach out.