Kubernetes Pod Deployment at Scale
A developer team approaches you, the Kubernetes admin, and explains that their app runs data- and CPU-intensive processes. They want the pods behind their service to run on two dedicated nodes in the cluster where no other pods will run, so all resources on those nodes are reserved for them. They also want to ensure that no more than one app replica runs on the same node. A question that came up along the way: how is it that no pod gets scheduled on the master node, even though it is part of the cluster?
Let me answer that question first, in case you only want to read part of the article: the master node cannot host application pods because it carries a taint, which prevents any pod that does not tolerate it from being scheduled there. Curious what a taint is? Keep reading.
NodeName and NodeSelector
In Kubernetes, a pod is normally scheduled onto one of the worker nodes by the Kubernetes scheduler, which picks what it considers the best node for the pod. In some cases, though, we want to decide where the pod gets scheduled ourselves, by setting parameters that narrow down the eligible nodes according to a policy we define. The simplest way is the nodeName attribute in the deployment YAML: its value is the name of a specific node, as simple as that.
In the YAML below you can see the field that states the node (worker1) that will host the pod. Very straightforward.
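The original manifest was shown as an image; here is a minimal sketch of what such a pod spec looks like (the node name worker1 and the nginx image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-intensive-app
spec:
  nodeName: worker1   # schedule this pod directly onto the node named worker1
  containers:
    - name: app
      image: nginx    # placeholder image
```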
Dynamic nodeNames
But what if node names are dynamic and you don't know them beforehand? This is very common in a cloud environment where nodes are provisioned on the fly. There is another pitfall with nodeName too: if the named node doesn't have enough resources to run the pod, the pod will not be scheduled anywhere. So how can we select nodes without hard-coding their names? For that we can use the nodeSelector attribute, which selects nodes by labels. For example, you can label the high-CPU or high-RAM nodes of your dev environment and schedule pods selectively onto them. If you run kubectl get nodes --show-labels you will see that nodes already have auto-generated labels, and you can add your own custom labels as well.
For example, say one worker has plenty of CPU resources and we want to schedule our pod onto that specific node. To do that, label it from the CLI:
kubectl label node [node name] type=cpu
Now that the node has this new label, it's time to create a pod definition YAML that references it. Once the YAML below is applied, the pod will be deployed only on the worker carrying the type=cpu label.
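Again, the original manifest was an image; a minimal sketch of a pod using nodeSelector with the label from the command above might look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-intensive-app
spec:
  nodeSelector:
    type: cpu          # only nodes labeled type=cpu are eligible
  containers:
    - name: app
      image: nginx     # placeholder image
```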
Node Affinity
While nodeSelector provides flexibility based on labels, it will not help if the labeled nodes lack the resources to run the pod, which can leave the deployment pending. You may also need more expressive selection rules when your environment has thousands of nodes rather than a few, for example pods that should spread across different regions, zones, and subnets. For that you can use nodeAffinity. The affinity language is more expressive: it matches labels with logical operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt, at the cost of a more complex syntax. You can define multiple rules, and there are 2 types of node affinity:
- “hard” affinity = required: the node has to match exactly what is defined in the rule. No match? No deployment! This behaves like nodeSelector.
- “soft” affinity = preferred: the scheduler will try to find a node that matches the expression, but if it can't, it will schedule the pod anyway. This way we can make sure the pod ends up somewhere; it is meant for best-effort scenarios.
As for the operators, we can use the following:
- In: the label value must appear in a supplied list, for example to specify the regions where the deployment can happen.
- Exists: matches all nodes that have a specific label defined, regardless of its value.
- Gt (greater than): for example, more than a given amount of CPU.
- Lt (less than): for instance, less CPU.
- NotIn: the negative operator, for example when you don't want to deploy on nodes running Ubuntu 16.04.
- DoesNotExist: matches nodes that do not have a specific label, for example don't deploy on SR-IOV nodes.
The example YAML below shows a hard rule restricting the pod to Linux nodes and a soft preference for a high-CPU node.
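The original manifest was an image; a rough sketch of such a pod spec, assuming the nodes carry the standard kubernetes.io/os label and the custom type=cpu label from earlier, could look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      # "hard" rule: only Linux nodes are eligible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                  - linux
      # "soft" rule: prefer the high-CPU nodes labeled earlier, but don't insist
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: type
                operator: In
                values:
                  - cpu
  containers:
    - name: app
      image: nginx   # placeholder image
```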
Taints and Tolerations
If you have read this far, you already know that the Kubernetes master node is tainted, which prevents non-master pods from being deployed on it. So far I covered how to steer pods toward nodes, but we can also work from the opposite direction: what if we want to configure the nodes themselves with that kind of logic? This is where taints come into play. To view a node's taints, run kubectl describe node [node-name]; to see the taints across all nodes, run kubectl describe node | grep Taint. The taint on the master node is set by kubeadm during cluster initialization. Still, some pods do run on the master, such as etcd, the network plugin, and kube-proxy. These pods carry a toleration, which lets them be scheduled even when a matching taint is present; that is how all the master components get deployed. If, for example, you want to deploy a pod on the master node to collect its logs, you need to configure a toleration so it tolerates the master's taint. Note that a toleration only allows the pod onto tainted nodes, it does not force it there, so we still need to pin the pod to the specific node with nodeSelector, nodeName, or nodeAffinity. Confused? Let's see what the YAML looks like:
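The manifest in the original post was an image; here is a minimal sketch, assuming a kubeadm cluster where the master carries the node-role.kubernetes.io/master:NoSchedule taint and the matching node label (newer versions use node-role.kubernetes.io/control-plane instead; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: master-log-collector
spec:
  tolerations:
    - key: node-role.kubernetes.io/master   # tolerate the master taint set by kubeadm
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    node-role.kubernetes.io/master: ""      # and pin the pod to the master node itself
  containers:
    - name: log-collector
      image: busybox                        # placeholder image
      command: ["sh", "-c", "sleep 3600"]
```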
Inter-Pod Affinity and Anti-Affinity
Let's assume we have a dynamic infrastructure where nodes get added and removed based on application load. We plan to deploy a log collector pod on every master node, and we have 5 master nodes, so we specify 5 replicas and a nodeSelector targeting master nodes. However, suppose at that moment the nodeSelector finds only one master node available; what happens to the other 4 replicas? They will also be scheduled on that one master node, while the plan was 1 replica per node. How do we solve that? We use inter-pod anti-affinity, which lets you constrain which nodes a pod is eligible for based on the labels of pods already running on those nodes. In other words, if a node is already running a copy of pod A, another copy of pod A will not be scheduled on that node, only on one that isn't running it.
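A rough sketch of what such a rule might look like in the log collector's deployment, assuming the pods are labeled app=log-collector and the masters are labeled and tainted as in the previous example (names here are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-collector
spec:
  replicas: 5
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # target master nodes only
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          # hard rule: never place two log-collector pods on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: log-collector
              topologyKey: kubernetes.io/hostname
      containers:
        - name: log-collector
          image: busybox                     # placeholder image
          command: ["sh", "-c", "sleep 3600"]
```

The same anti-affinity pattern also answers the developer team's original request that no two app replicas land on the same node.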
Join me on LinkedIn.