Each worker node of a Kubernetes cluster runs system-critical components besides the application workload. These can be Kubernetes system daemons, OS-level components, and possibly tools that shape the landscape of the cluster. They should be distinguished and isolated from the actual cluster workload to avoid a competition for resources in which starvation could bring the node down. This is especially important when managing a fleet of similarly configured clusters in an environment with strict separation between the teams operating the clusters and the teams using them.
Fortunately, the kubelet provides configuration options to enforce a clear resource separation and to define the node capacity available to the Kubernetes cluster. Capacity is defined in terms of CPU, memory, and ephemeral storage; what remains after subtracting the reservations and eviction thresholds described below is published as the node's Allocatable.
The kubelet is the node agent that runs on every worker node and bridges the communication between the cluster control plane and the node.
kube-reserved is meant to capture resource reservation for Kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that run as pods. kube-reserved is typically a function of pod density on the nodes.
system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in systemd world).
The eviction-hard and eviction-soft settings define node-pressure eviction behavior. Whenever the kubelet detects resource pressure on the node, it terminates pods so that the scheduler can place them on another node. The difference between the two is that eviction-soft respects a grace period during pod termination (bounded by eviction-max-pod-grace-period), while eviction-hard terminates pods immediately.
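Outside Gardener, the same reservations and eviction thresholds can be set directly in the kubelet's own configuration file. The fragment below is a minimal sketch using the upstream KubeletConfiguration API; the concrete values are illustrative, not recommendations.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve resources for Kubernetes system daemons (kubelet, container runtime, ...)
kubeReserved:
  cpu: "80m"
  memory: "1Gi"
# Reserve resources for OS daemons (sshd, udev, kernel, user.slice, ...)
systemReserved:
  cpu: "80m"
  memory: "1Gi"
# Hard eviction: pods are terminated immediately once the threshold is crossed
evictionHard:
  memory.available: "200Mi"
# Soft eviction: the threshold must hold for the grace period first,
# and pod termination respects a grace period capped by evictionMaxPodGracePeriod
evictionSoft:
  memory.available: "400Mi"
evictionSoftGracePeriod:
  memory.available: "1m0s"
evictionMaxPodGracePeriod: 90  # illustrative value, in seconds
```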
How to configure it in Gardener?
Gardener is software that helps operate and manage fleets of clusters on configured infrastructure at scale. Gardener introduces an opinionated way of managing end-user clusters by sharing the resources where the end-user clusters' control planes run. There are usually three levels of clusters: Garden, Seed, and Shoot. A Shoot is an end-user cluster, represented by the Shoot resource, which contains, among other configuration, the kubelet settings.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
…
spec:
  …
  kubernetes:
    …
    kubelet:
      evictionHard:
        memoryAvailable: 200Mi
      evictionSoft:
        memoryAvailable: 400Mi
      evictionSoftGracePeriod:
        memoryAvailable: 1m0s
      failSwapOn: true
      imageGCHighThresholdPercent: 50
      imageGCLowThresholdPercent: 40
      kubeReserved:
        cpu: 80m
        memory: 1Gi
        pid: 20k
      systemReserved:
        cpu: 80m
        ephemeralStorage: 1Gi
        memory: 1Gi
  …
  provider:
    controlPlaneConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: ControlPlaneConfig
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      enableECRAccess: true
      kind: InfrastructureConfig
    …
    type: aws
    workers:
    - machine:
        …
        type: m5.large
Describing the node, we get:
…
Capacity:
  cpu:                2
  ephemeral-storage:  50553132Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7941248Ki
  pods:               110
Allocatable:
  cpu:                1840m
  ephemeral-storage:  48104344948
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             5741696Ki
  pods:               110
As seen, the difference between the m5.large memory capacity and the memory allocatable for pods is roughly the 2Gi we reserved for system resources (plus the hard-eviction threshold). But this alone does not solve the problem of giving some deployments priority over others.
Pod Priority and Preemption
Pods (deployments) can have priority and preempt other pods on the node. To achieve this, we should reflect the importance hierarchy of our deployments in PriorityClasses. A PriorityClass is a non-namespaced object that defines a mapping from a priority class name to an integer priority value. The higher the value, the higher the priority; the default priority value is zero. So the strategy is to create a separate PriorityClass with a high value for landscape tooling.
A Gardener setup already creates PriorityClasses for its own needs, for example:
apiVersion: scheduling.k8s.io/v1
description: Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.
kind: PriorityClass
metadata:
  name: system-cluster-critical
preemptionPolicy: PreemptLowerPriority
value: 2000000000
We create another one for our specific tooling with a slightly lower priority:
apiVersion: scheduling.k8s.io/v1
description: Used for cluster critical tooling
kind: PriorityClass
metadata:
  name: cluster-critical-tooling
preemptionPolicy: PreemptLowerPriority
value: 1500000000
As an example, we will take a Datadog installation and set the new PriorityClass in its HelmRelease:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: datadog
spec:
  …
  values:
    agents:
      priorityClassName: cluster-critical-tooling
This updates all current pods and makes them more important than the other workload on the node:
spec:
  ...
  preemptionPolicy: PreemptLowerPriority
  priority: 1500000000
  priorityClassName: cluster-critical-tooling
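For workloads that are not managed through Helm values, the same effect is achieved by setting priorityClassName directly in the pod template. A hypothetical Deployment sketch (the name and image are placeholders, not part of the Datadog example above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-tool          # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: critical-tool
  template:
    metadata:
      labels:
        app: critical-tool
    spec:
      # pods created from this template get priority 1500000000
      priorityClassName: cluster-critical-tooling
      containers:
      - name: agent
        image: registry.example.com/agent:latest   # placeholder image
```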