Node compute resource reservation for Gardener-managed clusters

June 20, 2022

Each worker node of a Kubernetes cluster runs system-critical resources besides the application workload. These include Kubernetes system daemons, OS-level components, and possibly tooling that shapes the landscape of the cluster. They should ideally be distinguished and isolated from the actual cluster workload to avoid a competition for resources in which starvation could bring the node down. This is especially important when managing a fleet of similarly configured clusters in an environment with strict separation between the teams operating the clusters and the teams using them.

Node Capacity

Fortunately, the kubelet provides configuration to separate these resources cleanly and to define the node capacity available to the Kubernetes cluster. Capacity is defined in terms of CPU, memory, and ephemeral storage.

The kubelet is the node agent that runs on each worker node and bridges communication between the cluster control plane and the node.


Kube Reserved

--kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes.

System Reserved

--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in systemd world).
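When the kubelet is configured through a configuration file rather than command-line flags, the same reservations map to fields of the KubeletConfiguration API. A minimal sketch, with illustrative values rather than recommendations:

```yaml
# Sketch of a kubelet configuration file expressing the reservations
# above; all quantities are illustrative examples.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
  pid: "1000"
systemReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
  pid: "1000"
```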

Eviction Thresholds

--eviction-soft=[memory.available<500Mi]
--eviction-hard=[memory.available<500Mi]
--eviction-max-pod-grace-period

These flags define node-pressure eviction behavior. Whenever the kubelet detects memory pressure on the node, it terminates pods so that the scheduler can place them on another node. The difference between the two thresholds is that soft eviction honors a grace period during pod termination (capped by eviction-max-pod-grace-period), while hard eviction terminates pods immediately.
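The eviction thresholds can likewise be expressed in the KubeletConfiguration file. A sketch, with illustrative values:

```yaml
# Sketch: node-pressure eviction settings in a kubelet configuration
# file; values are illustrative examples.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: 500Mi
evictionSoftGracePeriod:
  memory.available: 1m30s     # how long the soft threshold must hold
evictionHard:
  memory.available: 200Mi
evictionMaxPodGracePeriod: 60  # seconds; cap on soft-eviction grace period
```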

How to configure it in Gardener?

Gardener is software that helps operate and manage fleets of clusters on configured infrastructure at scale. It takes an opinionated approach to managing end-user clusters by running their control planes on shared resources. There are usually three levels of clusters: Garden, Seed, and Shoot. A Shoot is an end-user cluster, represented by the Shoot resource, which among other configuration contains the kubelet settings.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
…
spec:
  …
  kubernetes:
    …
    kubelet:
      evictionHard:
        memoryAvailable: 200Mi
      evictionSoft:
        memoryAvailable: 400Mi
      evictionSoftGracePeriod:
        memoryAvailable: 1m0s
      failSwapOn: true
      imageGCHighThresholdPercent: 50
      imageGCLowThresholdPercent: 40
      kubeReserved:
        cpu: 80m
        memory: 1Gi
        pid: 20k
      systemReserved:
        cpu: 80m
        ephemeralStorage: 1Gi
        memory: 1Gi
  …
  provider:
    controlPlaneConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: ControlPlaneConfig
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      enableECRAccess: true
      kind: InfrastructureConfig
      …
    type: aws
    workers:
    - machine:
        …
        type: m5.large

Describing the node, we get:

…
Capacity:
  cpu:                2
  ephemeral-storage:  50553132Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7941248Ki
  pods:               110
Allocatable:
  cpu:                1840m
  ephemeral-storage:  48104344948
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             5741696Ki
  pods:               110

As seen, the difference between the m5.large memory capacity and the memory allocatable for pods is roughly the 2Gi we reserved for system resources, plus the hard eviction threshold; the same holds for CPU (2000m minus 2 × 80m = 1840m). But this alone will not give some deployments priority over others.
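The numbers can be checked against the standard formula: Allocatable = Capacity − kube-reserved − system-reserved − hard eviction threshold. A quick sketch; the 100Mi hard-eviction value used here is an assumption (it is the kubelet default, and it makes the memory figures line up exactly with the node output above):

```python
# Sketch: reproduce the node's Allocatable from its Capacity and the
# reservations in the Shoot spec.
# Allocatable = Capacity - kube-reserved - system-reserved - eviction-hard.
KI_PER_MI = 1024
KI_PER_GI = 1024 * 1024

capacity_mem_ki = 7941248                # from `kubectl describe node`
kube_reserved_mem_ki = 1 * KI_PER_GI     # kubeReserved memory: 1Gi
system_reserved_mem_ki = 1 * KI_PER_GI   # systemReserved memory: 1Gi
# Assumed hard-eviction threshold of 100Mi (kubelet default).
eviction_hard_mem_ki = 100 * KI_PER_MI

allocatable_mem_ki = (capacity_mem_ki - kube_reserved_mem_ki
                      - system_reserved_mem_ki - eviction_hard_mem_ki)
print(allocatable_mem_ki)  # 5741696, matching the node output

capacity_cpu_m = 2000                    # 2 CPUs on m5.large
reserved_cpu_m = 80 + 80                 # kubeReserved + systemReserved cpu
print(capacity_cpu_m - reserved_cpu_m)   # 1840m, matching the node output
```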

Pod Priority and Preemption

Pods (and hence Deployments) can have priority and preempt other pods on the node. To achieve this, we reflect the importance hierarchy of our deployments in PriorityClasses. A PriorityClass is a non-namespaced object that maps a priority class name to an integer priority value. The higher the value, the higher the priority; the default priority is zero. So the strategy is to create a separate PriorityClass with a high value for landscape tooling.

A Gardener setup already creates PriorityClasses for its own needs, for example:

apiVersion: scheduling.k8s.io/v1
description: Used for system critical pods that must run in the cluster, but can be
  moved to another node if necessary.
kind: PriorityClass
metadata:
  name: system-cluster-critical
preemptionPolicy: PreemptLowerPriority
value: 2000000000  

We create another one for our specific tooling with a lower value. Note that the API server caps user-defined PriorityClasses at one billion; larger values are reserved for built-in system classes:

apiVersion: scheduling.k8s.io/v1
description: Used for cluster critical tooling
kind: PriorityClass
metadata:
  name: cluster-critical-tooling
preemptionPolicy: PreemptLowerPriority
value: 1000000000

As an example, we take a Datadog installation and set the new PriorityClass in its HelmRelease:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:  
  name: datadog
spec:
  …
  values:
    agents:
      priorityClassName: cluster-critical-tooling

This updates the agent pods, making them more important than the other load on the node:

spec:
  ...
  preemptionPolicy: PreemptLowerPriority
  priority: 1000000000
  priorityClassName: cluster-critical-tooling

Links

https://github.com/gardener/gardener
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption

About the author: Denis Khasbulatov