Pod Descheduling and Scaling
This module explores how Kubernetes manages the pod lifecycle beyond initial scheduling, including descheduling pods to optimize cluster utilization and automatic scaling mechanisms that adjust workloads based on demand. Understanding these mechanisms is crucial for maintaining efficient and responsive applications.
Understanding Descheduling
Descheduling is the process of removing pods from nodes to improve cluster balance and resource utilization. Unlike eviction, which is driven by resource pressure, descheduling is a proactive optimization mechanism.
Descheduling vs. Eviction
It’s important to distinguish between descheduling and eviction:
- Eviction: Reactive removal of pods due to resource pressure or policy violations (covered in Module 1)
- Descheduling: Proactive removal of pods to optimize cluster balance and utilization
Descheduling typically moves pods to better-suited nodes, while eviction may terminate pods entirely.
Descheduling Behavior
Note: Descheduling does not happen by default in Kubernetes. The default scheduler only places pods; it does not move them after placement. To enable descheduling, you need a descheduler component that actively monitors and rebalances pods.
The descheduler evaluates running pods and identifies opportunities to improve cluster balance by:
- Moving pods from over-utilized nodes to under-utilized nodes
- Consolidating pods to free up nodes for maintenance
- Spreading pods more evenly across nodes
- Removing duplicate pods when anti-affinity rules are violated
Descheduler Operator Profiles
The Descheduler Operator provides nine predefined profiles that combine various descheduling strategies. Each profile is designed to address specific cluster optimization scenarios. Profiles are configured through the KubeDescheduler custom resource, as documented in the OpenShift Descheduler Operator repository.
The available profiles are:
- AffinityAndTaints
- TopologyAndDuplicates
- SoftTopologyAndDuplicates
- LifecycleAndUtilization
- LongLifecycle
- CompactAndScale
- KubeVirtRelieveAndMigrate
- EvictPodsWithPVC
- EvictPodsWithLocalStorage
Descheduler Operator Configuration
The Descheduler Operator is configured through a KubeDescheduler custom resource. The operator runs as a controller that periodically evaluates the cluster and deschedules pods when conditions are met. Configuration typically includes:
- Descheduling interval: How often to evaluate pods for descheduling
- Strategy parameters: Thresholds and rules for each enabled strategy
- Namespaces to include/exclude: Scope of descheduling operations
Example Descheduler configuration using profiles:
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 3600
  profiles:
  - LifecycleAndUtilization
  - AffinityAndTaints
  mode: Automatic
  evictionLimits:
    total: 10
    node: 5
Key configuration parameters:
- profiles: List of profile names to enable (choose from the nine available profiles)
- deschedulingIntervalSeconds: How often the descheduler runs
- mode: Either Predictive (simulation only) or Automatic (actual evictions); a Predictive example follows below
- evictionLimits: Restrict the number of evictions per run
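Before enabling automatic evictions, you can run the descheduler in simulation mode. The following sketch (interval and profile values are illustrative) sets mode to Predictive so the operator only reports the pods it would evict:
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 3600
  profiles:
  - LifecycleAndUtilization
  mode: Predictive
After reviewing the simulated evictions in the descheduler logs, switch mode to Automatic to allow real evictions.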
Monitoring Descheduling
Monitor descheduling operations to understand cluster optimization:
# View Descheduler Operator status
oc get kubedescheduler -n openshift-kube-descheduler-operator
# View detailed Descheduler configuration
oc describe kubedescheduler cluster -n openshift-kube-descheduler-operator
# View descheduler operator logs
oc logs -n openshift-kube-descheduler-operator -l app=descheduler-operator
# Check for descheduled pods
oc get events --field-selector reason=Descheduled
# Monitor pod movements
oc get pods -o wide --watch
Horizontal Pod Autoscaler (HPA)
Horizontal Pod Autoscaler automatically scales the number of pod replicas based on observed metrics such as CPU utilization, memory usage, or custom metrics.
How HPA Works
HPA monitors specified metrics and adjusts the replica count of a Deployment, StatefulSet, or other scalable resource:
- Metrics Collection: HPA queries metrics from the Metrics API
- Target Calculation: Compares current metrics to target values (see the formula below)
- Replica Adjustment: Increases or decreases replicas to meet targets
- Cooldown Periods: Respects scale-up and scale-down delays to prevent thrashing
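The target calculation follows the standard formula from the Kubernetes documentation:
desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
For example, a deployment running 4 replicas at 90% average CPU utilization against a 70% target scales to ceil(4 * 90 / 70) = 6 replicas.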
Creating an HPA
Create an HPA to automatically scale a deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
Key parameters:
- minReplicas: Minimum number of replicas to maintain
- maxReplicas: Maximum number of replicas allowed
- metrics: Metrics to monitor and target values
- behavior: Scaling behavior including stabilization windows and policies
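An equivalent CPU-only HPA can also be created imperatively. Note that oc autoscale supports only a CPU utilization target, so the memory metric and the behavior section above still require the manifest:
# Create a CPU-based HPA (named after the deployment by default)
oc autoscale deployment/web-deployment --min=2 --max=10 --cpu-percent=70
# Review targets, current utilization, and replica counts for all HPAs in the namespace
oc get hpa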
HPA Metrics
HPA supports various metric types:
Resource Metrics
Scale based on CPU or memory utilization:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
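HPA can also scale on custom metrics when a metrics adapter (for example, the Prometheus Adapter) exposes them through the custom metrics API. The metric name below, http_requests_per_second, is a hypothetical application metric used only for illustration:
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second  # hypothetical metric served by a custom metrics adapter
    target:
      type: AverageValue
      averageValue: "100"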
HPA Behavior Configuration
Control how HPA scales with behavior settings:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    selectPolicy: Max
- stabilizationWindowSeconds: Time window to consider historical metrics
- policies: Scaling policies (Percent or Pods)
- selectPolicy: How to select among multiple policies (Max, Min, Disabled)
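As a worked example of the configuration above: with 10 current replicas, the scaleDown policy allows removing at most 50% of them (5 pods) per 60-second period, and only after metrics have stayed below target for the entire 300-second stabilization window. The scaleUp policy can add up to 100% of the current replicas (doubling the deployment) every 15 seconds with no stabilization delay. When several policies are defined, selectPolicy: Max applies whichever permits the largest change.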
Vertical Pod Autoscaler (VPA)
Vertical Pod Autoscaler automatically adjusts CPU and memory requests and limits for containers based on historical usage data.
How VPA Works
VPA monitors pod resource usage and recommends or automatically updates resource requests and limits:
- Metrics Collection: Collects resource usage metrics from pods
- Analysis: Analyzes historical usage patterns
- Recommendations: Generates resource recommendations
- Updates: Optionally updates pod specifications automatically
VPA Modes
VPA operates in different modes:
- Off: VPA only provides recommendations, doesn’t make changes
- Initial: VPA sets resource requests on pod creation only
- Auto: VPA updates resource requests on existing pods (requires pod recreation)
- Recreate: VPA recreates pods to apply new resource settings
Creating a VPA
Create a VPA to manage resource requests:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
Key parameters:
- updateMode: How VPA applies recommendations (Off, Initial, Auto, Recreate)
- minAllowed: Minimum resource values to set
- maxAllowed: Maximum resource values to set
- controlledResources: Which resources VPA should manage
VPA Recommendations
View VPA recommendations:
# View VPA status and recommendations
oc get vpa web-vpa -o yaml
# Get recommendation details
oc describe vpa web-vpa
VPA recommendations include:
- Target: Recommended values for optimal performance
- Lower Bound: Minimum recommended values below which performance may suffer
- Upper Bound: Maximum recommended values; allocating more is likely wasteful
- Uncapped Target: The recommendation without the min/max constraints applied
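Recommendations appear under status.recommendation in the VPA object. The snippet below is a sketch of that section; the values are illustrative, not real measurements:
status:
  recommendation:
    containerRecommendations:
    - containerName: app
      lowerBound:
        cpu: 150m
        memory: 256Mi
      target:
        cpu: 250m
        memory: 384Mi
      uncappedTarget:
        cpu: 250m
        memory: 384Mi
      upperBound:
        cpu: 1200m
        memory: 1Gi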
VPA Resource Policies
Control which containers VPA manages:
resourcePolicy:
  containerPolicies:
  - containerName: app
    mode: Auto
    minAllowed:
      cpu: 100m
      memory: 128Mi
    maxAllowed:
      cpu: 2
      memory: 4Gi
  - containerName: sidecar
    mode: "Off"
VPA Limitations
Important considerations when using VPA:
- VPA cannot update resources of running pods; it requires pod recreation
- VPA recommendations are based on historical data and may not reflect sudden changes
- Using VPA with HPA requires careful coordination
- VPA may cause pod churn when updating resources
Interaction Between Components
Note: Descheduling, Horizontal Pod Autoscaler (HPA), and Vertical Pod Autoscaler (VPA) can interfere with each other if all are deployed simultaneously. Understanding their interactions is crucial for stable cluster operation.
Potential Conflicts
HPA and VPA
HPA and VPA both manage pod resources but in different ways:
- HPA: Changes replica count based on resource utilization
- VPA: Changes resource requests/limits per pod
If both are active on the same workload:
- VPA may adjust resource requests, affecting utilization percentages that HPA monitors
- HPA may scale replicas up/down, changing the workload that VPA analyzes
- This can create feedback loops and unstable scaling behavior
Best practice: Use either HPA or VPA for a given workload, not both simultaneously.
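If you still want VPA's sizing insight for an HPA-managed workload, one low-risk pattern is to run VPA purely as an advisor: with updateMode set to "Off" it only records recommendations, which you can review and apply manually during a maintenance window while HPA keeps handling replica counts. A sketch (the VPA name is illustrative):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa-advisor
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  updatePolicy:
    updateMode: "Off"  # recommendations only; no automatic updates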
Descheduler and HPA
Descheduler and HPA can conflict:
- Descheduler: Moves pods to optimize cluster balance
- HPA: Scales replica count based on metrics
Potential issues:
- Descheduler may move pods that HPA is monitoring, affecting metrics
- HPA scaling decisions may conflict with the descheduler’s rebalancing efforts
- Rapid scaling combined with descheduling can cause pod churn
Descheduler and VPA
Descheduler and VPA interactions:
- Descheduler: Moves pods between nodes
- VPA: Updates pod resource specifications
Potential issues:
- VPA may recommend resource changes that trigger descheduling
- Descheduler may move pods that VPA is analyzing
- Pod recreation from VPA updates can interfere with descheduling decisions
Coordination Strategies
If you need to use multiple components:
Use HPA with Descheduler
- Configure HPA with longer stabilization windows to reduce rapid scaling
- Set the descheduler to run less frequently (both adjustments are sketched after this list)
- Use descheduler strategies that don’t conflict with HPA metrics
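A sketch of both adjustments, with illustrative values:
# HPA fragment: wait 10 minutes of sustained low utilization before scaling down
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600
# KubeDescheduler fragment: rebalance every two hours instead of every hour
spec:
  deschedulingIntervalSeconds: 7200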
Monitoring Interactions
Monitor all components to detect conflicts:
# Monitor HPA scaling events
oc get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
# Monitor VPA recommendations
oc get vpa --all-namespaces
# Monitor descheduler activity
oc logs -n openshift-kube-descheduler-operator -l app=descheduler
# Watch pod churn
oc get pods -w
Best Practices
- Use HPA for workloads that can scale horizontally (stateless applications)
- Use VPA for workloads that need resource optimization but can’t scale horizontally
- Avoid using HPA and VPA on the same workload simultaneously
- Configure the descheduler with appropriate intervals to avoid excessive pod movement
- Set HPA stabilization windows to prevent rapid scaling oscillations
- Monitor all autoscaling components to detect conflicts early
- Test autoscaling configurations in non-production environments first
- Use Pod Disruption Budgets to protect critical workloads during scaling operations (see the sketch after this list)
- Review VPA recommendations before enabling Auto mode
- Coordinate descheduling with application maintenance windows when possible
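Pod Disruption Budgets limit how many replicas can be voluntarily disrupted at once, and both the descheduler and the eviction API honor them. A minimal sketch for the earlier web-deployment, assuming its pods carry the label app: web (the label is an assumption for illustration):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web  # assumed pod label on the web-deployment replicas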