Pod Scheduling Fundamentals
The Kubernetes scheduler is responsible for assigning pods to nodes in your cluster. Understanding how scheduling works is fundamental to managing deployments effectively. This module explores the scheduling process and how to use secondary schedulers for specialized workload requirements.
The Scheduling Process
When you create a pod, it enters a Pending state until the scheduler assigns it to a node. The scheduling process consists of two main phases:
Filtering Phase
During the filtering phase, the scheduler evaluates all nodes and filters out those that cannot host the pod. A node is filtered out if:
- It lacks sufficient resources (CPU, memory, storage)
- It has taints that the pod cannot tolerate
- It doesn’t match node selectors or node affinity rules
- It has other constraints that prevent pod placement
Scoring Phase
After filtering, the scheduler scores each remaining node based on various factors:
- Resource availability and balance
- Pod affinity and anti-affinity preferences
- Node preferences and topology constraints
- Inter-pod affinity requirements
The node with the highest score is selected for pod placement.
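The two phases above can be sketched in a few lines. This is a toy model, not the actual kube-scheduler; the node and pod shapes are invented for illustration, and the scoring function stands in for the weighted plugin scores described later:

```python
# Toy model of the two-phase scheduling decision: filter out infeasible
# nodes, then score the survivors and pick the best one.

def schedule(pod, nodes):
    # Filtering phase: keep only nodes that can host the pod.
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # Scoring phase: here we simply prefer the node with the most free
    # resources; real schedulers combine many weighted plugin scores.
    return max(feasible, key=lambda n: n["free_cpu"] + n["free_mem"])

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
    {"name": "node-c", "free_cpu": 1, "free_mem": 2},
]
pod = {"cpu": 2, "mem": 4}
print(schedule(pod, nodes)["name"])  # node-b: feasible and highest score
```

Note that node-c is removed in the filtering phase (too little CPU), so it is never scored at all; scoring only ever ranks nodes that could actually run the pod.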
Secondary Scheduler Operator
The Secondary Scheduler Operator enables you to deploy and manage additional schedulers alongside the default Kubernetes scheduler. This allows you to create specialized schedulers with custom policies and behaviors for different workload types.
Understanding Secondary Schedulers
Secondary schedulers run in parallel with the default scheduler, each with its own scheduling policies and configuration. Pods select a scheduler via the `schedulerName` field in their pod specification.
Benefits of using secondary schedulers include:
- Workload-Specific Policies: Different schedulers can have different scoring algorithms and filters
- Isolation: Critical workloads can use dedicated schedulers with guaranteed resources
- Custom Logic: Implement domain-specific scheduling logic without modifying the default scheduler
- Testing: Test new scheduling policies without affecting the default scheduler
How Secondary Schedulers Work
When a pod specifies a secondary scheduler name, that scheduler, rather than the default one, becomes responsible for placing the pod: each scheduler watches the API server for unscheduled pods whose `schedulerName` matches its own. Each scheduler operates independently:

1. Pod Creation: The pod is created with `schedulerName` specified
2. Scheduler Selection: The scheduler watching for that name picks up the pod
3. Scheduling Process: The secondary scheduler evaluates nodes using its own policies
4. Node Binding: The scheduler binds the pod to the selected node
5. Pod Execution: The kubelet on the selected node starts the pod
Using Secondary Schedulers
To use a secondary scheduler, specify the scheduler name in your pod or deployment:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secondary-scheduled-pod
spec:
  schedulerName: my-custom-scheduler
  containers:
  - name: app
    image: myapp:latest
```
For deployments:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      schedulerName: my-custom-scheduler
      containers:
      - name: app
        image: myapp:latest
```
Secondary Scheduler Configuration
Secondary schedulers can be configured with custom profiles, plugins, and scoring strategies. The configuration is typically managed through ConfigMaps or the scheduler’s deployment configuration.
Example scheduler configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: secondary-scheduler-config
  namespace: secondary-scheduler
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: my-custom-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
            weight: 2
          - name: NodeAffinity
            weight: 3
          - name: PodTopologySpread
            weight: 1
        filter:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
          - name: TaintToleration
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
            - name: cpu
              weight: 1
            - name: memory
              weight: 1
```
Use Cases for Secondary Schedulers
High-Priority Workloads
Create a dedicated scheduler for high-priority workloads with aggressive resource allocation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  schedulerName: high-priority-scheduler
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:latest
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
```
Verifying Scheduler Assignment
You can verify which scheduler handled a pod:
```shell
# Check pod scheduler name
oc get pod <pod-name> -o jsonpath='{.spec.schedulerName}'

# View scheduler events
oc describe pod <pod-name> | grep -i scheduler

# Check secondary scheduler logs
oc logs -n kube-system -l app=secondary-scheduler
```
Scheduler Selection Behavior
If a pod specifies a scheduler name that doesn’t exist or isn’t running:
- The pod remains in the `Pending` state
- Events will indicate the scheduler is not available
- The default scheduler will not handle the pod automatically
Always ensure the specified scheduler is deployed and running before creating pods that reference it.
Best Practices for Secondary Schedulers
- Use secondary schedulers for workloads with specific scheduling requirements
- Ensure secondary schedulers are highly available (multiple replicas)
- Monitor secondary scheduler performance and resource usage
- Document which workloads use which schedulers
- Test scheduler configurations thoroughly before deploying to production
- Use descriptive scheduler names that indicate their purpose
- Consider scheduler resource requirements when planning cluster capacity
Pod Priority and Preemption
Pod priority and preemption allow you to influence the relative importance of pods and enable higher-priority pods to evict lower-priority pods when resources are constrained.
Understanding Pod Priority
Pod priority determines the relative importance of pods during scheduling and resource contention. Higher-priority pods are scheduled before lower-priority pods, and when resources are insufficient, lower-priority pods may be preempted (evicted) to make room for higher-priority pods.
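The "scheduled before" part can be illustrated with a priority queue. This is a toy model only; the pod names and values are invented, and the real scheduler's pending queue has additional ordering rules:

```python
import heapq

# Toy model: the pending queue considers higher-priority pods first.
# heapq is a min-heap, so we negate the priority to pop the largest first.
queue = []
for name, priority in [("batch-job", 100), ("web-frontend", 500), ("critical-db", 1000)]:
    heapq.heappush(queue, (-priority, name))

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # ['critical-db', 'web-frontend', 'batch-job']
```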
Priority Classes
Priority classes define priority levels that can be assigned to pods:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority class for critical workloads"
```
Key parameters:
- `value`: The priority value (higher numbers = higher priority)
- `globalDefault`: If true, this priority class is used for pods that do not specify a `priorityClassName`
- `description`: Human-readable description of the priority class
User-defined priority values can range from -2147483648 to 1000000000 (inclusive). Values above 1000000000 are reserved for built-in system priority classes such as system-cluster-critical (2000000000) and system-node-critical (2000001000).
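For contrast with the high-priority class above, a class at the low end of the user-defined range might look like the following; the name `low-priority` (referenced by the preemption examples later in this module) is kept, but the value 100 and description are illustrative choices:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority class for best-effort workloads"
```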
Using Priority Classes
Assign priority to pods by specifying a priority class:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
```
For deployments:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      priorityClassName: high-priority
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
```
Preemption
Preemption occurs when a higher-priority pod cannot be scheduled because nodes lack sufficient resources. The scheduler may evict lower-priority pods to make room for the higher-priority pod.
Preemption behavior:
- Only pods with lower priority than the pending pod can be preempted
- Preemption respects Pod Disruption Budgets (PDBs) on a best-effort basis; a PDB can still be violated if preemption is otherwise impossible
- Preempted pods are terminated gracefully, honoring their termination grace period
- Preemption occurs only when it is necessary to schedule the higher-priority pod
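Because preemption takes PDBs into account, a Pod Disruption Budget is the usual way to shield an important workload from being drained by preemption. A minimal sketch; the name `web-pdb` and the `minAvailable` value are assumptions, with the selector matching the earlier `web-deployment` example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

With this in place, the scheduler avoids choosing `app: web` pods as preemption victims when evicting them would leave fewer than two replicas available.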
Preemption Scenarios
Preemption can occur in several scenarios:
Insufficient Resources
When a high-priority pod requires resources that are currently allocated to lower-priority pods:
```yaml
# Low-priority pod already running
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-pod
spec:
  priorityClassName: low-priority
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
---
# High-priority pod that needs the same resources
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:latest
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
```
If the high-priority pod cannot be scheduled, the scheduler may preempt the low-priority pod.
Node Resource Constraints
When nodes are at capacity and a high-priority pod needs to be scheduled:
```shell
# Check node resource usage
oc describe nodes | grep -A 5 "Allocated resources"

# View pods by priority
oc get pods --all-namespaces --sort-by=.spec.priority
```
The scheduler evaluates which lower-priority pods can be preempted to accommodate the higher-priority pod.
Preemption Process
When preemption occurs:
1. Evaluation: The scheduler identifies lower-priority pods that can be preempted
2. PDB Check: Verifies that preemption won’t violate Pod Disruption Budgets
3. Graceful Termination: Sends a termination signal to the preempted pods
4. Resource Release: Waits for resources to be freed
5. Scheduling: Schedules the higher-priority pod
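The evaluation step can be sketched as follows. This is a simplified toy model: the real scheduler also weighs PDBs, affinity, and graceful-termination ordering, and the pod names, priorities, and CPU figures here are invented:

```python
# Toy model of preemption victim selection on a single node: evict the
# lowest-priority pods until the pending pod's CPU request fits.

def pick_victims(pending, running, node_capacity):
    free = node_capacity - sum(p["cpu"] for p in running)
    victims = []
    # Consider lower-priority pods first (cheapest victims).
    for victim in sorted(running, key=lambda p: p["priority"]):
        if free >= pending["cpu"]:
            break
        if victim["priority"] >= pending["priority"]:
            break  # pods of equal or higher priority are never preempted
        victims.append(victim)
        free += victim["cpu"]
    # Preemption happens only if it actually makes the pod schedulable.
    return victims if free >= pending["cpu"] else None

running = [
    {"name": "batch-1", "priority": 100, "cpu": 2},
    {"name": "batch-2", "priority": 100, "cpu": 2},
    {"name": "web", "priority": 500, "cpu": 2},
]
pending = {"name": "critical", "priority": 1000, "cpu": 3}
victims = pick_victims(pending, running, node_capacity=6)
print([v["name"] for v in victims])  # ['batch-1', 'batch-2']
```

Note that the higher-priority `web` pod is never considered, and that a pending pod whose priority is not strictly higher than any running pod's gets no victims at all.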
Viewing Priority and Preemption
Monitor pod priority and preemption events:
```shell
# View pod priority
oc get pod <pod-name> -o jsonpath='{.spec.priorityClassName}'
oc get pod <pod-name> -o jsonpath='{.spec.priority}'

# View all priority classes
oc get priorityclasses

# View preemption events
oc get events --field-selector reason=Preempted

# Check for preempted pods
oc get pods --all-namespaces --field-selector=status.phase=Failed | grep Preempted
```
Priority Class Best Practices
- Use priority classes to distinguish critical workloads from best-effort workloads
- Create a small number of well-defined priority levels
- Document priority class usage and policies
- Avoid using `globalDefault: true` unless necessary
- Monitor preemption events to understand cluster behavior
- Use Pod Disruption Budgets to protect important workloads from preemption
- Test priority configurations in non-production environments
- Consider resource requirements when assigning priorities
Priority and Resource Requests
Priority works in conjunction with resource requests:
- Pods without resource requests cannot preempt pods with requests
- Priority alone doesn’t guarantee scheduling; resources must be available
- Higher-priority pods still need sufficient resources to be scheduled
- Preemption only occurs when lower-priority pods can be evicted
Example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-no-requests
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: myapp:latest
    # No resource requests - cannot preempt other pods
```
This pod has high priority but cannot preempt other pods because it has no resource requests.