What is the need for Node Affinity? #


  • Control Where Pods Run: Sometimes, you want Pods to run only on specific nodes (e.g., nodes with GPUs, SSDs, or located in a particular zone)
  • Use Node Affinity for Targeted Scheduling: Node Affinity lets you influence which nodes a Pod can be scheduled on, based on node labels
  • Ensure Resource Compatibility: Schedule Pods on nodes that meet special hardware or configuration needs, like GPU nodes for ML workloads
  • Enhance Performance and Isolation: Co-locate Pods that benefit from a shared cache, or avoid placing high-traffic Pods on the same node, for better performance
  • Improve Fault Tolerance: Run workloads on nodes in separate locations to reduce the impact of failures in one area
  • Avoid Overloading Critical Nodes: Keep non-critical apps away from nodes reserved for sensitive or resource-heavy workloads

Practical Example: Node Affinity #


# WHAT: Pod with Node Affinity Rules
# WHY:  Ensure this Pod is scheduled only on nodes
#       with specific labels (e.g., node-type=high-mem)
# WHEN: Needed when workloads require specialized
#       hardware, OS, or zone-specific nodes

apiVersion: v1
kind: Pod
metadata:
  name: memory-intensive-app
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
      # Simple demo container; replace as needed

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # WHAT: Hard requirement for node selection
        # WHY: Pod will only be scheduled if the
        #       rule matches at scheduling time
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - high-mem
                # WHAT: Match nodes labeled with
                #       node-type=high-mem
                # WHY: Needed for workloads that
                #       require large memory capacity

      # ALTERNATIVE: Use 'preferred' affinity
      # Allows more flexible scheduling
      #
      # preferredDuringSchedulingIgnoredDuringExecution:
      #   - weight: 1
      #     preference:
      #       matchExpressions:
      #         - key: node-type
      #           operator: In
      #           values:
      #             - high-mem
      #   # WHAT: Try to schedule on high-mem nodes
      #   # WHY: Helps when strict placement isn't required
      #   # WHEN: Use when you prefer a node but can fall back

# TO MAKE THIS WORK:
# Label your node using:
# kubectl label nodes <node-name> node-type=high-mem
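
After labelling a node as shown above, apply the manifest and confirm where the Pod landed. A minimal sketch; the file name node-affinity-pod.yaml is an assumption.

# Apply the Pod manifest (file name is an assumption)
kubectl apply -f node-affinity-pod.yaml

# Confirm which node the Pod was scheduled on
kubectl get pod memory-intensive-app -o wide

# If the Pod stays Pending, the Events section explains why
kubectl describe pod memory-intensive-app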

What is the difference between Node Affinity and Node Selector? #


# Node Selector - OLD (Basic)
nodeSelector:
  disktype: ssd

# Node Affinity - NEW and RECOMMENDED
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In  # Also supports NotIn, Exists, etc.
          values:
          - ssd
          - another-value

  • Simpler and Older Mechanism: nodeSelector is the simpler, older predecessor of Node Affinity (a complete nodeSelector example follows this list)
  • Straightforward Syntax: Easy to understand and implement for basic use cases
  • Limited Flexibility: Only supports exact matches on one or more key-value pairs; it cannot express conditions with operators such as In, NotIn, or Exists
  • Node Affinity is Recommended for Modern Use: Node Affinity covers everything nodeSelector can do, and adds richer operators and soft (preferred) rules
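
For comparison, here is a minimal complete Pod using nodeSelector. It is a sketch; the Pod name is an assumption, and the disktype=ssd label mirrors the snippet above.

# WHAT: Pod using nodeSelector (the simpler alternative)
# WHY:  Exact label match only, no operators or preferences

apiVersion: v1
kind: Pod
metadata:
  name: ssd-app                # name is an assumption for this sketch
spec:
  containers:
    - name: app
      image: nginx
  nodeSelector:
    disktype: ssd              # node must carry exactly this label

# TO MAKE THIS WORK:
# kubectl label nodes <node-name> disktype=ssd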

Why do we need Pod Affinity? #


  • Control Pod Co-Location: Sometimes you want certain Pods to be scheduled on the same node as others — for better performance or tighter integration
  • Use Pod Affinity to Group Related Pods: Ensures Pods with similar roles (e.g., app and sidecar, or tightly coupled microservices) are placed together
  • Reduce Latency Between Services: Co-locating Pods improves communication speed when services need to frequently talk to each other
  • Support Shared Resource Usage: Helps when Pods share a persistent volume (e.g., via ReadWriteMany) or access the same local cache or hardware
  • Improve Data Locality and Throughput: Useful in big data or AI workloads where data needs to stay close to compute for fast access
  • Apply Topology-Aware Scheduling: Pod affinity can operate at the node, zone, or region level by specifying the topologyKey

Practical Example: Pod Affinity #


# WHAT: Pod with Pod Affinity
# WHY:  Schedule this Pod close to another Pod
#       that has specific labels (e.g., app=backend)
# WHEN: Needed for performance or locality reasons—
#       for example, frontend near backend
# WHERE: Works within same topology (e.g., zone or node)

apiVersion: v1
kind: Pod
metadata:
  name: frontend-app
spec:
  containers:
    - name: app
      image: nginx
      # Simple container for demo; replace as needed

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # WHAT: Hard rule for scheduling
        # WHY: Pod will only be scheduled if another
        #      matching Pod exists in the same zone
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - backend
                # WHAT: Match any Pod with app=backend
          topologyKey: "kubernetes.io/hostname"
          # WHAT: Match must be in same node
          # OPTIONS:
          # - kubernetes.io/hostname: same node
          # - topology.kubernetes.io/zone: same zone
          # - topology.kubernetes.io/region: same region

      # ALTERNATIVE: Use preferred affinity
      # This allows scheduling elsewhere if match fails
      #
      # preferredDuringSchedulingIgnoredDuringExecution:
      #   - weight: 100
      #     podAffinityTerm:
      #       labelSelector:
      #         matchExpressions:
      #           - key: app
      #             operator: In
      #             values:
      #               - backend
      #       topologyKey: "kubernetes.io/hostname"
      #   # WHAT: Prefer placing near 'backend' Pods
      #   # WHY: Useful when co-location is desirable
      #   # WHEN: Use when strict placement isn't required

# TO MAKE THIS WORK:
# Make sure another Pod exists with:
# labels:
#   app: backend
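
For completeness, a minimal Pod that satisfies the rule above could look like this. It is a sketch; the Pod name and image are assumptions, only the app=backend label matters.

# WHAT: A Pod carrying the label the affinity rule looks for
# WHY:  With a required rule, frontend-app stays Pending until
#       at least one Pod labelled app=backend is running

apiVersion: v1
kind: Pod
metadata:
  name: backend-app            # name is an assumption
  labels:
    app: backend               # label matched by the affinity rule
spec:
  containers:
    - name: backend
      image: nginx             # image is an assumption for this sketch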

What is Anti-Affinity? #


  • Avoid Placing Similar Pods Together: Anti-affinity ensures that certain Pods are not scheduled on the same node (or zone) as other matching Pods
  • Define Rules Using Labels and Topology: Anti-affinity is label-based and supports topologyKey (like kubernetes.io/hostname or topology.kubernetes.io/zone) to control the scope of separation
  • Improve High Availability: By spreading replicas across nodes or zones, anti-affinity helps prevent all replicas from going down due to a single node or zone failure
  • Prevent Single Point of Failure: Especially useful for replicated workloads (like StatefulSets) where all Pods shouldn’t fail together
  • Minimize Resource Contention: Keeps resource-hungry Pods apart so they don’t compete for CPU, memory, or disk on the same node

Practical Example: Anti Affinity #


# WHAT: Pod with Pod Anti-Affinity
# WHY:  Ensure this Pod does NOT run on the same node
#       as any Pod with label app=backend
# WHEN: Used to spread workloads for availability,
#       performance, or compliance
# WHERE: Works within defined topology (e.g., node)

apiVersion: v1
kind: Pod
metadata:
  name: analytics-app
spec:
  containers:
    - name: analytics
      image: busybox
      command: ["sleep", "3600"]
      # Dummy workload to hold the Pod

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # WHAT: Hard rule to avoid specific co-location
        # WHY: Prevent running alongside backend Pods
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - backend
                # WHAT: Match Pods labeled app=backend
          topologyKey: "kubernetes.io/hostname"
          # WHAT: Enforce on same node level
          # OPTIONS:
          # - kubernetes.io/hostname: different nodes
          # - topology.kubernetes.io/zone: different zones

# TO TEST:
# Schedule another Pod with:
# labels:
#   app: backend
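
The example above keeps one Pod away from another app's Pods. A common variation, matching the high-availability bullet earlier, is to spread the replicas of a single Deployment across nodes by matching the Deployment's own label. A minimal sketch, assuming a hypothetical Deployment named web labelled app=web:

# WHAT: Deployment whose replicas repel each other
# WHY:  Ensure no two 'app=web' Pods share a node, so a single
#       node failure cannot take out all replicas

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web              # match this Deployment's own Pods
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: web
          image: nginx

# NOTE: A hard rule needs at least as many schedulable nodes
#       as replicas; use preferred anti-affinity for a softer spread.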

Node Affinity vs Pod Affinity vs Pod Anti-Affinity #


| Feature | Description | Use Case |
| --- | --- | --- |
| Node Affinity | Schedules Pods on nodes based on specific node labels and conditions | Targeting nodes with defined hardware or location (e.g., ML workloads needing GPUs) |
| Pod Affinity | Schedules Pods close to other related Pods (same node, zone, or region) | Services that work closely together (e.g., web and cache Pods) |
| Pod Anti-Affinity | Prevents Pods from being scheduled near other specific Pods | High availability; spreading replicas (e.g., databases and queues that need high availability and durability) |

What are Taints and Tolerations? #


  • Prevent Pods from Running on Unsuitable Nodes: Taints let you mark a node as unsuitable for general Pods unless they explicitly tolerate it
  • Use Taints to Repel Pods: Nodes can be tainted to repel Pods that don’t match
  • Allow Specific Pods Using Tolerations: Pods add tolerations that match taints, signaling they’re allowed to run on those nodes
    • Taint = Node with “Only Certified Workloads Allowed” Sign
    • Toleration = Pod with “Certified Workloads” Badge
  • Isolate Workloads Based on Node Roles: Run sensitive or special workloads (like GPU jobs or system daemons) only on tainted nodes, avoiding interference
  • Control Scheduling Behavior Precisely: Combine taints and tolerations to enforce rules such as:
    • Don’t schedule general Pods on GPU nodes (NoSchedule)
    • Prefer not to run Pods on preemptible nodes (PreferNoSchedule)
    • Evict Pods if a taint is added at runtime (NoExecute)
  • DO YOU KNOW?: Kubernetes taints master/control-plane nodes to prevent regular workloads (Pods) from running on them
    • kubectl describe node <master-node-name> -> Taints: node-role.kubernetes.io/master:NoSchedule ("Don’t schedule Pods here unless they tolerate this taint."); on newer versions the taint key is node-role.kubernetes.io/control-plane

Practical Example: Taints and Tolerations #


# ---------------------------------------------
# STEP 1: Taint the Node (done separately via CLI)
#
# Command:
# kubectl taint nodes node-1 teamA=true:NoSchedule
#
# WHAT: Adds a taint to 'node-1'
# WHY:  Prevent Pods from scheduling unless they tolerate it
# FORMAT: key=value:Effect
# EFFECTS:
# - NoSchedule: Don't schedule unless tolerated
# - PreferNoSchedule: Try to avoid if possible
# - NoExecute: Evict existing Pods that don't tolerate
# ---------------------------------------------

# ---------------------------------------------
# STEP 2: Pod with Matching Toleration
# ---------------------------------------------

apiVersion: v1
kind: Pod
metadata:
  name: team-a-app
spec:
  containers:
    - name: nginx
      image: nginx
      # Simple web server container

  tolerations:
    - key: "teamA"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
      # WHAT: Allows scheduling onto nodes with
      #        taint 'teamA=true:NoSchedule'
      # WHY:  Pod explicitly says "I tolerate this taint"
      # WHEN: Node is tainted for team-specific workloads

# ---------------------------------------------
# VARIATION: Tolerate Any Taint with Specific Effect
#
# tolerations:
# - operator: "Exists"
#   effect: "NoSchedule"
#
# WHAT: Tolerates any taint with effect 'NoSchedule'
# WHY:  Used when key/value don’t matter
# ---------------------------------------------

# Master Node
# Taints: node-role.kubernetes.io/master:NoSchedule
#  
# To Run something on Master Node
# Add a toleration in the spec
# tolerations:
#  - key: "node-role.kubernetes.io/master"
#    operator: "Exists"
#    effect: "NoSchedule"
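
To verify a taint or remove it later, the following commands help. A short sketch, reusing the node-1 name from STEP 1:

# Show the taints currently on the node
kubectl describe node node-1 | grep -i taints

# Remove the taint when it is no longer needed (note the trailing '-')
kubectl taint nodes node-1 teamA=true:NoSchedule-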

What is a Pod Disruption Budget? #


  • Prevent Too Many Pods from Going Down at Once: Pod Disruption Budgets (PDBs) ensure a minimum number of replicas stay available during voluntary disruptions
  • Handle Planned Maintenance Safely: During node upgrades, autoscaling, or draining, PDBs prevent all replicas from being evicted at the same time
  • Maintain Application Availability: Keep enough Pods running to serve traffic
  • Protect Stateful and Critical Apps: Use PDBs for databases, APIs, or services that require high availability — even during cluster changes
  • Note on Rolling Updates: A Deployment’s own update strategy (maxUnavailable/maxSurge) controls rolling updates; PDBs guard eviction-based disruptions such as node drains and cluster autoscaler scale-downs

# WHAT: Pod Disruption Budget (PDB)
# WHY:  Ensures availability during voluntary
#       disruptions (e.g., node upgrades)
# WHEN: Use for critical apps needing high uptime

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  # WHAT: Keep at least 1 Pod always running
  # WHY:  Prevent full downtime during drains
  # OPTIONS:
  # - minAvailable: count or %
  # - maxUnavailable: count or %
  # NOTE: Use only one at a time

  selector:
    matchLabels:
      app: my-app
      # WHAT: Target Pods with this label
      # WHY:  Apply PDB only to specific Pods

# -----------------------------------------
# VARIATION: Use maxUnavailable instead
#
# spec:
#   maxUnavailable: 1
#
# WHAT: Allow disruption of 1 Pod at a time
# -----------------------------------------
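
Once the PDB is applied, you can check how many voluntary disruptions it currently allows. A short sketch, assuming the my-app-pdb object above and Pods labelled app: my-app exist:

# Show the budget and the disruptions currently allowed
kubectl get pdb my-app-pdb
kubectl describe pdb my-app-pdb

# A node drain respects the budget: evictions that would drop
# availability below 'minAvailable' are refused and retried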

Cordon vs Drain #


| | Cordon | Drain |
| --- | --- | --- |
| Function | Marks the node as unschedulable; prevents new Pods from being scheduled | Evicts all Pods from the node; prepares the node for maintenance |
| Pod Removal | Does not remove existing Pods; they continue running | Evicts and gracefully terminates Pods; their controllers reschedule them onto other nodes |
| Use Case | Stop new workloads without disrupting running ones | Remove all workloads to safely reboot, upgrade, or decommission the node |
| Effect on Node | Node remains active and continues serving current Pods | Node becomes empty and can be safely maintained or removed |
| Command | kubectl cordon <node-name> | kubectl drain <node-name> |
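
In practice, the two commands bracket a maintenance window, as sketched below; the node name is a placeholder and the drain flags depend on your workloads and kubectl version:

# Stop new Pods from being scheduled onto the node
kubectl cordon node-1

# Evict existing Pods so the node can be maintained
# (--ignore-daemonsets is usually required; recent kubectl versions
#  name the emptyDir flag --delete-emptydir-data)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# ... perform the upgrade or reboot ...

# Make the node schedulable again afterwards
kubectl uncordon node-1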

Debugging Problems with Starting Up Pods #


  • No Pod is Running: Pod is not created or crashes immediately due to issues in its definition or during startup
    • Focus on Pod Configuration First!
    • (COMMON ISSUE) Something is wrong in Pod definition or startup: Check for YAML errors, invalid image names, missing configs, volume mount failures, or init container errors
    • Use Diagnostic Commands: Use kubectl get pods, kubectl describe pod, and kubectl logs to identify startup issues
  • Some Pods Running, Others Not: Some Pods are scheduled and running, but others are stuck in Pending state
    • (COMMON ISSUE) Pod scheduling is blocked due to cluster/node constraints: Issues with resources, taints, affinity, node selectors, or Anti-Affinity can prevent scheduling
    • Use Scheduling Diagnostics: Use kubectl describe pod, kubectl describe node, and kubectl get nodes --show-labels to debug scheduling issues (see the command sketch below)
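
A combined set of diagnostic commands for both situations is sketched below; my-pod is a placeholder Pod name.

# Pod not starting: inspect its definition, events, and logs
kubectl get pods
kubectl describe pod my-pod        # Events show image pull, mount, or init errors
kubectl logs my-pod                # container logs (add -c <container> if needed)
kubectl logs my-pod --previous     # logs from the last crashed container

# Pods stuck in Pending: inspect scheduling constraints
kubectl describe pod my-pod        # Events explain why scheduling failed
kubectl get nodes --show-labels    # do node labels match affinity/nodeSelector rules?
kubectl describe node <node-name>  # taints, allocatable resources, conditions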