
How to Handle Spot Instance Interruptions on EKS with Zero Downtime

A practical guide to running Spot instances on Amazon EKS without service disruption, using Karpenter, PodDisruptionBudgets, and EventBridge.


“Spot instances are too risky for production.”

That’s the most common objection I hear from DevOps engineers. And it’s wrong. With the right architecture, you can run production workloads on Spot instances with 70% cost savings and zero downtime during interruptions. Here’s exactly how.

The Fear (and Why It’s Overblown)

The concern is legitimate on the surface: AWS can reclaim a Spot instance with just 2 minutes of notice. Without preparation, your pods get terminated, requests fail, and users see errors.

But “without preparation” is the key phrase. With proper interruption handling, Spot becomes a production-ready capacity option — not a gamble.

The Architecture: Karpenter + EventBridge + SQS + PDB

The zero-downtime Spot architecture has four components working together:

  1. AWS EventBridge receives the Spot interruption warning from AWS
  2. SQS Queue buffers the notification for reliable delivery
  3. Karpenter polls the SQS queue, detects the warning, and provisions a replacement node before draining the old one
  4. PodDisruptionBudget ensures a minimum number of pods are always running during the migration

This event-driven approach gives you a controlled, graceful migration rather than a chaotic termination.
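On the Karpenter side, the wiring is a single Helm value that points it at the SQS queue. A minimal sketch of the chart values, assuming a cluster named my-cluster and a queue named karpenter-interruption-queue (both names are placeholders):

```yaml
# values.yaml for the Karpenter Helm chart (names are placeholders).
# settings.interruptionQueue tells Karpenter which SQS queue to poll
# for the Spot interruption warnings that EventBridge forwards.
settings:
  clusterName: my-cluster
  interruptionQueue: karpenter-interruption-queue
```

The EventBridge rule and SQS queue themselves are typically created with Terraform or CloudFormation; the key is that the rule matches the "EC2 Spot Instance Interruption Warning" event and targets the queue Karpenter watches.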

The Timeline: From Warning to Recovery

Here’s what happens during an actual Spot interruption, based on a live demo:

| Time | Event |
|------|-------|
| T+0s | Interruption warning received via EventBridge → SQS |
| T+10s | Karpenter detects warning, cordons the node |
| T+30s | New Spot node begins provisioning |
| T+50s | New node reaches Ready status |
| T+60s | First batch of pods migrated (respecting PDB) |
| T+90s | All 5 pods running on the new node |
| T+120s | Old node terminated by AWS |

Total elapsed time: about 90 seconds from warning to fully recovered. Service disruption: zero.

The Secret Weapon: PodDisruptionBudget

The PodDisruptionBudget (PDB) is the component that turns a risky migration into a safe one. Without it, Kubernetes can evict all your pods simultaneously during a node drain, creating a window where no pods are serving traffic.

Without PDB:

  • All 5 pods evicted at once
  • Service down for 30-40 seconds
  • Users see errors

With PDB (minAvailable: 3):

  • Only 2 pods evicted at a time
  • 3 pods always serving traffic
  • New pods scheduled before more evictions begin
  • Service never goes down

The configuration is minimal:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: my-app
```

This tells Kubernetes: “Never let the number of running pods for this app drop below 3, regardless of what’s happening to the underlying nodes.”
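For the budget to have any effect, its selector must match a workload's pod labels. An illustrative Deployment that the PDB would protect (image and replica count are assumptions for the example):

```yaml
# Illustrative Deployment whose pods match the my-app-pdb selector.
# With 5 replicas and minAvailable: 3, at most 2 pods can be
# evicted at a time during a node drain.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.27   # placeholder image
```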

The Cost Math

The financial case for Spot is straightforward:

| Capacity Type | Monthly Cost | Interruption Risk | With Handling |
|---------------|--------------|---------------------|----------------|
| On-Demand | ~$100 | None | N/A |
| Spot | ~$30 | ~5% per month | Zero downtime |

You’re paying more than three times as much for On-Demand “stability” that proper Spot handling already provides. The ~5% monthly interruption rate translates to maybe one or two interruptions per month — each resolved automatically in about 90 seconds with no service impact.

Karpenter Node Consolidation: The Bonus

Beyond interruption handling, Karpenter also consolidates underutilized nodes automatically. After a traffic spike subsides and you scale down your pods, Karpenter detects the wasted capacity and migrates remaining pods to fewer, smaller nodes.

In one demo, scaling from 10 pods to 2 caused Karpenter to consolidate from 4 nodes down to 1 — saving 75% on compute costs without any manual intervention. The consolidation configuration:

```yaml
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
```

Cluster Autoscaler only removes completely empty nodes. Karpenter removes underutilized nodes, which is far more aggressive at eliminating waste.

Production Tips

  1. Set consolidateAfter to 60-120 seconds in production (30s is fine for demos but too aggressive for real workloads)
  2. Use PodDisruptionBudgets on every production deployment — Karpenter respects them during consolidation too
  3. Diversify instance types in your Spot NodePool — the more instance types Karpenter can choose from, the lower the interruption rate
  4. Protect critical pods with the annotation karpenter.sh/do-not-disrupt: "true" for workloads that absolutely cannot move
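Putting tips 1 and 3 together, a Spot NodePool might look like the following sketch (Karpenter v1 API; the nodeClassRef name and the instance-type list are assumptions for illustration):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumes an EC2NodeClass named "default" exists
      requirements:
        # Spot capacity only
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Diversified instance types lower the interruption rate
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5a.large", "m6i.large", "c5.large", "c6i.large"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 120s     # production-friendly, per tip #1
```

Note that the karpenter.sh/do-not-disrupt annotation from tip 4 goes on individual pods, not on the NodePool.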

Getting Started

If you’re running On-Demand nodes on EKS today, start by adding a Spot NodePool alongside your existing On-Demand one. Move non-critical workloads to Spot first, verify the interruption handling works, then gradually shift more workloads over.

I demonstrate the complete setup — Spot NodePool configuration, EventBridge + SQS Terraform automation, PDB strategies, and a live interruption simulation — in Section 17 of my Ultimate DevOps Real-World Project on AWS course. You’ll watch Karpenter handle an interruption in real time with zero service impact. All source code is available on GitHub.

For weekly DevOps insights and cost optimization tips, subscribe to the newsletter.
