How to Handle Spot Instance Interruptions on EKS with Zero Downtime
A practical guide to running Spot instances on Amazon EKS without service disruption, using Karpenter, PodDisruptionBudgets, and EventBridge.
“Spot instances are too risky for production.”
That’s the most common objection I hear from DevOps engineers. And it’s wrong. With the right architecture, you can run production workloads on Spot instances with 70% cost savings and zero downtime during interruptions. Here’s exactly how.
The Fear (and Why It’s Overblown)
The concern is legitimate on the surface: AWS can reclaim a Spot instance with just 2 minutes of notice. Without preparation, your pods get terminated, requests fail, and users see errors.
But “without preparation” is the key phrase. With proper interruption handling, Spot becomes a production-ready capacity option — not a gamble.
The Architecture: Karpenter + EventBridge + SQS + PDB
The zero-downtime Spot architecture has four components working together:
- AWS EventBridge receives the Spot interruption warning from AWS
- SQS Queue buffers the notification for reliable delivery
- Karpenter polls the SQS queue, detects the warning, and provisions a replacement node before draining the old one
- PodDisruptionBudget ensures a minimum number of pods are always running during the migration
This event-driven approach gives you a controlled, graceful migration rather than a chaotic termination.
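If you install Karpenter with Helm, wiring it to the SQS queue is a single setting — recent Karpenter charts expose it as `settings.interruptionQueue`. A sketch of the relevant Helm values (cluster and queue names are illustrative):

```yaml
# values.yaml for the Karpenter Helm chart
settings:
  clusterName: my-eks-cluster
  # Karpenter polls this queue for the Spot interruption warnings
  # that the EventBridge rule delivers
  interruptionQueue: karpenter-interruption-queue
```

With this set, no extra daemonset (like the older Node Termination Handler) is needed; Karpenter itself handles the cordon-and-drain.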
The Timeline: From Warning to Recovery
Here’s what happens during an actual Spot interruption, based on a live demo:
| Time | Event |
|---|---|
| T+0s | Interruption warning received via EventBridge → SQS |
| T+10s | Karpenter detects warning, cordons the node |
| T+30s | New Spot node begins provisioning |
| T+50s | New node reaches Ready status |
| T+60s | First batch of pods migrated (respecting PDB) |
| T+90s | All 5 pods running on the new node |
| T+120s | Old node terminated by AWS |
Total elapsed time: about 90 seconds from warning to fully recovered. Service disruption: zero.
The Secret Weapon: PodDisruptionBudget
The PodDisruptionBudget (PDB) is the component that turns a risky migration into a safe one. Without it, Kubernetes can evict all your pods simultaneously during a node drain, creating a window where no pods are serving traffic.
Without PDB:
- All 5 pods evicted at once
- Service down for 30-40 seconds
- Users see errors
With PDB (minAvailable: 3):
- Only 2 pods evicted at a time
- 3 pods always serving traffic
- New pods scheduled before more evictions begin
- Service never goes down
The configuration is minimal. A sketch (resource and label names are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3        # never let fewer than 3 pods run during voluntary disruptions
  selector:
    matchLabels:
      app: my-app        # must match your Deployment's pod labels
```
This tells Kubernetes: “Never let the number of running pods for this app drop below 3, regardless of what’s happening to the underlying nodes.”
The Cost Math
The financial case for Spot is straightforward:
| Capacity Type | Monthly Cost | Interruption Risk | With Handling |
|---|---|---|---|
| On-Demand | ~$100 | None | N/A |
| Spot | ~$30 | ~5% per month | Zero downtime |
You’re paying more than 3x as much for On-Demand “stability” that proper Spot handling already provides. The ~5% monthly interruption rate translates to maybe one or two interruptions per month — each resolved automatically in about 90 seconds with no service impact.
Karpenter Node Consolidation: The Bonus
Beyond interruption handling, Karpenter also consolidates underutilized nodes automatically. After a traffic spike subsides and you scale down your pods, Karpenter detects the wasted capacity and migrates remaining pods to fewer, smaller nodes.
In one demo, scaling from 10 pods to 2 caused Karpenter to consolidate from 4 nodes down to 1 — saving 75% on compute costs without any manual intervention. The consolidation configuration, sketched against the Karpenter v1 NodePool API (values are illustrative):

```yaml
# Excerpt from a Karpenter NodePool spec
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized  # reclaim underutilized nodes, not just empty ones
  consolidateAfter: 30s                          # how long to wait before consolidating
```
Cluster Autoscaler only removes completely empty nodes. Karpenter removes underutilized nodes, which is far more aggressive at eliminating waste.
Production Tips
- Set `consolidateAfter` to 60–120 seconds in production (30s is fine for demos but too aggressive for real workloads)
- Use PodDisruptionBudgets on every production deployment — Karpenter respects them during consolidation too
- Diversify instance types in your Spot NodePool — the more instance types Karpenter can choose from, the lower the interruption rate
- Protect critical pods with the annotation `karpenter.sh/do-not-disrupt: "true"` for workloads that absolutely cannot move
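Putting the diversification tip into practice, a Spot NodePool might look like this — a sketch against the Karpenter v1 NodePool API, where the instance categories, CPU limit, and EC2NodeClass name are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      requirements:
        # Only request Spot capacity in this pool
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Allow several instance families so Karpenter can pick
        # the pools with the lowest interruption rates
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"   # cap total CPU this pool can provision
```

The wider the requirements, the more Spot pools Karpenter can draw from — which is exactly what keeps the effective interruption rate low.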
Getting Started
If you’re running On-Demand nodes on EKS today, start by adding a Spot NodePool alongside your existing On-Demand one. Move non-critical workloads to Spot first, verify the interruption handling works, then gradually shift more workloads over.
I demonstrate the complete setup — Spot NodePool configuration, EventBridge + SQS Terraform automation, PDB strategies, and a live interruption simulation — in Section 17 of my Ultimate DevOps Real-World Project on AWS course. You’ll watch Karpenter handle an interruption in real time with zero service impact. All source code is available on GitHub.
For weekly DevOps insights and cost optimization tips, subscribe to the newsletter.