How to Handle Spot Instance Interruptions on EKS with Zero Downtime
A practical guide to running Spot instances on Amazon EKS without service disruption, using Karpenter, PodDisruptionBudgets, and EventBridge.
“Spot instances are too risky for production.”
That’s the most common objection I hear from DevOps engineers. And it’s wrong. With the right architecture, you can run production workloads on Spot instances with 70% cost savings and zero downtime during interruptions. Here’s exactly how.
The Fear (and Why It’s Overblown)
The concern is legitimate on the surface: AWS can reclaim a Spot instance with just 2 minutes of notice. Without preparation, your pods get terminated, requests fail, and users see errors.
But “without preparation” is the key phrase. With proper interruption handling, Spot becomes a production-ready capacity option — not a gamble.
The Architecture: Karpenter + EventBridge + SQS + PDB
The zero-downtime Spot architecture has four components working together:
- AWS EventBridge receives the Spot interruption warning from AWS
- SQS Queue buffers the notification for reliable delivery
- Karpenter polls the SQS queue, detects the warning, and provisions a replacement node before draining the old one
- PodDisruptionBudget ensures a minimum number of pods are always running during the migration
This event-driven approach gives you a controlled, graceful migration rather than a chaotic termination.
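If you install Karpenter with Helm, wiring it to the SQS queue is a single setting — recent Karpenter charts expose it as `settings.interruptionQueue`. A sketch of the relevant Helm values (cluster and queue names are illustrative):

```yaml
# values.yaml for the Karpenter Helm chart
settings:
  clusterName: my-eks-cluster
  # Karpenter polls this queue for the Spot interruption warnings
  # that the EventBridge rule delivers
  interruptionQueue: karpenter-interruption-queue
```

With this set, no extra daemonset (like the older Node Termination Handler) is needed; Karpenter itself handles the cordon-and-drain.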
The Timeline: From Warning to Recovery
Here’s what happens during an actual Spot interruption, based on a live demo:
| Time | Event |
|---|---|
| T+0s | Interruption warning received via EventBridge → SQS |
| T+10s | Karpenter detects warning, cordons the node |
| T+30s | New Spot node begins provisioning |
| T+50s | New node reaches Ready status |
| T+60s | First batch of pods migrated (respecting PDB) |
| T+90s | All 5 pods running on the new node |
| T+120s | Old node terminated by AWS |
Total elapsed time: about 90 seconds from warning to fully recovered. Service disruption: zero.
The Secret Weapon: PodDisruptionBudget
The PodDisruptionBudget (PDB) is the component that turns a risky migration into a safe one. Without it, Kubernetes can evict all your pods simultaneously during a node drain, creating a window where no pods are serving traffic.
Without PDB:
- All 5 pods evicted at once
- Service down for 30-40 seconds
- Users see errors
With PDB (minAvailable: 3):
- Only 2 pods evicted at a time
- 3 pods always serving traffic
- New pods scheduled before more evictions begin
- Service never goes down
The configuration is minimal. A sketch (resource and label names are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3        # never let fewer than 3 pods run during voluntary disruptions
  selector:
    matchLabels:
      app: my-app        # must match your Deployment's pod labels
```
This tells Kubernetes: “Never let the number of running pods for this app drop below 3, regardless of what’s happening to the underlying nodes.”
The Cost Math
The financial case for Spot is straightforward:
| Capacity Type | Monthly Cost | Interruption Risk | With Handling |
|---|---|---|---|
| On-Demand | ~$100 | None | N/A |
| Spot | ~$30 | ~5% per month | Zero downtime |
You’re paying more than 3x as much for On-Demand “stability” that proper Spot handling already provides. The ~5% monthly interruption rate translates to maybe one or two interruptions per month — each resolved automatically in about 90 seconds with no service impact.
Karpenter Node Consolidation: The Bonus
Beyond interruption handling, Karpenter also consolidates underutilized nodes automatically. After a traffic spike subsides and you scale down your pods, Karpenter detects the wasted capacity and migrates remaining pods to fewer, smaller nodes.
In one demo, scaling from 10 pods to 2 caused Karpenter to consolidate from 4 nodes down to 1 — saving 75% on compute costs without any manual intervention. The consolidation configuration, sketched against the Karpenter v1 NodePool API (values are illustrative):

```yaml
# Excerpt from a Karpenter NodePool spec
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized  # reclaim underutilized nodes, not just empty ones
  consolidateAfter: 30s                          # how long to wait before consolidating
```
Cluster Autoscaler only removes completely empty nodes. Karpenter removes underutilized nodes, which is far more aggressive at eliminating waste.
Production Tips
- Set `consolidateAfter` to 60–120 seconds in production (30s is fine for demos but too aggressive for real workloads)
- Use PodDisruptionBudgets on every production deployment — Karpenter respects them during consolidation too
- Diversify instance types in your Spot NodePool — the more instance types Karpenter can choose from, the lower the interruption rate
- Protect critical pods with the annotation `karpenter.sh/do-not-disrupt: "true"` for workloads that absolutely cannot move
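Putting the diversification tip into practice, a Spot NodePool might look like this — a sketch against the Karpenter v1 NodePool API, where the instance categories, CPU limit, and EC2NodeClass name are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      requirements:
        # Only request Spot capacity in this pool
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Allow several instance families so Karpenter can pick
        # the pools with the lowest interruption rates
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"   # cap total CPU this pool can provision
```

The wider the requirements, the more Spot pools Karpenter can draw from — which is exactly what keeps the effective interruption rate low.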
Getting Started
If you’re running On-Demand nodes on EKS today, start by adding a Spot NodePool alongside your existing On-Demand one. Move non-critical workloads to Spot first, verify the interruption handling works, then gradually shift more workloads over.
I demonstrate the complete setup — Spot NodePool configuration, EventBridge + SQS Terraform automation, PDB strategies, and a live interruption simulation — in Section 17 of my Ultimate DevOps Real-World Project on AWS course. You’ll watch Karpenter handle an interruption in real time with zero service impact. All source code is available on GitHub.
For weekly DevOps insights and cost optimization tips, subscribe to the newsletter.