
5 Things I Wish I Knew Before Running EKS in Production

Hard-won lessons from running Amazon EKS in production — from Karpenter node consolidation to OpenTelemetry observability and real AWS database integrations.

4 min read

Running Amazon EKS in a tutorial and running it in production are two very different experiences. After deploying a 5-microservice retail store application with real AWS services, here are the five lessons that would have saved me time, money, and plenty of late-night debugging sessions.

1. Cluster Autoscaler Doesn’t Consolidate Nodes

Cluster Autoscaler only removes empty nodes. If a node is running a single tiny pod at 10% utilization, it stays — and you keep paying for it.

Karpenter takes a different approach. It detects underutilized nodes and consolidates pods onto fewer, right-sized instances automatically. In my demo, I watched 4 nodes collapse into 1 after scaling down — no manual intervention needed.

The configuration is straightforward:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s

Those three lines of YAML enabled automatic cost optimization that Cluster Autoscaler simply cannot match.
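For context, that disruption block lives inside a Karpenter NodePool resource. A minimal sketch, assuming the Karpenter v1 API (the NodePool and EC2NodeClass names here are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                     # placeholder name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # placeholder; must match an existing EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s           # wait 30s before consolidating an eligible node
```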

2. Spot Instances Will Interrupt Without Warning

Spot instances offer around 70% savings over On-Demand, but AWS can reclaim them with just 2 minutes of notice. Without proper handling, your pods vanish mid-deployment and users see errors.

The solution is an event-driven architecture: Karpenter + EventBridge + SQS. When AWS sends an interruption warning, EventBridge forwards it to an SQS queue. Karpenter detects the message, provisions a new node, and migrates pods before the old node dies.
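Wiring this up mostly means pointing Karpenter at the interruption queue. A hedged sketch of the relevant Helm values for the Karpenter chart (the cluster and queue names are placeholders, and the queue plus EventBridge rules are typically provisioned separately, e.g. via Terraform):

```yaml
# values.yaml for the Karpenter Helm chart
settings:
  clusterName: my-cluster           # placeholder EKS cluster name
  interruptionQueue: my-cluster     # SQS queue that receives EventBridge interruption events
```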

Combined with a PodDisruptionBudget that keeps a minimum number of pods running at all times, the result is zero downtime — even during Spot reclamations:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: my-app

This single resource is the difference between “Spot is risky” and “Spot is production-ready.”

3. Hardcoding Secrets Will Haunt You

Environment variables embedded in YAML manifests are a fast path to a security incident. Secrets end up in Git history, they don’t rotate, and one leaked file exposes everything.

The production-grade approach: AWS Secrets Manager + the Secrets Store CSI Driver. Secrets are stored externally, mounted into pods at runtime, and can rotate on a schedule without restarting your application. Your security team will thank you.
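Concretely, this pattern is driven by a SecretProviderClass that the CSI driver reads at pod start. A minimal sketch, assuming the AWS provider for the Secrets Store CSI Driver (the resource name and the Secrets Manager secret path are hypothetical):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-app-secrets                              # hypothetical name
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/my-app/db-credentials"    # hypothetical Secrets Manager secret
        objectType: "secretsmanager"
```

The pod then mounts this through a `csi` volume with driver `secrets-store.csi.k8s.io`, and the secret surfaces as a file at the mount path rather than living in your manifests or Git history.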

4. Observability Isn’t Just “Install Prometheus”

True production observability requires three pillars — and each one answers a different question:

  • Traces (AWS X-Ray): Which service is slow? Where does the latency come from?
  • Logs (Amazon CloudWatch): What happened? What was the error message?
  • Metrics (Amazon Managed Prometheus + Grafana): What are the trends? Is CPU spiking?

I deployed three separate AWS Distro for OpenTelemetry (ADOT) collectors — one for each pillar — because they require different deployment modes. Traces use a Deployment (apps push data), Logs use a DaemonSet (reads from each node), and Metrics use a Deployment (pull-based scraping).

One critical lesson: without filtering health check endpoints from traces, X-Ray costs can explode. Kubernetes probes hit /health every 10 seconds, generating thousands of useless traces per hour. Adding an OpenTelemetry filter rule reduced my tracing costs by 85%.
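The rule itself is a filter processor in the collector configuration. A sketch of what that might look like (the exact span attribute key depends on your semantic-conventions version; `http.target` is assumed here):

```yaml
processors:
  filter/drop-health-checks:
    error_mode: ignore
    traces:
      span:
        # drop any span whose HTTP target is the health probe endpoint
        - 'attributes["http.target"] == "/health"'

service:
  pipelines:
    traces:
      processors: [filter/drop-health-checks]
```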

5. Tutorial Databases Are Nothing Like Production

Tutorials love SQLite. Production needs something very different.

My retail store application uses five AWS managed services for its data layer:

  • RDS MySQL: Product catalog data
  • RDS PostgreSQL: Order processing
  • DynamoDB: Shopping cart (low-latency key-value)
  • ElastiCache: Session caching and checkout state
  • SQS: Asynchronous order queue

Each service requires connection pooling, retry logic, and secrets rotation. The gap between a tutorial’s single-database setup and a real production data plane is enormous — and it’s where most engineers struggle when making the jump.


Key Takeaways

  1. Use Karpenter over Cluster Autoscaler for automatic node consolidation
  2. Handle Spot interruptions with EventBridge + SQS + PodDisruptionBudgets
  3. Externalize secrets with AWS Secrets Manager + CSI Driver
  4. Deploy three ADOT collectors — one per observability pillar — and filter health check traces
  5. Use real AWS managed services with proper connection pooling and retry logic

Every one of these patterns is something I cover with live demos in my Ultimate DevOps Real-World Project on AWS course, with complete Terraform provisioning from VPC to observability stack. The full source code is open on GitHub.

For more practical DevOps tips delivered weekly, join the newsletter.
