ML Cost Optimization: One YAML Field Cut Our Bill by 80%

We changed minReplicas from 1 to 0. Infrastructure cost dropped 80%. Here is how KPA, scale-to-zero, and panic mode work for ML inference.

We changed one YAML field from 1 to 0. Infrastructure cost dropped 80%.

The field: minReplicas.

When set to 1, your ML inference pod runs 24/7. Even at 3 AM when nobody is making predictions. That's $50-150 per month per model, running idle.

When set to 0, the pod scales to zero when idle. Traffic arrives, the pod spins up. Traffic stops, the pod disappears. You pay only for what you use.
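For orientation, here is a rough sketch of where that field lives, assuming a KServe `InferenceService` running in serverless (Knative) mode. The service name, model format, and storage URI are hypothetical placeholders, not values from our setup.

```yaml
# Minimal sketch. name, modelFormat, and storageUri are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model                  # hypothetical service name
spec:
  predictor:
    minReplicas: 0                  # was 1; 0 lets the predictor scale to zero when idle
    model:
      modelFormat:
        name: sklearn               # assumed model format
      storageUri: gs://example-bucket/models/demo   # placeholder URI
```

Everything else in the manifest stays the same; only `minReplicas` changes.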

KPA vs HPA

| Autoscaler | Watches | Scales to Zero? |
|---|---|---|
| Kubernetes HPA | CPU and memory | No. Minimum of 1 pod by default |
| Knative KPA | Concurrency (requests per pod) | Yes. Zero pods when idle |

Same cluster. Same pods. Different autoscaler. Very different bill.


How to Tune KPA

Three parameters that matter most:

| Parameter | What It Does | Our Setting |
|---|---|---|
| scaleTarget | Requests per pod before scaling out | 2 (default 10) |
| minReplicas | Minimum pods | 0 (scale-to-zero) |
| window | Observation period | 30s |
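The three settings above could be expressed roughly like this. It is a sketch under assumptions: it uses KServe's `scaleMetric` and `scaleTarget` predictor fields, sets the window through the Knative annotation `autoscaling.knative.dev/window`, and assumes KServe passes that annotation through to the underlying Knative revision. The name, model format, and storage URI are placeholders.

```yaml
# Sketch only. name, modelFormat, and storageUri are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model
  annotations:
    # Observation (stable) window used by the Knative autoscaler.
    autoscaling.knative.dev/window: "30s"
spec:
  predictor:
    minReplicas: 0             # scale-to-zero
    scaleMetric: concurrency   # scale on in-flight requests per pod
    scaleTarget: 2             # scale out once pods average about 2 concurrent requests
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/demo
```

A lower `scaleTarget` adds pods sooner, which suits models where a single prediction keeps a pod busy for a while.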

Panic Mode

Traffic spikes 2x in 6 seconds? KPA switches to panic mode and scales up right away, without waiting for the full observation window. New pods are requested immediately, though each one still needs time to start and load the model.

Once traffic stabilizes, KPA switches back to stable mode.
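The "2x in 6 seconds" figure matches Knative's default panic settings: a 60s stable window, a panic window of 10% of that, and a panic threshold of 200%. The sketch below writes those defaults out as per-service annotations so the relationship is visible. It assumes, as before, that KServe forwards these annotations to the Knative revision; the name and model details are placeholders.

```yaml
# Sketch only. Values shown are Knative's defaults written out explicitly.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model               # hypothetical service name
  annotations:
    autoscaling.knative.dev/window: "60s"                        # stable window
    autoscaling.knative.dev/panic-window-percentage: "10.0"      # 10% of 60s = 6s panic window
    autoscaling.knative.dev/panic-threshold-percentage: "200.0"  # panic at 2x current capacity
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn            # assumed model format
      storageUri: gs://example-bucket/models/demo   # placeholder URI
```

Because the panic window is a percentage of the stable window, shortening the window to 30s without touching the percentage would also shorten the panic window to 3s.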


The Trade-off: Cold Start

| Scenario | First Request Latency |
|---|---|
| Pod already running | Instant (milliseconds) |
| Scale-from-zero | 15-30 seconds (model loading) |

When Scale-to-Zero Is Wrong

| Use Case | minReplicas |
|---|---|
| Real-time fraud detection | 1 (never scale to zero, cold start = unblocked fraud) |
| Internal batch scoring | 0 (save 23 hours of compute) |
| Dev/staging | 0 (nobody watching at midnight) |
| Low-traffic models | 0 (best cost-performance ratio) |

Match the scaling strategy to the business requirement. (See also Part 8: Scale-to-Zero fundamentals.)
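As an illustration of the table above, two services in the same cluster might differ only in `minReplicas`. This is a sketch with hypothetical service names, model formats, and storage URIs.

```yaml
# Sketch only. Names, model formats, and storage URIs are hypothetical.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    minReplicas: 1               # latency-critical: keep one warm pod, no cold start
    model:
      modelFormat:
        name: xgboost
      storageUri: gs://example-bucket/models/fraud
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: batch-scorer
spec:
  predictor:
    minReplicas: 0               # internal batch job: a cold start is acceptable
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/batch
```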


This is Part 18 of the MLOps for DevOps Engineers series. For weekly updates, join the newsletter.

Kalyan Reddy Daida

Instructor with 383,000+ students across 21 courses on AWS, Azure, GCP, Terraform, Kubernetes & DevOps. Sharing real-world patterns from production environments.
