
Canary Deployments for ML Models with KServe and Istio


You do canary deployments for APIs every day. Why not for ML models?

New model ready. Looks good in testing. Deploy to production. Hope it works. It doesn’t. Rollback takes 5 minutes. Five minutes of garbage predictions. Damage done.

Canary Rollouts for ML


How It Works

| Role | Traffic | Description |
| --- | --- | --- |
| Champion (80%) | Production traffic | Current model, proven, stable |
| Canary (20%) | Test traffic | New version, running alongside |

Both run simultaneously. Same endpoint. Istio handles the traffic split.


The 4-Step Process

Step 1: Deploy your champion model.

Step 2: Add canaryTrafficPercent: 20 to the KServe InferenceService.

Step 3: KServe and Istio route traffic automatically: 80% to the champion pods, 20% to the canary pods.

Step 4: Evaluate and decide.

  • Canary good? Promote to champion. Takes seconds.
  • Canary bad? Remove the traffic split. Zero impact.
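The whole four-step process maps to a single field on the KServe `InferenceService`. A minimal sketch (the service name, model format, and storage URI are hypothetical):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model                      # hypothetical service name
spec:
  predictor:
    # Route 20% of traffic to the latest revision (the canary);
    # the previous revision (the champion) keeps the other 80%.
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: sklearn                    # assumed model format
      storageUri: gs://models/fraud/v2   # hypothetical candidate model
```

Promoting the canary means removing `canaryTrafficPercent` (the latest revision then defaults to 100% of traffic); rolling back means setting it to 0. Either way it is a one-field edit, which is why both take seconds.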

The same pattern you use for microservices. The same Istio you already know. Applied to ML models.


Canary vs Full Cutover

| Approach | Risk | Rollback Time |
| --- | --- | --- |
| Full cutover | 100% traffic hits new model | Minutes (redeploy) |
| Canary | Only 20% traffic at risk | Seconds (remove split) |

What to Monitor During Canary

While the canary is running, watch these metrics across both models:

  • Prediction latency (P50, P95, P99). New model significantly slower? Problem.
  • Error rate. Any 5xx responses from the canary? Kill it immediately.
  • Prediction distribution. Is the canary predicting significantly differently from the champion? Could mean a bug in preprocessing.
  • Business metrics. If you can track downstream outcomes (conversion, fraud caught), compare them.
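To illustrate the prediction-distribution check, one simple approach is a two-sample Kolmogorov-Smirnov statistic (the maximum gap between the two empirical CDFs) over prediction scores sampled from each model. The prediction arrays and the alert threshold below are made up for the sketch:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))

    def ecdf(xs, v):
        # Fraction of samples in xs that are <= v.
        return bisect.bisect_right(xs, v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in points)

# Hypothetical prediction scores sampled from both models.
champion = [0.12, 0.30, 0.31, 0.45, 0.52, 0.66, 0.70, 0.81]
canary   = [0.11, 0.29, 0.33, 0.47, 0.50, 0.64, 0.72, 0.80]

drift = ks_statistic(champion, canary)
THRESHOLD = 0.3  # made-up alert threshold; tune per model
if drift > THRESHOLD:
    print(f"ALERT: canary predictions diverge (KS={drift:.2f})")
else:
    print(f"OK: distributions look similar (KS={drift:.2f})")
```

In production you would run this continuously over a sliding window of logged predictions rather than on fixed arrays, but the decision logic is the same: a large gap between the two distributions is a signal to pull the canary.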

The canary period should last long enough to see real traffic patterns. For most models, 24-48 hours covers enough variety in user behavior.


The DevOps Parallel

You already know this pattern.

Istio traffic splitting works the same way for APIs and ML models. KServe adds ML-specific features: model format support, GPU scheduling, and scale-to-zero for the canary when testing is done.
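As a sketch of that last point, scale-to-zero is a per-predictor setting, assuming KServe is running in Knative serverless mode (service name and storage URI are hypothetical):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model                      # hypothetical service name
spec:
  predictor:
    minReplicas: 0                       # let idle revisions scale to zero (Knative mode)
    canaryTrafficPercent: 0              # park the canary with no live traffic
    model:
      modelFormat:
        name: sklearn                    # assumed model format
      storageUri: gs://models/fraud/v2   # hypothetical candidate model
```

With no traffic routed to it and `minReplicas: 0`, the canary's pods drain away on their own, so a paused experiment costs nothing to keep around.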

Same infrastructure. Same Istio. Same observability stack. Applied to ML models.


This is Part 6 of the MLOps for DevOps Engineers series. Next: the two-container pattern for separating preprocessing from inference.

For weekly updates, join the newsletter.

Kalyan Reddy Daida

Instructor with 383,000+ students across 21 courses on AWS, Azure, GCP, Terraform, Kubernetes & DevOps. Sharing real-world patterns from production environments.
