
Multi-Model Serving on Kubernetes: 50 Models, One Cluster


50 models. 10 active. 40 at zero. One cluster.

That is the reality of a mature ML platform. Not one model per team. Not one namespace per endpoint. Dozens of models sharing infrastructure, scaling independently, and costing almost nothing when idle.

Most teams never get here. They get stuck at the single-model trap.

The Single-Model Trap

Team A deploys their fraud model. Gets its own namespace, its own Istio gateway, its own monitoring stack. Works great.

Team B deploys a churn predictor. Same setup. Another namespace, another gateway, another dashboard.

By model number 10:

  • 10 namespaces
  • 10 gateways
  • 10 sets of credentials
  • 10 monitoring stacks
  • An infrastructure team that spends more time on plumbing than enabling data scientists

It does not scale.


Multi-Model Architecture

The pattern that works: a shared inference pool.

Layer               Pattern
Cluster             One Kubernetes cluster, owned by the ML Platform team
Namespaces          One per team (not per model)
InferenceServices   Many KServe InferenceServices per namespace
Gateway             One shared Istio ingress, host-based routing
Monitoring          One Prometheus + Grafana stack, per-model metrics

Team Fraud deploys 8 models. Team Recommendations deploys 12. Team Risk deploys 30. All in their own namespaces. All sharing compute.
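Sharing compute safely means capping what each team can grab. A minimal sketch, assuming you enforce this with a standard Kubernetes ResourceQuota per team namespace (the limits below are illustrative, not from a real setup):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-fraud-quota
  namespace: team-fraud
spec:
  hard:
    requests.cpu: "32"       # illustrative cap; size per team's workload
    requests.memory: 128Gi
    limits.cpu: "64"
    limits.memory: 256Gi

One quota per team namespace keeps a runaway deployment from starving the shared pool.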


Multi-Model KServe YAML

Two models in one team namespace, both with scale-to-zero:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector-v2
  namespace: team-fraud
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector-v2
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor-v1
  namespace: team-fraud
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://models/churn-predictor-v1

Same namespace. Independent scaling. Independent scale-to-zero. Each model gets a unique hostname automatically.


Scale-to-Zero at 50-Model Scale

This is where it gets interesting. 50 models deployed. Only 10 active at any moment. The other 40 are scaled to zero. No pods. No compute. No cost.

When a request hits a sleeping model, Knative’s Activator intercepts it, spins up the pod, loads the model, and serves the prediction. Cold start: 15 to 30 seconds. Then warm until traffic stops again.
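If the wake-up penalty bites too often, Knative can hold the last pod for a while after traffic stops. A hedged sketch using Knative's scale-to-zero pod-retention annotation, assuming your KServe version propagates it to the underlying revision (the 10m value is illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector-v2
  namespace: team-fraud
  annotations:
    # Keep the last pod alive for 10 idle minutes before scaling to zero.
    # Longer retention means fewer cold starts but smaller savings.
    autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector-v2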

The Cost Math

Setup                                  Monthly      Annual       3-Year
Always-on (50 models × $55/mo)         $2,750       $33,000      $99,000
Scale-to-zero (~10 active, $12 avg)    $600         $7,200       $21,600
Savings                                ~$2,150/mo   $25,800/yr   $77,400

80% savings on a single cluster. From one architectural decision.

The catch: cold starts. For batch scoring and internal tools, fine. For customer-facing fraud detection, set minReplicas: 1 on those specific models. Sweet spot: 5-8 critical models always-on, everything else scale-to-zero.
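Pinning a critical model is a one-field change. A minimal sketch, reusing the fraud detector from above with one always-warm replica (maxReplicas is an illustrative ceiling):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector-v2
  namespace: team-fraud
spec:
  predictor:
    minReplicas: 1    # one warm pod at all times: no cold starts
    maxReplicas: 10   # illustrative ceiling; KPA scales within these bounds
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector-v2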

(See Part 8: Scale-to-Zero and Part 18: ML Cost Optimization for the full cost stack.)


Namespace Strategy

Three patterns. Pick one and commit.

Strategy                 Pattern                            Best For
By Team (recommended)    team-fraud, team-reco, team-risk   3+ ML teams, self-service
By Environment           ml-dev, ml-staging, ml-prod        Small orgs, <5 models total
By Model Type            realtime-serving, batch-scoring    Mixed SLA workloads

Hybrid: team-fraud-prod, team-fraud-staging. More namespaces, cleaner boundaries. Use labels to track team, environment, and cost-center regardless of strategy.
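A minimal sketch of the hybrid pattern with those labels applied (the label keys are illustrative; standardize on whatever your chargeback tooling expects):

apiVersion: v1
kind: Namespace
metadata:
  name: team-fraud-prod
  labels:
    team: fraud
    environment: prod
    cost-center: ml-platform   # hypothetical key for cost attribution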


Model Routing with Istio

Every InferenceService gets a unique hostname. Istio routes by Host header. One gateway. One IP. Different backends.

Request Flow

  1. Client POST to fraud-v2.ml-serving.example.com
  2. DNS resolves to the shared Istio IngressGateway IP
  3. Istio reads the Host header
  4. VirtualService routes to the correct KServe service
  5. If the pod is at zero, the Activator intercepts and spins it up
  6. Prediction returned

Custom Path-Based Routing

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-routing
spec:
  hosts: ["*.ml-serving.example.com"]
  gateways: [ml-gateway]
  http:
    - match:
        - uri:
            prefix: /v1/models/fraud
      route:
        - destination:
            host: fraud-detector-v2.team-fraud.svc
    - match:
        - uri:
            prefix: /v1/models/churn
      route:
        - destination:
            host: churn-predictor-v1.team-fraud.svc

Unlimited models. Path-based or host-based. Your choice.


Monitoring 50 Models on One Cluster

Cluster-level dashboards lie. You need per-model metrics.

Metric                 Why
Request count          Is this model being used?
Latency p50/p95/p99    Is it fast enough?
Error rate             Is it healthy?
Pod count              Scaled up or at zero?
Cold start frequency   How often does it wake up?
Model version          Which version is serving?

Useful Prometheus Queries

# Request rate per model
sum by (service_name) (rate(revision_request_count[5m]))

# P95 latency per model
histogram_quantile(0.95,
  sum by (service_name, le) (rate(revision_request_latency_bucket[5m])))

# Models with no traffic in 30 days (cleanup candidates)
sum by (service_name) (increase(revision_request_count[30d])) == 0

The third query is underrated. Stale models nobody calls still consume registry space and add complexity. Clean them up.
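You can turn that query into an alert instead of remembering to run it. A hedged sketch, assuming the Prometheus Operator's PrometheusRule CRD is installed (the names, namespace, and severity are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stale-model-alerts
  namespace: monitoring
spec:
  groups:
    - name: ml-serving
      rules:
        - alert: StaleModel
          # Fires for any model whose request counter has not moved in 30 days
          expr: sum by (service_name) (increase(revision_request_count[30d])) == 0
          labels:
            severity: info
          annotations:
            summary: "Model {{ $labels.service_name }} had no traffic in 30 days"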


The DevOps Parallel

If you have run microservices at scale, you already know this pattern.

Microservices                     Multi-Model ML
Shared K8s cluster                Shared K8s cluster
Per-team namespaces               Per-team namespaces
Service mesh routing              Istio host-based routing
Independent scaling per service   Independent scaling per InferenceService
HPA based on CPU/memory           KPA based on concurrency

Same architecture. Different workload. The migration from “one cluster per service” to “one cluster, many services” took DevOps a decade. ML is going through the same transition right now.
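Concurrency-based scaling is set per InferenceService. A minimal sketch using KServe's scaleMetric and scaleTarget fields, assuming a recent KServe release (the target of 10 in-flight requests is illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor-v1
  namespace: team-fraud
spec:
  predictor:
    minReplicas: 0
    scaleMetric: concurrency   # KPA scales on in-flight requests, not CPU
    scaleTarget: 10            # illustrative: ~10 concurrent requests per pod
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://models/churn-predictor-v1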


The Maturity Path

Single model
    ↓
Multiple models, isolated namespaces  (single-model trap)
    ↓
Shared inference pool                 (multi-model architecture)
    ↓
Per-team namespaces, self-service     (platform thinking)
    ↓
Scale-to-zero at scale                (this post)
    ↓
Full self-service ML platform         (next: MLOps Maturity Model)

(See Part 16: ML Governance for the model registry that makes this scale, and Part 22: ML Security for the RBAC + mTLS that keeps it safe.)


Quick Reference

Tool             Role
KServe           Per-model InferenceService with KPA autoscaling
Knative Serving  Scale-to-zero + Activator for cold starts
Istio            Shared gateway + host/path routing
Prometheus       Per-model metrics scraping
Grafana          Multi-tenant dashboards (cluster → team → model)

This is Part 23 of the MLOps for DevOps Engineers series. Hands-on Kubernetes and MLOps courses are available at stacksimplify.com/courses. For weekly updates, join the newsletter.
