<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>MLOps on StackSimplify | DevOps &amp; Cloud Education by Kalyan Reddy</title><link>https://stacksimplify.com/tags/mlops/</link><description>Recent content in MLOps on StackSimplify | DevOps &amp; Cloud Education by Kalyan Reddy</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://stacksimplify.com/tags/mlops/index.xml" rel="self" type="application/rss+xml"/><item><title>5 Levels of ML Model Deployment on Kubernetes</title><link>https://stacksimplify.com/blog/5-levels-ml-deployment/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/5-levels-ml-deployment/</guid><description>You deploy containers to Kubernetes every day. But how do you deploy ML models?
There are 5 levels. Each adds production capabilities. Here&amp;rsquo;s the progression.
The 5 Levels
Level | Pattern | DevOps Equivalent | When to Use
L1 | Baked Image | Static binary in container | Learning, simple models
L2 | MLflow Dynamic | Config from external store | Versioned, no rebuild
L3 | KServe Predictor | Deployment + HPA + Ingress | Scalable, zero downtime
L4 | KServe Transformer | Sidecar pattern | Modular, independent scaling
L5 | KServe Explainer | Audit logging | Compliance, GDPR
Level 1: Baked Image
Model baked into the Docker image at build time.</description></item><item><title>5 Questions to Ask Before Every ML Model Deployment</title><link>https://stacksimplify.com/blog/ml-deployment-checklist/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ml-deployment-checklist/</guid><description>A data scientist hands you a model.pkl and says &amp;ldquo;deploy this.&amp;rdquo;
What do you ask?
Most engineers jump straight to containers and endpoints. But the questions that save you at 2 AM are the ones you ask before deployment, not during an incident.
The Checklist
# | Question | Why It Matters
1 | What input will break it? | Models return garbage confidently on bad input
2 | What&amp;rsquo;s the rollback plan? | &amp;ldquo;Redeploy the old one&amp;rdquo; is not a plan
3 | How do we know it&amp;rsquo;s broken?</description></item><item><title>A/B Testing for ML Models: When Offline Metrics Lie</title><link>https://stacksimplify.com/blog/ab-testing-ml-models/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ab-testing-ml-models/</guid><description>You retrained the model. Accuracy went up 2% on the test set. You deployed it. Revenue dropped 5%.
What happened? Offline metrics lie. A model that scores better on historical data can score worse on real users.
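An A/B test only works if each user sees a consistent variant across requests. One common way to get that is deterministic hash-based assignment; a minimal sketch (function and variant names are illustrative, not from the post):

```python
import hashlib

# Deterministic 50/50 assignment: the same user always lands in the
# same bucket, so their experience stays consistent across requests.
def assign_variant(user_id, experiment="fraud-model-ab"):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 2          # 0 or 1, roughly uniform
    return "champion" if bucket == 0 else "challenger"

print(assign_variant("user-42"))
print(assign_variant("user-42"))  # identical: assignment is sticky
```

Changing the experiment name reshuffles every user, which is how you avoid carry-over bias between consecutive experiments.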
Canary vs A/B Testing
Approach | Question It Answers | Traffic Split
Canary | &amp;ldquo;Does it break anything?&amp;rdquo; | 10-20% to new model
A/B Testing | &amp;ldquo;Does it actually improve outcomes?&amp;rdquo; | 50/50 to both models
You need both. Canary first, then A/B.</description></item><item><title>Canary Deployments for ML Models with KServe and Istio</title><link>https://stacksimplify.com/blog/canary-rollouts-ml-models/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/canary-rollouts-ml-models/</guid><description>You do canary deployments for APIs every day. Why not for ML models?
New model ready. Looks good in testing. Deploy to production. Hope it works. It doesn&amp;rsquo;t. Rollback takes 5 minutes. Five minutes of garbage predictions. Damage done.
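KServe can express a champion/canary split declaratively, with Istio handling the actual routing. A hedged sketch, assuming the v1beta1 API; the model name and storage URI are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    # canaryTrafficPercent routes this share of traffic to the latest
    # revision; the remainder stays on the previous (champion) revision.
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v48   # illustrative path
```

Rollback is the same field: set it to 0 and all traffic returns to the champion, no redeploy needed.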
How It Works
Role | Traffic | Description
Champion (80%) | Production traffic | Current model, proven, stable
Canary (20%) | Test traffic | New version, running alongside
Both run simultaneously. Same endpoint. Istio handles the traffic split.</description></item><item><title>CI/CD for ML: Same GitHub Actions, Different Artifact</title><link>https://stacksimplify.com/blog/cicd-for-ml/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/cicd-for-ml/</guid><description>Your CI/CD pipeline deploys code. Ours deploys models. Same tools.
GitHub Actions. ArgoCD. Docker. DVC. MLflow. Same stack you already run. The only difference is what triggers the pipeline and what gets deployed.
Code pipeline: git push &amp;gt; build &amp;gt; test &amp;gt; deploy
ML pipeline: data change &amp;gt; retrain &amp;gt; evaluate &amp;gt; deploy
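In GitHub Actions, the "different trigger" is just the workflow's on: block. A hedged sketch, with paths, job names, and scripts assumed for illustration:

```yaml
name: ml-pipeline
on:
  push:
    paths:
      - "data/**.dvc"       # retrain when DVC-tracked data changes
      - "src/train/**"      # or when training code changes
jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: dvc pull               # fetch the exact data version
      - run: python src/train.py    # retrain and log the new model
```

Everything downstream of the trigger is the same Actions machinery you already run for code.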
The 7-Job ML Pipeline
Job | What It Does | Failure Action
0. Preflight | 7 infra checks in 5 min (MLflow up?</description></item><item><title>Data Drift Detection: When Your Model Stops Being Right</title><link>https://stacksimplify.com/blog/data-drift-detection/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/data-drift-detection/</guid><description>Your model was trained on last year&amp;rsquo;s data. The world has moved on. Your model has not.
Your model can return predictions with perfect latency, zero errors, 200 OK on every request. And every single prediction can be wrong.
Operational monitoring tells you the model is running. Statistical monitoring tells you the model is still right.
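One common statistical check (not necessarily the one this post settles on) is a two-sample Kolmogorov-Smirnov comparison of training inputs against live inputs. A pure-Python sketch; in practice you would likely reach for scipy.stats.ks_2samp:

```python
import bisect

# Data drift check via the Kolmogorov-Smirnov statistic: the largest
# gap between the empirical CDFs of training inputs and live inputs.
# 0.0 means the distributions look identical; values near 1.0 mean
# they barely overlap.
def ks_statistic(train, live):
    a, b = sorted(train), sorted(live)

    def cdf(sorted_sample, x):
        # Fraction of the sample at or below x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(cdf(a, x) - cdf(b, x)) for x in a + b)

print(ks_statistic([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))   # 0.0: no drift
print(ks_statistic([1, 2, 3, 4, 5], [11, 12, 13]))      # 1.0: total drift
```

You would run this per feature on a schedule and alert when the statistic (or its p-value) crosses a threshold you chose from historical data.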
The Three Types of Drift
Type | What Changed | Example
Data Drift | The inputs changed | Model trained on ages 25-45, now seeing ages 18-22
Concept Drift | The relationships changed | High frequency used to mean fraud, now means power user
Prediction Drift | The outputs changed | Fraud rate prediction jumped from 5% to 15%
The DevOps Parallel
Infrastructure monitoring: Is the server healthy?</description></item><item><title>DevOps Thinking Applied to MLOps: 5 Essential Tools</title><link>https://stacksimplify.com/blog/devops-thinking-mlops-tools/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/devops-thinking-mlops-tools/</guid><description>If you&amp;rsquo;re a DevOps engineer and a data scientist has ever handed you a model.pkl and said &amp;ldquo;deploy this&amp;rdquo;, you know the feeling.
Where did this come from? What data trained it? Which version is this? How do I scale it?
Here&amp;rsquo;s what I&amp;rsquo;ve learned after months building MLOps pipelines: these aren&amp;rsquo;t new problems. We&amp;rsquo;ve already solved them in DevOps. The tools are different, but the thinking is identical.
The Mental Model: Same Problems, Different Artifacts
Every MLOps challenge maps directly to a DevOps pattern you already understand:</description></item><item><title>DVC: Git for Your ML Training Data</title><link>https://stacksimplify.com/blog/dvc-data-version-control/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/dvc-data-version-control/</guid><description>You version code with Git. What about your model training data?
If you&amp;rsquo;ve ever asked &amp;ldquo;Which dataset trained this model?&amp;rdquo; or &amp;ldquo;Can we reproduce last month&amp;rsquo;s model exactly?&amp;rdquo;, you need DVC.
What DVC Solves
Problem | Without DVC | With DVC
Which dataset trained this model? | &amp;ldquo;Check the shared drive, maybe?&amp;rdquo; | git log shows exact data version
Someone changed the training data | No history, no diff | dvc diff shows exactly what changed
Reproduce last month&amp;rsquo;s model | Impossible | git checkout + dvc checkout
Your Weekend Starter
Six commands.</description></item><item><title>Feature Stores: The Package Registry for ML Features</title><link>https://stacksimplify.com/blog/feature-stores-ml/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/feature-stores-ml/</guid><description>Your training pipeline computes &amp;ldquo;average transaction amount&amp;rdquo; as the mean of the last 30 days. Your inference API computes it as the mean of the last 7 days.
Same feature name. Different values. Your model is silently wrong.
This is training-serving skew. The number one silent killer of ML models in production.
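The usual fix is to define each feature exactly once and call that one definition from both the training batch job and the serving API. A toy sketch, with names invented for illustration:

```python
# Single source of truth for the feature. Both the training pipeline
# and the serving API import and call this same function, so a
# 30-day vs 7-day mismatch cannot happen.
def avg_transaction_amount(amounts, window_days=30):
    recent = amounts[-window_days:]
    return sum(recent) / len(recent)

history = [10.0] * 23 + [100.0] * 7   # 30 days of transaction amounts

training_value = avg_transaction_amount(history)   # batch training job
serving_value = avg_transaction_amount(history)    # inference API
print(training_value == serving_value)             # same code, same value
```

A feature store generalizes this idea: the definition lives in one registry, and both contexts fetch computed values from it instead of reimplementing the logic.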
The Problem
ML features get computed in two places:
Context | How Features Are Computed | Problem
Training | Batch job on historical data, saved to CSV | Code written by data scientist
Serving | API computes on the fly per request | Different code, different logic
Two separate implementations.</description></item><item><title>ML Cost Optimization: One YAML Field Cut Our Bill by 80%</title><link>https://stacksimplify.com/blog/ml-cost-optimization/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ml-cost-optimization/</guid><description>We changed one YAML field from 1 to 0. Infrastructure cost dropped 80%.
The field: minReplicas.
When set to 1, your ML inference pod runs 24/7. Even at 3 AM when nobody is making predictions. That&amp;rsquo;s $50-150 per month per model, running idle.
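In a KServe InferenceService, the field sits under the predictor spec. A minimal sketch, assuming the v1beta1 API; the model name and storage URI are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    minReplicas: 0     # scale to zero when idle (requires Knative)
    maxReplicas: 3
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v47   # illustrative path
```

The trade-off is cold-start latency on the first request after idle, so this fits bursty or low-traffic models, not latency-critical hot paths.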
When set to 0, the pod scales to zero when idle. Traffic arrives, the pod spins up. Traffic stops, the pod disappears. You pay only for what you use.</description></item><item><title>ML Governance: The Champion-Challenger Pattern for Model Deployment</title><link>https://stacksimplify.com/blog/ml-governance-model-registry/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ml-governance-model-registry/</guid><description>Your ML serving code should never know about version numbers. Ever.
If your inference service loads fraud-detector-v47, you have a problem. What happens when v48 is ready? Code change. New deploy. Downtime risk.
Now imagine this: your service always loads the model tagged @champion. (MLflow Model Registry docs) When v48 is promoted, the tag moves. Next request gets the new model. Zero code changes. Zero downtime.
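The alias is just one level of indirection between a stable name and a moving version. A toy sketch of the mechanism (MLflow's Model Registry implements this for real behind models:/name@alias URIs; the version numbers and artifact strings below are invented):

```python
# Toy registry: concrete versions plus a mutable alias pointing at one.
registry = {
    ("fraud-detector", "v47"): "model-v47-artifact",
    ("fraud-detector", "v48"): "model-v48-artifact",
}
aliases = {"fraud-detector@champion": "v47"}

def load_model(uri):
    # Resolve alias to a version at load time, never hardcode it.
    name, alias = uri.split("@")
    version = aliases[f"{name}@{alias}"]
    return registry[(name, version)]

print(load_model("fraud-detector@champion"))   # serves v47

# Promotion: only the alias moves; serving code never changes.
aliases["fraud-detector@champion"] = "v48"
print(load_model("fraud-detector@champion"))   # next load gets v48
```

The serving code above never mentions v47 or v48, which is exactly the property the post argues for.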
The Champion-Challenger Pattern
Role | Alias | Purpose
Champion | @champion | Currently serving production traffic
Challenger | @candidate | Being evaluated against the champion
The flow:</description></item><item><title>ML Model Monitoring: Your Grafana Dashboard Is Lying to You</title><link>https://stacksimplify.com/blog/ml-model-monitoring/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ml-model-monitoring/</guid><description>Your ML model was 95% accurate when you deployed it. That was 6 months ago. Nobody has checked since.
A model can show 10% CPU, zero errors, healthy pod status. And still return garbage predictions. Your Grafana dashboard shows all green. Your customers see wrong results.
Why This Happens
Your monitoring tracks CPU, memory, and pod restarts. Your model cares about none of that.
Models degrade because the world changes:</description></item><item><title>ML Pipeline Orchestration with Kubeflow on Kubernetes</title><link>https://stacksimplify.com/blog/kubeflow-pipelines-orchestration/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/kubeflow-pipelines-orchestration/</guid><description>Your ML team has 47 Jupyter notebooks. 12 of them &amp;ldquo;should run in order.&amp;rdquo; Nobody remembers which 12.
One fetches data. Another cleans it. A third trains. A fourth evaluates. A fifth deploys. Different repos. Hardcoded paths. Two only work on Sarah&amp;rsquo;s laptop.
This is not a pipeline. This is a disaster waiting for a deadline.
Why ML Pipelines Are Different
Data pipelines move data from A to B. ETL. Airflow handles this well.</description></item><item><title>ML Retraining Pipelines: From Drift Alert to Production Model</title><link>https://stacksimplify.com/blog/ml-retraining-pipelines/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/ml-retraining-pipelines/</guid><description>Your drift detector triggered an alert. Now what?
Most teams freeze. The runbook says &amp;ldquo;retrain the model.&amp;rdquo; Nobody knows how. Monitoring without a retraining pipeline is like alerting without a runbook.
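At the scheduled level, a retrain is just a Kubernetes CronJob wrapping your training entrypoint. A sketch with an illustrative image, script, and schedule:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: retrain-fraud-detector
spec:
  schedule: "0 2 * * 0"        # 02:00 every Sunday
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: retrain
              image: registry.example.com/ml/retrain:latest   # illustrative
              args: ["python", "train.py", "--register"]
```

The triggered level replaces the cron schedule with the drift alert as the thing that creates the Job.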
The Retraining Spectrum
Level | Trigger | Best For
Manual | Data scientist retrains in a notebook | Small teams, low-risk models
Scheduled | Cron job retrains every week/month | Predictable drift patterns
Triggered | Drift detector kicks off pipeline automatically | High-value models
Most teams should start with manual.</description></item><item><title>MLflow in 60 Seconds: The Complete ML Model Lifecycle</title><link>https://stacksimplify.com/blog/mlflow-model-lifecycle/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/mlflow-model-lifecycle/</guid><description>How does an ML model actually get from training to production?
If you&amp;rsquo;re a DevOps engineer stepping into MLOps, MLflow is the first tool you need to understand. It handles the entire lifecycle: tracking experiments, versioning models, and serving them in production.
The 5-Step Lifecycle
Here&amp;rsquo;s the full journey of a model, from code to production.
Step | What Happens | DevOps Analogy
Experiment | Write training code, MLflow creates a &amp;ldquo;run&amp;rdquo; | Starting a CI build
Run | Logs parameters, metrics, model files | Build artifacts + test results
Model | Best run registered to Model Registry | Pushing image to Container Registry
Registry | Versions (v1, v2, v3) with aliases (@champion, @candidate) | Image tags (:latest, :staging, :prod)
Serving | API loads models:/fraud-detector@champion | K8s Deployment pulling :prod tag
Step 1: Experiment
You write training code and run it.</description></item><item><title>Scale-to-Zero for ML Models: Stop Paying for Idle Compute</title><link>https://stacksimplify.com/blog/scale-to-zero-ml-models/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/scale-to-zero-ml-models/</guid><description>Your ML model runs 24/7. Inference requests come 2% of the time. You&amp;rsquo;re paying for 98% idle compute.
This is the most expensive mistake in ML deployment. And the fix takes one YAML field.
How It Works
KServe + Knative handles this natively.
1. Your model is serving requests
2. Traffic drops. 30 seconds of silence
3. Knative scales pods to ZERO
4. New request arrives
5. Pod spins up in seconds. Request served.
Zero requests = zero pods = zero cost.</description></item><item><title>SHAP Explainability: Why Your ML Model Flagged That Transaction</title><link>https://stacksimplify.com/blog/shap-explainability-ml/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/shap-explainability-ml/</guid><description>Your ML model flagged a customer&amp;rsquo;s transaction. They call support and ask: &amp;ldquo;Why?&amp;rdquo;
If you can&amp;rsquo;t answer, you might be breaking the law.
GDPR Article 22 gives users the right to an explanation for automated decisions. Financial regulators require it. Healthcare demands it.
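SHAP's core property is additivity: the model's output equals a base value plus one contribution per feature, which is what makes a per-decision explanation possible. A toy illustration; the base value and contributions below are invented, not computed by the shap library:

```python
# Additivity: prediction = base value + sum of per-feature contributions.
# Numbers are invented for illustration.
base_value = 0.10                  # average model output over training data
contributions = {
    "amount_vs_average": 0.32,     # unusually large amount raises risk
    "country_unusual": 0.21,       # international from an unusual country
    "hour_of_day": 0.22,           # transaction at 3 AM local time
}
prediction = base_value + sum(contributions.values())
print(round(prediction, 2))        # 0.85: the risk score, now decomposed
```

That decomposition is the answer you hand to the customer: not just the score, but which features pushed it there and by how much.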
The Explanation
Instead of just HIGH RISK: 0.85, you get:
Feature | SHAP Value | Impact
Amount 5x higher than average | +0.32 | Increases risk
International from unusual country | +0.21 | Increases risk
Transaction at 3 AM local time | +0.</description></item><item><title>The Two-Container Pattern: Transformer + Predictor for ML Serving</title><link>https://stacksimplify.com/blog/transformer-predictor-pattern/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/transformer-predictor-pattern/</guid><description>Your ML model expects clean features. Your API receives raw data. Where does the preprocessing live?
Every team gets this wrong the first time. They stuff everything into one container: data validation, feature engineering, ML inference, output formatting. It works. Until it doesn&amp;rsquo;t.
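KServe expresses the split as sibling specs on one InferenceService: a transformer for pre/post-processing and a predictor for inference. A hedged sketch, assuming the v1beta1 API; images and URIs are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  transformer:                  # pre/post-processing container
    containers:
      - name: kserve-container
        image: registry.example.com/ml/feature-transformer:1.0   # illustrative
  predictor:                    # model inference container
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v47   # illustrative path
```

Each half gets its own pods, so feature logic can ship, scale, and fail independently of the model.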
The Problem with One Container
Model retrained? Rebuild the whole container. Feature logic changed? Rebuild the whole container. Need to scale inference independently? Everything scales together. Or breaks together.</description></item><item><title>MLOps for DevOps Engineers</title><link>https://stacksimplify.com/blog/mlops-series/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://stacksimplify.com/blog/mlops-series/</guid><description/></item></channel></rss>