# Quality Gates for ML: 4 Layers Between Training and Production
40% of our candidate models got rejected at the quality gate. That is not a failure rate. That is a protection rate.
Without quality gates, every model that finishes training goes to production. Good models. Bad models. Models trained on corrupted data. Models that score well on the test set but tank in production.
Quality gates ask one question before every deployment: is this model actually better than what we have?

## Four Layers of Quality Gates
Each layer catches what the previous one missed. Defense in depth for ML.
| Layer | Gate | Decision |
|---|---|---|
| 1. Registration Gate | Net Benefit > $40K AND F1 > 0.20 | Register or Reject |
| 2. Champion-Challenger | Candidate strictly beats champion on fixed test set | Promote or Hold |
| 3. CI/CD Pipeline Gate | Job 2 comparison result | Deploy or Skip Jobs 3-6 |
| 4. Canary Gate | Error rate, latency, prediction anomalies | Promote or Rollback |
## The Fixed Test Set Rule
Every comparison uses the same test set (random_state=42). If the test set changes between evaluations, you are comparing data, not models.
The fixed test set is the scientific control. Without it, your quality gate is theater.
## Layer 1: Registration Gate
A model that does not pass minimum thresholds never enters the MLflow Model Registry:
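A minimal sketch of the registration check, assuming the training run logs metrics named `net_benefit` and `f1` (the names are illustrative):

```python
NET_BENEFIT_FLOOR = 40_000   # $40K floor from the gate table
F1_FLOOR = 0.20

def passes_registration_gate(metrics: dict) -> bool:
    """Layer 1: both thresholds must clear (AND, not OR).
    Missing metrics count as failing, never as passing."""
    return (metrics.get("net_benefit", 0.0) > NET_BENEFIT_FLOOR
            and metrics.get("f1", 0.0) > F1_FLOOR)

# Hypothetical wiring into MLflow (run_id and model name are illustrative):
#   run = mlflow.get_run(run_id)
#   if passes_registration_gate(run.data.metrics):
#       mlflow.register_model(f"runs:/{run_id}/model", "churn-model")
```

Keeping the gate a pure function makes it trivially unit-testable, separate from the MLflow plumbing.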
The registry stays clean. No experimental garbage polluting production aliases. Only models worth comparing ever get registered.
## Layer 2: Champion vs Challenger
The candidate must be strictly better than the current champion. Ties go to the champion (less complexity, already validated).
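A minimal sketch of the comparison, assuming both models have already been scored on the same fixed test set and that net benefit is the deciding metric:

```python
def challenger_wins(champion_metrics: dict, candidate_metrics: dict,
                    metric: str = "net_benefit") -> bool:
    """Layer 2: strict inequality. A tie goes to the champion,
    which carries less change risk and is already validated."""
    return candidate_metrics[metric] > champion_metrics[metric]

# Hypothetical wiring: load both aliases, score them on the SAME fixed
# test set, then compare. Model names below are illustrative:
#   champ = mlflow.pyfunc.load_model("models:/churn-model@champion")
#   cand = mlflow.pyfunc.load_model("models:/churn-model@candidate")
#   if challenger_wins(champ_metrics, cand_metrics):
#       client.set_registered_model_alias("churn-model", "champion", cand_version)
```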
Same test set. Business metric. Strict comparison. (See Part 16: ML Governance for the alias pattern.)
## Layer 3: The CI/CD Pipeline Gate
The single most powerful gate. GitHub Actions runs the full pipeline. Job 2 trains and compares. If the candidate loses, Jobs 3-6 are skipped:
When the candidate loses:
- Workflow status: SUCCESS (green checkmark)
- Jobs 3-6: SKIPPED
- Production: UNCHANGED
- Champion: STILL SERVING
The pipeline did not fail. The pipeline protected production. A skipped deploy is a successful quality gate.
## Layer 4: Canary Gate
Model passed all three gates. Deploy via canary with KServe (80/20 split). Monitor errors, latency, prediction distribution. If anything breaks, automatic rollback.
The previous layers compare on a test set. The canary layer tests against real production traffic. That is the final truth.
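A sketch of the KServe side of this, assuming an InferenceService named `churn-model` (the name and storageUri are illustrative); `canaryTrafficPercent: 20` gives the 80/20 split:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model                # illustrative name
spec:
  predictor:
    canaryTrafficPercent: 20       # 20% of live traffic to the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/candidate   # illustrative URI
```

KServe keeps routing the remaining 80% to the previous revision; setting `canaryTrafficPercent` back to 0 (or reverting the storageUri) is the rollback path.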
## The DevOps Parallel
You already have quality gates. You just call them something else.
| DevOps Gate | ML Gate |
|---|---|
| CI tests must pass before merge | Registration gate (min thresholds) |
| Code review required | Champion-challenger comparison |
| Staging smoke tests | CI/CD pipeline gate (Job 2) |
| Canary deployment with rollback | Canary gate with auto-rollback |
Same mechanism. Different metric. Net benefit instead of test coverage.
## Five Anti-Patterns That Break Quality Gates
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| Different test sets | Comparing data, not models | Fixed random_state=42 |
| No business metrics | F1 up, revenue down | Include at least one business metric |
| Promote on tie | Churn for no gain | Strict improvement required |
| Skip gate in emergencies | Bypass becomes the norm | Fast-track gate, never skip |
| Gate without monitoring | Bad models slip past canary | Layer 4 is not optional |
## When the Gate Fails: What To Do
A rejected model is the system working. Then:
- Diagnose the comparison. Champion $45K vs Candidate $38K? Why did net benefit drop?
- Check the data. `dvc diff` for training data changes. Class imbalance shift?
- Check the model. Confusion matrix side-by-side. Threshold sensitivity shifted?
- Decide. Bad data? Fix the pipeline. Genuinely worse? Champion stays.
The gate is not the enemy. Bad models in production are the enemy.
## Quick Reference
| Tool | Role |
|---|---|
| MLflow Model Registry | Aliases (@champion, @candidate) |
| Kubeflow Pipelines | Quality gate component in the DAG |
| GitHub Actions | Conditional deploy based on gate output |
| KServe | Canary traffic split for Layer 4 |
| DVC | Detect training data changes when gate fails |
This is Part 19 of the MLOps for DevOps Engineers series. Hands-on MLOps courses are available at stacksimplify.com/courses. For weekly updates, join the newsletter.