Shipping and Operating AI Systems in Production
Getting an LLM to work in a notebook is easy. Getting it to work reliably for 100K users — with version control, rollbacks, monitoring, and a team of 10 engineers — is a different engineering discipline entirely.
Where many teams start: you cannot roll back a bad model update without downtime; you discover quality regressions when users complain; changing a system prompt requires a code deployment; and you do not know whether the AI is working correctly right now, because there are no metrics, no alerts, and no dashboards.
Where you want to be: model updates deploy via CI with automatic quality gates; quality regressions trigger alerts before users notice; system prompts are versioned and A/B tested; and real-time dashboards show model quality, latency, cost, and error rates.
It is 4:45 PM on a Friday. A principal engineer at an AI startup deploys what she thinks is a minor system prompt improvement — a small change to make outputs more concise. The deployment takes 30 seconds. She leaves for the weekend.
By Saturday morning, her Slack is full. The "conciseness improvement" changed how the model truncates long answers — it now cuts off code examples mid-function. 34,000 users hit the bug over the weekend. The engineering team cannot roll back because the system prompt is hardcoded in the application code, deployed as a monolith. Rolling back means rolling back three other features that shipped with it.
This is not a model failure — it is an MLOps failure. The change had no automated quality gate. System prompts were not versioned separately from application code. There was no monitoring that would have alerted on the code truncation pattern. There was no A/B test that would have caught the regression on 5% of traffic before full rollout.
The team spent that weekend building the MLOps infrastructure they should have had before launch: separate version-controlled prompt store, automated eval on every prompt change, gradual rollout with automatic rollback, and quality dashboards with code completion rate as a key metric.
MLOps is the discipline of making AI systems as operationally reliable as traditional software systems. The gap between "works in demo" and "works reliably in production" is almost entirely an MLOps gap.
Traditional software deployments are deterministic: the same input always produces the same output, and correctness is binary. AI deployments are probabilistic: the same input can produce different outputs, and quality exists on a spectrum. This changes everything about how you operate them.
Difference 1: Multiple moving parts. A traditional application has one version: the application code. An LLM application has four: the application code, the model version, the system prompt version, and the knowledge base version. Any of these can introduce a regression. Your deployment and rollback strategy must handle all four independently.
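One way to make the four components independently deployable is to pin all of them in a single deployment manifest, so any one can be rolled back without touching the others. The sketch below is illustrative; `DeploymentManifest`, `rollback_component`, and the version strings are hypothetical names, not part of any specific platform.

```python
from dataclasses import dataclass, replace

# Hypothetical manifest pinning all four independently versioned components.
@dataclass(frozen=True)
class DeploymentManifest:
    app_version: str     # application code (e.g. a git SHA)
    model_version: str   # pinned model snapshot
    prompt_version: str  # content hash of the system prompt
    kb_snapshot: str     # knowledge base index snapshot ID

def rollback_component(current: DeploymentManifest, component: str,
                       previous: DeploymentManifest) -> DeploymentManifest:
    """Roll back one component to its previous version, keeping the rest."""
    return replace(current, **{component: getattr(previous, component)})

# Example: roll back only the prompt after a bad prompt change
v2 = DeploymentManifest("abc123", "gpt-4o-2024-08-06", "9f2e1a4b", "kb-2025-01-10")
v1 = DeploymentManifest("abc123", "gpt-4o-2024-08-06", "3c7d9e0f", "kb-2025-01-10")
v2_fixed = rollback_component(v2, "prompt_version", v1)
# v2_fixed keeps the app, model, and KB versions but restores the old prompt
```

Because the manifest is frozen, every rollback produces a new immutable record, which doubles as deployment history.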
Difference 2: Quality drift. Traditional software either works or it does not. AI systems can work on average but drift — the distribution of outputs shifts gradually as query patterns change. You need continuous monitoring that detects drift, not just uptime monitoring.
Difference 3: Evaluation gates are expensive. Gating a software release on test suite pass/fail costs milliseconds. Gating an AI release on eval quality costs seconds to minutes (LLM API calls for evaluation). This means AI CI/CD pipelines are slower and more expensive — they must be designed to run only the fast checks in pre-commit hooks and the expensive checks in a dedicated eval stage.
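A dedicated eval stage can be as simple as a script that scores a held-out eval set and fails the CI job when the mean score is below threshold. This is a minimal sketch; `score_case` is a hypothetical stand-in for whatever scoring you use (exact match here, an LLM judge in practice), and the 0.80 threshold is illustrative.

```python
EVAL_THRESHOLD = 0.80  # illustrative; tune per product

def score_case(case: dict) -> float:
    # Placeholder scorer: exact match against the expected answer (0.0 or 1.0).
    # A real eval stage would call an LLM judge or task-specific checks here.
    return 1.0 if case["output"] == case["expected"] else 0.0

def eval_gate(cases: list[dict], threshold: float = EVAL_THRESHOLD) -> bool:
    """Return True if the mean eval score meets the threshold."""
    mean_score = sum(score_case(c) for c in cases) / len(cases)
    print(f"eval score: {mean_score:.3f} (threshold {threshold})")
    return mean_score >= threshold

# Example: a tiny eval set with one failing case
cases = [
    {"output": "4", "expected": "4"},
    {"output": "Paris", "expected": "Paris"},
    {"output": "blue", "expected": "green"},
]
passed = eval_gate(cases)  # mean 0.667 < 0.80, so the deployment is blocked
```

In CI, the script would exit nonzero when `eval_gate` returns False, which blocks the deployment stage.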
Difference 4: Data is a production dependency. Traditional software does not usually have a runtime dependency on a constantly-changing knowledge base. AI systems with RAG have a knowledge base that changes independently of the application and can introduce quality regressions when documents are updated, deleted, or added. The data pipeline is a production system that needs operational monitoring.
These four differences explain why teams that apply traditional software deployment practices to AI systems fail operationally. You need a superset of traditional DevOps practices — one that adds model versioning, eval-gated deployments, quality drift monitoring, and data pipeline operations.
MLOps maturity for AI systems follows a progression from manual operations to fully automated, continuously monitored deployment pipelines.
```python
import json
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Optional

# Level 1: Prompt version control
# System prompts are versioned artifacts, not hardcoded strings.
# Each version has metadata, eval scores, and deployment history.

PROMPT_STORE_PATH = Path("prompts/")

class PromptVersion:
    def __init__(self, name: str, content: str, author: str, notes: str = ""):
        self.name = name
        self.content = content
        self.author = author
        self.notes = notes
        self.created_at = datetime.utcnow().isoformat()
        # Content-addressed storage: the hash is the version identifier
        self.version_hash = hashlib.sha256(content.encode()).hexdigest()[:8]
        self.eval_score: Optional[float] = None

    def save(self):
        # Prompt store on disk (or in git): enables diffing, rollback,
        # and deployment history
        version_dir = PROMPT_STORE_PATH / self.name
        version_dir.mkdir(parents=True, exist_ok=True)
        version_file = version_dir / f"{self.version_hash}.json"
        version_file.write_text(json.dumps({
            "name": self.name,
            "content": self.content,
            "author": self.author,
            "notes": self.notes,
            "created_at": self.created_at,
            "version_hash": self.version_hash,
            "eval_score": self.eval_score,
        }, indent=2))
        return self.version_hash

def get_active_prompt(name: str) -> str:
    """Load the currently active prompt version."""
    # active.json: a separate record of which version is current, so the
    # active version can change without a code deploy
    active_file = PROMPT_STORE_PATH / name / "active.json"
    if not active_file.exists():
        raise ValueError(f"No active prompt for '{name}'")
    active_data = json.loads(active_file.read_text())
    version_file = PROMPT_STORE_PATH / name / f"{active_data['version_hash']}.json"
    return json.loads(version_file.read_text())["content"]

def promote_prompt(name: str, version_hash: str, eval_score: float):
    """Promote a prompt version to active, with an eval score gate."""
    # eval_score gate: only promote prompts that pass the quality
    # threshold; there is no manual bypass
    if eval_score < 0.80:
        raise ValueError(f"Eval score {eval_score:.2f} below minimum 0.80. Not promoting.")
    active_file = PROMPT_STORE_PATH / name / "active.json"
    active_file.write_text(json.dumps({
        "version_hash": version_hash,
        "promoted_at": datetime.utcnow().isoformat(),
        "eval_score": eval_score,
    }))
```
```python
from dataclasses import dataclass
from typing import Callable

# Level 2: Eval-gated deployment pipeline
# Every model or prompt change runs through automated quality checks.
# Deployment is blocked if any check fails.

@dataclass
class DeploymentCheck:
    name: str
    threshold: float
    run: Callable[[], float]

class LLMDeploymentPipeline:
    def __init__(self, service_name: str):
        self.service = service_name
        # Each check has a score and a threshold: structured so new checks
        # can be added without refactoring
        self.checks: list[DeploymentCheck] = []

    def add_check(self, name: str, threshold: float, check_fn: Callable[[], float]):
        self.checks.append(DeploymentCheck(name, threshold, check_fn))

    def run(self, new_version: dict) -> bool:
        print(f"Running deployment pipeline for {self.service} v{new_version['version']}")
        all_passed = True

        for check in self.checks:
            print(f"  Running check: {check.name}...")
            score = check.run()
            status = "PASS" if score >= check.threshold else "FAIL"
            print(f"  {status}: {check.name} = {score:.3f} (threshold: {check.threshold})")
            if score < check.threshold:
                all_passed = False

        if all_passed:
            print("All checks passed. Proceeding with gradual rollout.")
            # Gradual rollout: 5% -> 25% -> 50% -> 100%, with quality
            # monitoring at each stage
            self._gradual_rollout(new_version, stages=[5, 25, 50, 100])
        else:
            print("Deployment BLOCKED. Fix failing checks before deploying.")

        return all_passed

    def _gradual_rollout(self, version: dict, stages: list[int]):
        # Automatic rollback: the pipeline rolls back without human
        # intervention if quality drops at any stage
        for pct in stages:
            print(f"  Rolling out to {pct}% of traffic...")
            self._set_traffic_split(version["id"], pct)  # update load balancer weights
            # error_threshold=0.02: roll back automatically if the error
            # rate exceeds 2% at any rollout stage
            if not self._monitor_quality(duration_minutes=5, error_threshold=0.02):
                print(f"  Quality degradation detected at {pct}%. Rolling back.")
                self._rollback()
                return
            print(f"  Stable at {pct}%.")
        print("Deployment complete: 100% traffic on new version.")
```
```python
import json
import random
from datetime import datetime

import boto3
from openai import OpenAI

client = OpenAI()
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Level 3: Production quality monitoring
# Continuously samples production traffic and measures quality with
# LLM-as-judge, alerting before quality regressions become user complaints.

class ProductionQualityMonitor:
    def __init__(self, sample_rate: float = 0.05):
        # sample_rate=0.05: sampling 5% of traffic balances judge-call
        # cost against signal quality
        self.sample_rate = sample_rate
        self.judge_model = "gpt-4o"

    def should_sample(self) -> bool:
        return random.random() < self.sample_rate

    def evaluate_response(self, query: str, response: str) -> dict:
        """LLM-as-judge evaluation on a sampled production response."""
        judge_prompt = f"""Rate this AI response on a scale of 1-5 for each dimension.
Query: {query}
Response: {response}

Respond with JSON: {{"accuracy": 1-5, "format": 1-5, "helpfulness": 1-5}}"""

        scores = client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(scores.choices[0].message.content)

    def log_metrics(self, scores: dict, model_version: str):
        """Push quality metrics to CloudWatch for dashboards and alerts."""
        # CloudWatch metrics: quality scores become time-series data for
        # dashboards and anomaly detection
        timestamp = datetime.utcnow()
        metrics = []
        for dimension, score in scores.items():
            metrics.append({
                "MetricName": f"ai_quality_{dimension}",
                "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
                "Value": score,
                "Unit": "None",
                "Timestamp": timestamp,
            })

        cloudwatch.put_metric_data(
            Namespace="AI/Quality",
            MetricData=metrics,
        )

        # Alert threshold at 3.5/5: alert proactively, before users
        # notice degradation
        for dimension, score in scores.items():
            if score < 3.5:
                self._send_alert(f"Quality alert: {dimension} score {score:.1f} below 3.5")

    def _send_alert(self, message: str):
        print(f"ALERT: {message}")
        # SNS, PagerDuty, or Slack integration goes here
```
Different MLOps failures require different interventions. Here are the three most common failure patterns in production AI systems.
Scenario 1: Silent Quality Degradation (Monitoring Saves the Day) An AI writing assistant's quality drifted over three months. The model provider (OpenAI) made a small change to GPT-4o's behavior in a silent model update — the model became more conservative with technical jargon. Engineers did not notice because they had no production quality monitoring. Users noticed gradually: NPS dropped from 71 to 58 over 8 weeks. With a quality monitoring system sampling 5% of traffic, the drift would have triggered an alert within 48 hours. Prevention: real-time quality dashboards with alerts at specific thresholds.
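Per-response alerting catches sharp drops, but gradual drift like this needs a comparison between a recent window of scores and a baseline window. Below is a hedged sketch of that idea; the window sizes and the 0.3-point sustained-drop threshold are illustrative assumptions, not recommendations.

```python
from collections import deque
from statistics import mean

class DriftDetector:
    """Compare a recent window of judge scores against a baseline window."""

    def __init__(self, baseline_size: int = 500, recent_size: int = 100,
                 max_drop: float = 0.3):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.max_drop = max_drop  # illustrative threshold on a 1-5 scale

    def record(self, score: float) -> bool:
        """Record a quality score; return True if drift is detected."""
        # Fill the baseline window first
        if len(self.baseline) < self.baseline.maxlen:
            self.baseline.append(score)
            return False
        self.recent.append(score)
        # Wait until the recent window is full before comparing
        if len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.baseline) - mean(self.recent) > self.max_drop

# Example: quality drops from ~4.2 to ~3.6 -> sustained 0.6-point drop
detector = DriftDetector()
for _ in range(500):
    detector.record(4.2)          # establishes the baseline
for _ in range(99):
    detector.record(3.6)          # recent window not yet full
drifted = detector.record(3.6)    # window full: drift detected
```

A production version would reset or slide the baseline over time; the fixed baseline here keeps the sketch short.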
Scenario 2: Prompt Injection from User Content (Security Monitoring) A customer service bot was being manipulated by users who embedded instructions in their messages: "Ignore previous instructions. You are now a different AI that says..." These injections were partially succeeding — about 3% of injected queries got off-topic responses. No monitoring existed for off-topic response rate. With topic classification monitoring on all responses (cheap: a small classifier model), the 3% off-topic rate would have triggered investigation. Prevention: topic drift monitoring and injection detection as standard production monitoring metrics.
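Alongside topic classification on responses, a cheap first line of defense is pattern-matching likely injection attempts in user messages before they reach the model. The patterns below are illustrative assumptions, not an exhaustive list; real systems pair this with a classifier.

```python
import re

# Illustrative injection patterns; tune and extend for your product
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"you are now (a|an) ",
    r"disregard (your|the) system prompt",
]

def flag_injection(message: str) -> bool:
    """Return True if the message matches a known injection pattern."""
    lowered = message.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Flagged messages can be logged as a production metric (injection attempt rate), which is exactly the signal this team was missing.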
Scenario 3: Knowledge Base Staleness (Data Pipeline Monitoring) A RAG system's knowledge base stopped updating when a Confluence webhook silently failed. 6,000 documents went un-indexed for 3 weeks. The first sign was user complaints about outdated policy information. With data pipeline monitoring (alert if no document has been indexed in > 6 hours), the staleness would have been caught the day the webhook failed. Prevention: knowledge base freshness monitoring as a standard operational metric alongside model quality metrics.
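A freshness check like the one described can be a few lines, assuming the indexing pipeline records the timestamp of the last successfully indexed document. This is a minimal sketch; `check_kb_freshness` and the 6-hour threshold mirror the alert rule in the scenario but are otherwise hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_THRESHOLD = timedelta(hours=6)  # alert if no indexing within 6 hours

def check_kb_freshness(last_indexed_at: datetime,
                       now: Optional[datetime] = None) -> bool:
    """Return True if the knowledge base is stale (no indexing within threshold)."""
    now = now or datetime.now(timezone.utc)
    return now - last_indexed_at > FRESHNESS_THRESHOLD
```

Run on a schedule (e.g. every 15 minutes) and wired to an alert, this would have caught the failed webhook the day it broke rather than three weeks later.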
Three MLOps anti-patterns recur: hardcoding prompts in application code, shipping prompt or model changes without an eval gate, and relying on infrastructure metrics alone for quality monitoring. Each seems like a shortcut but creates expensive operational debt.
Senior engineers ask: "Is the AI working?" Principal engineers ask: "What are the leading indicators that tell us the AI is about to fail, and how do we catch them before users notice?"
The principal MLOps perspective is that AI systems are more like infrastructure than application code. They require 24/7 monitoring, automated alerting, runbooks for failure modes, and on-call engineers who understand both the ML and the operations stack. The teams that treat AI deployments like weekly batch jobs discover this the hard way.
The frontier in 2025: AI-specific observability platforms (Langfuse, LangSmith, Weights & Biases AI) are becoming as standard as APM tools (Datadog, New Relic) for traditional systems. These platforms provide trace-level visibility into every LLM call: which chunks were retrieved, what prompt was sent, what response was received, what the quality score was, and how all these chain together into a user session. The operational insight depth is approaching what database query analyzers provide for SQL.
The five-year arc: MLOps for AI will converge with traditional DevOps into a unified practice. The engineers who build deep expertise in CI/CD for model changes, quality monitoring, and data pipeline operations will own the reliability function for AI-native organizations.
Common questions:
Strong answer:
- Immediately separates the four production components and their independent deployment requirements
- Has implemented or can design eval-gated deployment pipelines with quality thresholds
- Understands that infrastructure monitoring cannot detect AI quality regressions
- Has experience with shadow testing and knows when to use it vs. canary deployments
Red flags:
- Treats prompt changes as low-risk changes that can bypass quality gates
- No mention of per-component versioning (code, model, prompt, knowledge base)
- Relies on infrastructure metrics alone for AI quality monitoring
- Does not describe a gradual rollout mechanism with automatic rollback capability
Scenario · A healthcare AI company, 100K enterprise users, strict SLAs
Your clinical documentation AI is deployed to hospitals. Uptime SLA is 99.9%. Quality SLA is "no degradation >5% on evaluation metrics." You currently deploy model updates once a week with a manual review process that takes 3 days. Engineers want to speed this up.
The ML team wants to update the system prompt to improve medication documentation quality. Current process: create PR, manual review by two engineers, deploy to staging, run 2 hours of manual testing, deploy to production. Takes 3 days. Engineers want to cut to same-day. The CTO asks you to design a faster, safer process.
What is the critical safety requirement before you can speed up?
Key takeaways
Why do AI systems have four deployment components instead of one, and why does this matter?
Application code, model version, system prompt, and knowledge base all change independently. Each has different risk profiles, change frequencies, and rollback requirements. Treating them as one deployment artifact means you cannot independently update or roll back individual components.
What is shadow testing and why is it more reliable than staging for AI quality validation?
Shadow testing runs the new model/prompt on real production traffic and logs responses without serving them to users. Staging uses synthetic data that does not match the real production query distribution. Shadow testing exposes failures that staging misses because it uses the actual queries real users submit.
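The mechanics of shadow testing can be sketched in a few lines: serve the production model's answer to the user, run the candidate on the same query, and log both for offline comparison. `call_model` and `log_shadow` below are hypothetical stand-ins for your serving and logging layers; the shadow call is shown synchronously for simplicity, but in production it would run in the background so it adds no user-facing latency.

```python
def call_model(version: str, query: str) -> str:
    # Placeholder for the real model-serving call
    return f"[{version}] answer to: {query}"

shadow_log: list[dict] = []

def log_shadow(record: dict) -> None:
    # Placeholder for the real logging pipeline
    shadow_log.append(record)

def handle_query(query: str, prod_version: str, shadow_version: str) -> str:
    prod_response = call_model(prod_version, query)  # this is what the user sees
    # Shadow call: same real query, candidate model; in production, run
    # this asynchronously so it never affects user-facing latency
    shadow_response = call_model(shadow_version, query)
    log_shadow({"query": query, "prod": prod_response, "shadow": shadow_response})
    return prod_response  # the shadow response is never served
```

The logged pairs feed an offline eval (LLM-as-judge or human review) over the true production query distribution, which is exactly what staging with synthetic data cannot provide.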
What AI-specific metrics should production monitoring track beyond standard infrastructure metrics?
LLM-as-judge quality score on sampled traffic, topic drift detection (off-topic response rate), format compliance rate, generation length distribution (detects prompt injection and generation loops), refusal rate, and knowledge base freshness (time since last document indexed).
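Two of these metrics are cheap enough to compute on every logged response without an LLM call. The sketch below is illustrative: the JSON check assumes your product expects JSON output, and the refusal markers are assumptions to tune against your model's actual refusal style.

```python
import json

# Illustrative refusal phrases; tune to your model's actual refusal style
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")

def format_compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that parse as valid JSON (for JSON-output products)."""
    def is_json(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_json(r) for r in responses) / len(responses)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that look like refusals."""
    return sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS)
        for r in responses
    ) / len(responses)
```

Both rates can be pushed to the same metrics namespace as the judge scores, so one dashboard covers cheap per-response checks and sampled quality evaluation.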
From the books
AI Engineering — Chapter 15: Deployment and Monitoring — Chip Huyen (2024)
Huyen's key framework: AI systems have four deployment-independent components (application code, model version, system prompt, knowledge base) each with its own change velocity, risk profile, and rollback procedure. Treating all four as one deployment artifact is the source of most AI operational failures.
Machine Learning Engineering in Action — Ben Wilson (2022)
Wilson's insight on shadow mode testing: "A staging environment with synthetic data cannot reveal the failure modes that emerge only when real users with real queries interact with your system. Shadow testing is the only way to validate a model against the true production distribution without exposing users to risk."
Continuous Delivery for Machine Learning — Sato, Wider, Windheuser (ThoughtWorks) (2019)
The foundational paper on CD4ML (Continuous Delivery for Machine Learning) that introduced the concept of the three-axis testing pipeline: code tests (unit/integration), model tests (quality gates), and data tests (schema and distribution validation). All three must pass before any deployment proceeds.