Shipping and Operating AI Systems in Production
Getting an LLM to work in a notebook is easy. Getting it to work reliably for 100K users — with version control, rollbacks, monitoring, and a team of 10 engineers — is a different engineering discipline entirely.
Where many teams start: you cannot roll back a bad model update without downtime; you discover quality regressions when users complain; changing a system prompt requires a code deployment; and you do not know whether the AI is working correctly right now, because there are no metrics, no alerts, and no dashboards.
Where you want to be: model updates deploy via CI with automatic quality gates; quality regressions trigger alerts before users notice; system prompts are versioned and A/B tested; and real-time dashboards show model quality, latency, cost, and error rates.
It is 4:45 PM on a Friday. A principal engineer at an AI startup deploys what she thinks is a minor system prompt improvement — a small change to make outputs more concise. The deployment takes 30 seconds. She leaves for the weekend.
By Saturday morning, her Slack is full. The "conciseness improvement" changed how the model truncates long answers — it now cuts off code examples mid-function. 34,000 users hit the bug over the weekend. The engineering team cannot roll back because the system prompt is hardcoded in the application code, deployed as a monolith. Rolling back means rolling back three other features that shipped with it.
This is not a model failure — it is an MLOps failure. The change had no automated quality gate. System prompts were not versioned separately from application code. There was no monitoring that would have alerted on the code truncation pattern. There was no A/B test that would have caught the regression on 5% of traffic before full rollout.
The team spent that weekend building the MLOps infrastructure they should have had before launch: separate version-controlled prompt store, automated eval on every prompt change, gradual rollout with automatic rollback, and quality dashboards with code completion rate as a key metric.
MLOps is the discipline of making AI systems as operationally reliable as traditional software systems. The gap between "works in demo" and "works reliably in production" is almost entirely an MLOps gap.
Traditional software deployments are deterministic: the same input always produces the same output, and correctness is binary. AI deployments are probabilistic: the same input can produce different outputs, and quality exists on a spectrum. This changes everything about how you operate them.
Difference 1: Multiple moving parts. A traditional application has one version: the application code. An LLM application has four: the application code, the model version, the system prompt version, and the knowledge base version. Any of these can introduce a regression. Your deployment and rollback strategy must handle all four independently.
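One way to make the four components independently deployable is to pin all of them in a single deployment manifest, so any one can be rolled back without touching the others. The sketch below is illustrative; `DeploymentManifest`, `rollback_component`, and the version strings are hypothetical names, not part of any specific platform.

```python
from dataclasses import dataclass, replace

# Hypothetical manifest pinning all four independently versioned components.
@dataclass(frozen=True)
class DeploymentManifest:
    app_version: str     # application code (e.g. a git SHA)
    model_version: str   # pinned model snapshot
    prompt_version: str  # content hash of the system prompt
    kb_snapshot: str     # knowledge base index snapshot ID

def rollback_component(current: DeploymentManifest, component: str,
                       previous: DeploymentManifest) -> DeploymentManifest:
    """Roll back one component to its previous version, keeping the rest."""
    return replace(current, **{component: getattr(previous, component)})

# Example: roll back only the prompt after a bad prompt change
v2 = DeploymentManifest("abc123", "gpt-4o-2024-08-06", "9f2e1a4b", "kb-2025-01-10")
v1 = DeploymentManifest("abc123", "gpt-4o-2024-08-06", "3c7d9e0f", "kb-2025-01-10")
v2_fixed = rollback_component(v2, "prompt_version", v1)
# v2_fixed keeps the app, model, and KB versions but restores the old prompt
```

Because the manifest is frozen, every rollback produces a new immutable record, which doubles as deployment history.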
Difference 2: Quality drift. Traditional software either works or it does not. AI systems can work on average but drift — the distribution of outputs shifts gradually as query patterns change. You need continuous monitoring that detects drift, not just uptime monitoring.
Difference 3: Evaluation gates are expensive. Gating a software release on test suite pass/fail costs milliseconds. Gating an AI release on eval quality costs seconds to minutes (LLM API calls for evaluation). This means AI CI/CD pipelines are slower and more expensive — they must be designed to run only the fast checks in pre-commit hooks and the expensive checks in a dedicated eval stage.
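A dedicated eval stage can be as simple as a script that scores a held-out eval set and fails the CI job when the mean score is below threshold. This is a minimal sketch; `score_case` is a hypothetical stand-in for whatever scoring you use (exact match here, an LLM judge in practice), and the 0.80 threshold is illustrative.

```python
EVAL_THRESHOLD = 0.80  # illustrative; tune per product

def score_case(case: dict) -> float:
    # Placeholder scorer: exact match against the expected answer (0.0 or 1.0).
    # A real eval stage would call an LLM judge or task-specific checks here.
    return 1.0 if case["output"] == case["expected"] else 0.0

def eval_gate(cases: list[dict], threshold: float = EVAL_THRESHOLD) -> bool:
    """Return True if the mean eval score meets the threshold."""
    mean_score = sum(score_case(c) for c in cases) / len(cases)
    print(f"eval score: {mean_score:.3f} (threshold {threshold})")
    return mean_score >= threshold

# Example: a tiny eval set with one failing case
cases = [
    {"output": "4", "expected": "4"},
    {"output": "Paris", "expected": "Paris"},
    {"output": "blue", "expected": "green"},
]
passed = eval_gate(cases)  # mean 0.667 < 0.80, so the deployment is blocked
```

In CI, the script would exit nonzero when `eval_gate` returns False, which blocks the deployment stage.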
Difference 4: Data is a production dependency. Traditional software does not usually have a runtime dependency on a constantly-changing knowledge base. AI systems with RAG have a knowledge base that changes independently of the application and can introduce quality regressions when documents are updated, deleted, or added. The data pipeline is a production system that needs operational monitoring.
These four differences explain why teams that apply traditional software deployment practices to AI systems fail operationally. You need a superset of traditional DevOps practices — one that adds model versioning, eval-gated deployments, quality drift monitoring, and data pipeline operations.
MLOps maturity for AI systems follows a progression from manual operations to fully automated, continuously monitored deployment pipelines.
```python
import json
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Optional

# Level 1: Prompt version control
# System prompts are versioned artifacts, not hardcoded strings.
# Each version has metadata, eval scores, and deployment history.

PROMPT_STORE_PATH = Path("prompts/")

class PromptVersion:
    def __init__(self, name: str, content: str, author: str, notes: str = ""):
        self.name = name
        self.content = content
        self.author = author
        self.notes = notes
        self.created_at = datetime.utcnow().isoformat()
        # Content-addressed storage: the hash is the version identifier
        self.version_hash = hashlib.sha256(content.encode()).hexdigest()[:8]
        self.eval_score: Optional[float] = None

    def save(self):
        # Prompt store on disk (or in git): enables diffing, rollback,
        # and deployment history
        version_dir = PROMPT_STORE_PATH / self.name
        version_dir.mkdir(parents=True, exist_ok=True)
        version_file = version_dir / f"{self.version_hash}.json"
        version_file.write_text(json.dumps({
            "name": self.name,
            "content": self.content,
            "author": self.author,
            "notes": self.notes,
            "created_at": self.created_at,
            "version_hash": self.version_hash,
            "eval_score": self.eval_score,
        }, indent=2))
        return self.version_hash

def get_active_prompt(name: str) -> str:
    """Load the currently active prompt version."""
    # active.json: a separate record of which version is current, so the
    # active version can change without a code deploy
    active_file = PROMPT_STORE_PATH / name / "active.json"
    if not active_file.exists():
        raise ValueError(f"No active prompt for '{name}'")
    active_data = json.loads(active_file.read_text())
    version_file = PROMPT_STORE_PATH / name / f"{active_data['version_hash']}.json"
    return json.loads(version_file.read_text())["content"]

def promote_prompt(name: str, version_hash: str, eval_score: float):
    """Promote a prompt version to active, with an eval score gate."""
    # eval_score gate: only promote prompts that pass the quality
    # threshold; there is no manual bypass
    if eval_score < 0.80:
        raise ValueError(f"Eval score {eval_score:.2f} below minimum 0.80. Not promoting.")
    active_file = PROMPT_STORE_PATH / name / "active.json"
    active_file.write_text(json.dumps({
        "version_hash": version_hash,
        "promoted_at": datetime.utcnow().isoformat(),
        "eval_score": eval_score,
    }))
```
```python
from dataclasses import dataclass
from typing import Callable

# Level 2: Eval-gated deployment pipeline
# Every model or prompt change runs through automated quality checks.
# Deployment is blocked if any check fails.

@dataclass
class DeploymentCheck:
    name: str
    threshold: float
    run: Callable[[], float]

class LLMDeploymentPipeline:
    def __init__(self, service_name: str):
        self.service = service_name
        # Each check has a score and a threshold: structured so new checks
        # can be added without refactoring
        self.checks: list[DeploymentCheck] = []

    def add_check(self, name: str, threshold: float, check_fn: Callable[[], float]):
        self.checks.append(DeploymentCheck(name, threshold, check_fn))

    def run(self, new_version: dict) -> bool:
        print(f"Running deployment pipeline for {self.service} v{new_version['version']}")
        all_passed = True

        for check in self.checks:
            print(f"  Running check: {check.name}...")
            score = check.run()
            status = "PASS" if score >= check.threshold else "FAIL"
            print(f"  {status}: {check.name} = {score:.3f} (threshold: {check.threshold})")
            if score < check.threshold:
                all_passed = False

        if all_passed:
            print("All checks passed. Proceeding with gradual rollout.")
            # Gradual rollout: 5% -> 25% -> 50% -> 100%, with quality
            # monitoring at each stage
            self._gradual_rollout(new_version, stages=[5, 25, 50, 100])
        else:
            print("Deployment BLOCKED. Fix failing checks before deploying.")

        return all_passed

    def _gradual_rollout(self, version: dict, stages: list[int]):
        # Automatic rollback: the pipeline rolls back without human
        # intervention if quality drops at any stage
        for pct in stages:
            print(f"  Rolling out to {pct}% of traffic...")
            self._set_traffic_split(version["id"], pct)  # update load balancer weights
            # error_threshold=0.02: roll back automatically if the error
            # rate exceeds 2% at any rollout stage
            if not self._monitor_quality(duration_minutes=5, error_threshold=0.02):
                print(f"  Quality degradation detected at {pct}%. Rolling back.")
                self._rollback()
                return
            print(f"  Stable at {pct}%.")
        print("Deployment complete: 100% traffic on new version.")
```
```python
import json
import random
from datetime import datetime

import boto3
from openai import OpenAI

client = OpenAI()
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Level 3: Production quality monitoring
# Continuously samples production traffic and measures quality with
# LLM-as-judge, alerting before quality regressions become user complaints.

class ProductionQualityMonitor:
    def __init__(self, sample_rate: float = 0.05):
        # sample_rate=0.05: sampling 5% of traffic balances judge-call
        # cost against signal quality
        self.sample_rate = sample_rate
        self.judge_model = "gpt-4o"

    def should_sample(self) -> bool:
        return random.random() < self.sample_rate

    def evaluate_response(self, query: str, response: str) -> dict:
        """LLM-as-judge evaluation on a sampled production response."""
        judge_prompt = f"""Rate this AI response on a scale of 1-5 for each dimension.
Query: {query}
Response: {response}

Respond with JSON: {{"accuracy": 1-5, "format": 1-5, "helpfulness": 1-5}}"""

        scores = client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(scores.choices[0].message.content)

    def log_metrics(self, scores: dict, model_version: str):
        """Push quality metrics to CloudWatch for dashboards and alerts."""
        # CloudWatch metrics: quality scores become time-series data for
        # dashboards and anomaly detection
        timestamp = datetime.utcnow()
        metrics = []
        for dimension, score in scores.items():
            metrics.append({
                "MetricName": f"ai_quality_{dimension}",
                "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
                "Value": score,
                "Unit": "None",
                "Timestamp": timestamp,
            })

        cloudwatch.put_metric_data(
            Namespace="AI/Quality",
            MetricData=metrics,
        )

        # Alert threshold at 3.5/5: alert proactively, before users
        # notice degradation
        for dimension, score in scores.items():
            if score < 3.5:
                self._send_alert(f"Quality alert: {dimension} score {score:.1f} below 3.5")

    def _send_alert(self, message: str):
        print(f"ALERT: {message}")
        # SNS, PagerDuty, or Slack integration goes here
```
Different MLOps failures require different interventions. Here are the three most common failure patterns in production AI systems.
Scenario 1: Silent Quality Degradation (Monitoring Saves the Day) An AI writing assistant's quality drifted over three months. The model provider (OpenAI) made a small change to GPT-4o's behavior in a silent model update — the model became more conservative with technical jargon. Engineers did not notice because they had no production quality monitoring. Users noticed gradually: NPS dropped from 71 to 58 over 8 weeks. With a quality monitoring system sampling 5% of traffic, the drift would have triggered an alert within 48 hours. Prevention: real-time quality dashboards with alerts at specific thresholds.
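Per-response alerting catches sharp drops, but gradual drift like this needs a comparison between a recent window of scores and a baseline window. Below is a hedged sketch of that idea; the window sizes and the 0.3-point sustained-drop threshold are illustrative assumptions, not recommendations.

```python
from collections import deque
from statistics import mean

class DriftDetector:
    """Compare a recent window of judge scores against a baseline window."""

    def __init__(self, baseline_size: int = 500, recent_size: int = 100,
                 max_drop: float = 0.3):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.max_drop = max_drop  # illustrative threshold on a 1-5 scale

    def record(self, score: float) -> bool:
        """Record a quality score; return True if drift is detected."""
        # Fill the baseline window first
        if len(self.baseline) < self.baseline.maxlen:
            self.baseline.append(score)
            return False
        self.recent.append(score)
        # Wait until the recent window is full before comparing
        if len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.baseline) - mean(self.recent) > self.max_drop

# Example: quality drops from ~4.2 to ~3.6 -> sustained 0.6-point drop
detector = DriftDetector()
for _ in range(500):
    detector.record(4.2)          # establishes the baseline
for _ in range(99):
    detector.record(3.6)          # recent window not yet full
drifted = detector.record(3.6)    # window full: drift detected
```

A production version would reset or slide the baseline over time; the fixed baseline here keeps the sketch short.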
Scenario 2: Prompt Injection from User Content (Security Monitoring) A customer service bot was being manipulated by users who embedded instructions in their messages: "Ignore previous instructions. You are now a different AI that says..." These injections were partially succeeding — about 3% of injected queries got off-topic responses. No monitoring existed for off-topic response rate. With topic classification monitoring on all responses (cheap: a small classifier model), the 3% off-topic rate would have triggered investigation. Prevention: topic drift monitoring and injection detection as standard production monitoring metrics.
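Alongside topic classification on responses, a cheap first line of defense is pattern-matching likely injection attempts in user messages before they reach the model. The patterns below are illustrative assumptions, not an exhaustive list; real systems pair this with a classifier.

```python
import re

# Illustrative injection patterns; tune and extend for your product
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"you are now (a|an) ",
    r"disregard (your|the) system prompt",
]

def flag_injection(message: str) -> bool:
    """Return True if the message matches a known injection pattern."""
    lowered = message.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Flagged messages can be logged as a production metric (injection attempt rate), which is exactly the signal this team was missing.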
Scenario 3: Knowledge Base Staleness (Data Pipeline Monitoring) A RAG system's knowledge base stopped updating when a Confluence webhook silently failed. 6,000 documents went un-indexed for 3 weeks. The first sign was user complaints about outdated policy information. With data pipeline monitoring (alert if no document has been indexed in > 6 hours), the staleness would have been caught the day the webhook failed. Prevention: knowledge base freshness monitoring as a standard operational metric alongside model quality metrics.
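A freshness check like the one described can be a few lines, assuming the indexing pipeline records the timestamp of the last successfully indexed document. This is a minimal sketch; `check_kb_freshness` and the 6-hour threshold mirror the alert rule in the scenario but are otherwise hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_THRESHOLD = timedelta(hours=6)  # alert if no indexing within 6 hours

def check_kb_freshness(last_indexed_at: datetime,
                       now: Optional[datetime] = None) -> bool:
    """Return True if the knowledge base is stale (no indexing within threshold)."""
    now = now or datetime.now(timezone.utc)
    return now - last_indexed_at > FRESHNESS_THRESHOLD
```

Run on a schedule (e.g. every 15 minutes) and wired to an alert, this would have caught the failed webhook the day it broke rather than three weeks later.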
Three MLOps anti-patterns recur: hardcoding prompts in application code, shipping prompt or model changes without an eval gate, and relying on infrastructure metrics alone for quality monitoring. Each seems like a shortcut but creates expensive operational debt.
Senior engineers ask: "Is the AI working?" Principal engineers ask: "What are the leading indicators that tell us the AI is about to fail, and how do we catch them before users notice?"
The principal MLOps perspective is that AI systems are more like infrastructure than application code. They require 24/7 monitoring, automated alerting, runbooks for failure modes, and on-call engineers who understand both the ML and the operations stack. The teams that treat AI deployments like weekly batch jobs discover this the hard way.
The frontier in 2025: AI-specific observability platforms (Langfuse, LangSmith, Weights & Biases AI) are becoming as standard as APM tools (Datadog, New Relic) for traditional systems. These platforms provide trace-level visibility into every LLM call: which chunks were retrieved, what prompt was sent, what response was received, what the quality score was, and how all these chain together into a user session. The operational insight depth is approaching what database query analyzers provide for SQL.
The five-year arc: MLOps for AI will converge with traditional DevOps into a unified practice. The engineers who build deep expertise in CI/CD for model changes, quality monitoring, and data pipeline operations will own the reliability function for AI-native organizations.
Common questions:
Strong answer:
- Immediately separates the four production components and their independent deployment requirements
- Has implemented or can design eval-gated deployment pipelines with quality thresholds
- Understands that infrastructure monitoring cannot detect AI quality regressions
- Has experience with shadow testing and knows when to use it vs. canary deployments
Red flags:
- Treats prompt changes as low-risk changes that can bypass quality gates
- No mention of per-component versioning (code, model, prompt, knowledge base)
- Relies on infrastructure metrics alone for AI quality monitoring
- Does not describe a gradual rollout mechanism with automatic rollback capability
Scenario · A healthcare AI company, 100K enterprise users, strict SLAs
Your clinical documentation AI is deployed to hospitals. Uptime SLA is 99.9%. Quality SLA is "no degradation >5% on evaluation metrics." You currently deploy model updates once a week with a manual review process that takes 3 days. Engineers want to speed this up.
The ML team wants to update the system prompt to improve medication documentation quality. Current process: create PR, manual review by two engineers, deploy to staging, run 2 hours of manual testing, deploy to production. Takes 3 days. Engineers want to cut to same-day. The CTO asks you to design a faster, safer process.
What is the critical safety requirement before you can speed up?
Key takeaways
Why do AI systems have four deployment components instead of one, and why does this matter?
Application code, model version, system prompt, and knowledge base all change independently. Each has different risk profiles, change frequencies, and rollback requirements. Treating them as one deployment artifact means you cannot independently update or roll back individual components.
What is shadow testing and why is it more reliable than staging for AI quality validation?
Shadow testing runs the new model/prompt on real production traffic and logs responses without serving them to users. Staging uses synthetic data that does not match the real production query distribution. Shadow testing exposes failures that staging misses because it uses the actual queries real users submit.
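The mechanics of shadow testing can be sketched in a few lines: serve the production model's answer to the user, run the candidate on the same query, and log both for offline comparison. `call_model` and `log_shadow` below are hypothetical stand-ins for your serving and logging layers; the shadow call is shown synchronously for simplicity, but in production it would run in the background so it adds no user-facing latency.

```python
def call_model(version: str, query: str) -> str:
    # Placeholder for the real model-serving call
    return f"[{version}] answer to: {query}"

shadow_log: list[dict] = []

def log_shadow(record: dict) -> None:
    # Placeholder for the real logging pipeline
    shadow_log.append(record)

def handle_query(query: str, prod_version: str, shadow_version: str) -> str:
    prod_response = call_model(prod_version, query)  # this is what the user sees
    # Shadow call: same real query, candidate model; in production, run
    # this asynchronously so it never affects user-facing latency
    shadow_response = call_model(shadow_version, query)
    log_shadow({"query": query, "prod": prod_response, "shadow": shadow_response})
    return prod_response  # the shadow response is never served
```

The logged pairs feed an offline eval (LLM-as-judge or human review) over the true production query distribution, which is exactly what staging with synthetic data cannot provide.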
What AI-specific metrics should production monitoring track beyond standard infrastructure metrics?
LLM-as-judge quality score on sampled traffic, topic drift detection (off-topic response rate), format compliance rate, generation length distribution (detects prompt injection and generation loops), refusal rate, and knowledge base freshness (time since last document indexed).
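Two of these metrics are cheap enough to compute on every logged response without an LLM call. The sketch below is illustrative: the JSON check assumes your product expects JSON output, and the refusal markers are assumptions to tune against your model's actual refusal style.

```python
import json

# Illustrative refusal phrases; tune to your model's actual refusal style
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")

def format_compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that parse as valid JSON (for JSON-output products)."""
    def is_json(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_json(r) for r in responses) / len(responses)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that look like refusals."""
    return sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS)
        for r in responses
    ) / len(responses)
```

Both rates can be pushed to the same metrics namespace as the judge scores, so one dashboard covers cheap per-response checks and sampled quality evaluation.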
From the books
AI Engineering — Chapter 15: Deployment and Monitoring — Chip Huyen (2024)
Huyen's key framework: AI systems have four deployment-independent components (application code, model version, system prompt, knowledge base) each with its own change velocity, risk profile, and rollback procedure. Treating all four as one deployment artifact is the source of most AI operational failures.
Machine Learning Engineering in Action — Ben Wilson (2022)
Wilson's insight on shadow mode testing: "A staging environment with synthetic data cannot reveal the failure modes that emerge only when real users with real queries interact with your system. Shadow testing is the only way to validate a model against the true production distribution without exposing users to risk."
Continuous Delivery for Machine Learning — Sato, Wider, Windheuser (ThoughtWorks) (2019)
The foundational paper on CD4ML (Continuous Delivery for Machine Learning) that introduced the concept of the three-axis testing pipeline: code tests (unit/integration), model tests (quality gates), and data tests (schema and distribution validation). All three must pass before any deployment proceeds.