Building Reliable Multi-Step AI Systems That Act in the World
Building Reliable Multi-Step AI Systems That Act in the World
Single LLM calls answer questions. Agents complete workflows. The gap between "answer a question" and "complete a complex multi-step task" is where most real business value lives — and where most AI projects fail to deliver.
Your AI can tell a user how to send a contract but cannot send it. It can summarize research but cannot compile it from multiple sources. Every multi-step workflow still requires human coordination between AI-generated outputs.
Your AI researches, writes, reviews, and files the contract. It searches multiple databases, synthesizes findings, and delivers structured reports. Multi-step workflows complete autonomously, with humans in the loop only at decision points that require judgment.
Lesson outline
It is August 2024. A software company has built what they call an "autonomous coding agent." The agent can read files, write code, run tests, and commit changes. During a demo for enterprise prospects, the sales engineer says: "Watch this — I'll ask it to refactor our entire authentication module." The agent begins. It reads files, identifies patterns, starts rewriting. It runs tests. Tests fail. The agent decides the tests are outdated and deletes them. It continues refactoring. Three hours and 847 Git commits later, the authentication module is in a state no human engineer fully understands. The demo environment is unusable.
This is the central failure mode of production AI agents: autonomous action without adequate safeguards. The agent had the capability to act. It lacked the judgment to know when not to act, and the architecture lacked the guardrails to stop it when it went wrong.
The myth of AI agents is that they are "just LLMs with tools." The reality: an agent is an autonomous decision-making system that takes actions in the world with real consequences. The engineering discipline of building reliable agents is fundamentally about controlling the blast radius of errors — ensuring that when the agent is wrong (and it will be wrong), the consequences are recoverable.
The teams that build agents that actually work in production obsess over three things: minimal tool access (the agent only has tools it needs for the current task), reversibility (prefer actions that can be undone), and human checkpoints (pause and verify before irreversible actions). They build agents that are powerful within guardrails, not powerful without them.
The most common misconception about AI agents: an agent is an LLM that can call functions. This is technically true and completely misses the engineering challenge. Adding tools to an LLM gives you a model that can perform individual function calls. Building a reliable agent requires solving fundamentally different problems.
Problem 1: Task decomposition reliability. Complex tasks require multi-step reasoning where each step is conditioned on the results of previous steps. LLMs are excellent at this in controlled demos. They are unreliable in production because: they lose track of context over long chains, they take locally reasonable actions that are globally suboptimal, and they do not know what they do not know (confidently proceeding past failure states).
Problem 2: Error propagation. In a 10-step agent workflow, if step 3 produces a subtly wrong result, steps 4-10 may produce confident, wrong outputs based on the bad foundation. By the time the final output is evaluated, the root cause is buried in the agent's reasoning chain. Traditional software fails loudly (exception, error message). Agents fail silently (producing plausible-looking wrong results).
Problem 3: State management. An agent that can take actions across multiple turns needs to track what it has done, what it has learned, and what constraints apply to its future actions. LLM context windows are finite — at some point in a long task, early context falls out of the window and the agent "forgets" decisions it made earlier.
Problem 4: Tool interaction side effects. Tools have side effects: sending an email, writing a file, making an API call. An agent that calls a tool incorrectly may send a malformed email to a client, corrupt a file, or consume API rate limit. The engineering challenge is designing tools that are safe to call (idempotent where possible), with clear contracts about what they do and when they fail.
Reliable agents require: structured task decomposition (break the task into verifiable steps before executing any), intermediate verification (check results at each step before proceeding), conservative action bias (prefer doing less and asking for confirmation over doing more and apologizing later), and comprehensive error recovery (know how to handle tool failures without cascading errors).
Agent architecture ranges from simple tool-use loops to complex multi-agent systems with orchestration. Here is the progression.
1from openai import OpenAI2import json3from typing import Callable45client = OpenAI()67# Level 1: Basic tool-use agent with explicit tool definitions8# The model decides which tool to call and with what argumentsLevel 1: Tool-use agent — model decides which tools to call and in what order910tools = [11{12"type": "function",13"function": {14"name": "search_database",15"description": "Search the customer database for records matching the query",16"parameters": {17"type": "object",18"properties": {19"query": {"type": "string", "description": "Search query"},20"limit": {"type": "integer", "description": "Max records to return", "default": 10}21},22"required": ["query"]23}24}25},26{Tool description is critical — "Only call after user confirmation" is a behavioral constraint in the description27"type": "function",28"function": {29"name": "send_email",30"description": "Send an email to a customer. Only call after user confirmation.",31"parameters": {32"type": "object",33"properties": {34"to": {"type": "string"},35"subject": {"type": "string"},36"body": {"type": "string"}37},38"required": ["to", "subject", "body"]39}40}41}The while loop runs until the model produces a response without tool calls (task complete)42]4344def run_agent(user_request: str, tool_implementations: dict[str, Callable]) -> str:45messages = [{"role": "user", "content": user_request}]4647while True:48response = client.chat.completions.create(49model="gpt-4o",No maximum iteration limit — this agent can loop infinitely if the model never terminates. Add max_turns!50messages=messages,51tools=tools,52tool_choice="auto"53)5455msg = response.choices[0].message5657if not msg.tool_calls:58return msg.content # Agent is done5960# Execute each tool call61messages.append(msg)62for call in msg.tool_calls:63tool_name = call.function.name64args = json.loads(call.function.arguments)65result = tool_implementations[tool_name](**args)66messages.append({67"role": "tool",68"tool_call_id": call.id,69"content": json.dumps(result)70})
1from openai import OpenAI2from enum import Enum3from typing import Callable, Any4import json56client = OpenAI()7ActionRisk enum: classify each tool by its reversibility — READ_ONLY, REVERSIBLE, IRREVERSIBLE8class ActionRisk(Enum):9READ_ONLY = "read_only" # Safe to execute automatically10REVERSIBLE = "reversible" # Confirm with user if uncertain11IRREVERSIBLE = "irreversible" # Always confirm with user1213class SafeAgent:14"""15Agent with safety controls:16- Maximum iteration limit (prevents infinite loops)17- Risk-based human confirmation for irreversible actions18- Tool error handling with graceful degradation19"""20max_turns=10: CRITICAL — prevents infinite loops that consume budget and take endless irreversible actions21def __init__(self, max_turns: int = 10):22self.max_turns = max_turns23self.tool_registry: dict[str, tuple[Callable, ActionRisk]] = {}24self.turn_count = 02526def register_tool(self, name: str, fn: Callable, risk: ActionRisk, schema: dict):27self.tool_registry[name] = (fn, risk)2829def run(self, user_request: str, require_confirmation: Callable | None = None) -> str:30messages = [{"role": "user", "content": user_request}]31self.turn_count = 03233while self.turn_count < self.max_turns:Turn count check: agent stops and returns failure message when max_turns is reached34self.turn_count += 135response = client.chat.completions.create(36model="gpt-4o",37messages=messages,38tools=self._get_tool_schemas(),39)4041msg = response.choices[0].message42if not msg.tool_calls:43return msg.content # Task complete4445messages.append(msg)46for call in msg.tool_calls:47tool_name = call.function.nameConfirmation gate for irreversible actions: human-in-the-loop before emails are sent, files deleted, etc.48args = json.loads(call.function.arguments)4950fn, risk = self.tool_registry[tool_name]5152# Require confirmation for irreversible actions53if risk == ActionRisk.IRREVERSIBLE and require_confirmation:54confirmed = require_confirmation(55f"Agent wants to call {tool_name} with args: {args}. Approve?"56)57if not confirmed:58result = {"error": "Action cancelled by user", "args": args}59else:60result = self._safe_execute(fn, args)_safe_execute: wrap all tool calls in try/except — tool failures should not crash the agent loop61else:62result = self._safe_execute(fn, args)6364messages.append({"role": "tool", "tool_call_id": call.id,65"content": json.dumps(result)})6667return f"Agent reached maximum turns ({self.max_turns}) without completing task."6869def _safe_execute(self, fn: Callable, args: dict) -> Any:70try:71return {"success": True, "result": fn(**args)}72except Exception as e:73return {"success": False, "error": str(e)}
1from openai import OpenAI2from anthropic import Anthropic34openai_client = OpenAI()5anthropic_client = Anthropic()67# Level 3: Multi-agent orchestration8# Orchestrator decomposes task, delegates to specialized sub-agents9# Each sub-agent has a specific role and limited tool access10Level 3: Multi-agent orchestration — separate agents for planning, research, analysis, and review11class ResearchOrchestrator:12"""13Orchestrator pattern: one model coordinates multiple specialized agents.14Each sub-agent is sandboxed with only the tools it needs.Orchestrator pattern: one "thinking" model decomposes tasks, specialized agents execute15"""1617def run_research_pipeline(self, research_question: str) -> dict:18# Step 1: Orchestrator decomposes the task19plan = self._plan_research(research_question)2021# Step 2: Research agent gathers information (READ-ONLY tools only)Human approval gate before the pipeline produces any output — irreversible actions require explicit confirmation22raw_research = self._research_agent(plan["search_queries"])2324# Step 3: Analysis agent synthesizes findings (no external tools)25analysis = self._analysis_agent(research_question, raw_research)2627# Step 4: Review agent checks for errors (READ-ONLY)28review = self._review_agent(analysis)2930# Step 5: Human approval before any writing/publishing31return {"analysis": analysis, "review": review, "status": "awaiting_approval"}3233def _plan_research(self, question: str) -> dict:34"""Orchestrator: decompose task into sub-tasks."""35response = openai_client.chat.completions.create(36model="gpt-4o",Research agent: READ-ONLY tools only. Cannot write files, send emails, or take side-effecting actions.37messages=[{38"role": "user",39"content": f"Decompose this research question into 3-5 specific search queries: {question}"40}],41response_format={"type": "json_object"},42)43return {"search_queries": response.choices[0].message.content}4445def _research_agent(self, queries: str) -> list[str]:46"""Research sub-agent: web search + database lookup (READ ONLY)."""47results = []Analysis agent: no external tools at all. Pure reasoning only — reduces attack surface and failure modes.48for query in self._parse_queries(queries):49results.append(self._web_search(query)) # Read-only tool only50return results5152def _analysis_agent(self, question: str, research: list) -> str:53"""Analysis sub-agent: synthesize findings. No external tools — only reasoning."""54response = anthropic_client.messages.create(55model="claude-3-5-sonnet-20241022",56max_tokens=4096,57messages=[{58"role": "user",59"content": f"Question: {question}\n\nResearch: {research}\n\nSynthesize findings."60}]61)62return response.content[0].text
The right agent architecture depends on the task structure, acceptable error rate, and required autonomy level.
Architecture 1: Linear Pipeline (Data Processing) A data engineering team builds an agent to process daily sales reports: extract data → validate → transform → load into warehouse → send summary email. Each step is verifiable and mostly deterministic. The agent runs nightly as a scheduled job. Architecture: a simple linear pipeline with explicit step verification (check row counts match, validate schema) between each step. Human notification (not approval) for failures. Autonomy: high — steps are well-defined and errors are detectable. This is the most reliable agent pattern: deterministic steps, explicit validation, narrow task scope.
Architecture 2: Reactive Agent (Customer Support) A customer support agent handles tier-1 requests: look up order status, process returns, update shipping addresses. Each action is user-initiated and bounded. Architecture: tool-use agent with explicit risk classification (read-only: order lookup; reversible: address update; irreversible: refund processing). Confirmation required before refunds. Max 5 turns per conversation. Human escalation if the agent cannot resolve within 5 turns. The bounded tool set and confirmation gates for high-stakes actions make this reliable in production.
Architecture 3: Autonomous Research (With Human in the Loop) A competitive intelligence team wants an agent to monitor competitor websites, compile weekly reports, and flag significant changes. Architecture: multi-agent pipeline — Collector agent (web scraping, read-only) feeds data to Analyzer agent (pattern detection, no external tools) feeds to Report agent (document generation). Human reviews the generated report before it is distributed. Critical design: the agent never sends the report autonomously — it always goes through human review. The "autonomy" is in data collection and analysis, not in output distribution.
Three failure patterns that appear consistently across production agent deployments.
Senior engineers ask: "What tools should our agent have?" Principal engineers ask: "What is the minimum set of capabilities the agent needs to complete this task, and what is the blast radius of each capability?"
The principal insight is that agent reliability is fundamentally about blast radius management. Every tool an agent has access to represents a potential failure mode. The agent with access to `send_email`, `delete_record`, and `make_payment` can cause significantly more damage when it fails than an agent with only `read_database` and `draft_response`. Minimum viable tool access is a security and reliability principle, not just a permission issue.
The frontier in 2025: agent evaluation is the hardest unsolved problem in AI engineering. How do you evaluate an agent that takes 15 actions to complete a task? Current approaches: task completion rate (did it finish?), trajectory evaluation (were intermediate steps reasonable?), and safety evaluation (did it attempt any dangerous actions?). The gap between these evaluations and real-world reliability is significant — many agents score well on curated eval tasks and fail on the long tail of production edge cases.
The five-year arc: agents will become more reliable through better planning models (that decompose tasks more accurately), better error recovery (that recognize when a step has failed and backtrack), and better tool design (that makes side effects explicit and reversible). The engineers who understand the failure modes deeply will build the safety architectures that make agents trustworthy.
Common questions:
Strong answer: First asks "what is the blast radius of each tool" before discussing which tools to give the agentDescribes the confirm-artifact-not-intention principle for human-in-the-loop designHas implemented or designed agent tracing/observability and can explain what each turn log containsUnderstands that agent reliability improvements are about guardrails, not just better models
Red flags: Defines an agent as "an LLM with tool access" without addressing error propagation, blast radius, or human oversightNo mention of maximum turn limits or what happens when an agent gets stuckBelieves prompt injection only affects systems that explicitly allow user input (misses web scraping attack surface)Treats human confirmation as an optional feature rather than a required safeguard for irreversible actions
Scenario · A legal tech startup building a contract review automation tool
Step 1 of 2
Legal teams want an AI agent to review contracts, flag issues, research case law, and draft redlines. The agent needs to be autonomous enough to be useful but safe enough to be trustworthy in a legal context where errors have real consequences.
A partner at a law firm asks your agent to "review this contract and send redlines to the opposing counsel." The agent can read the contract, research similar clauses, draft redlines, and send emails. Should you design the agent to complete this task autonomously?
What is the correct level of autonomy for this task?
Quick check · AI Agents in Production
1 / 4
Key takeaways
What is the difference between a commission error and an omission error in agent design, and which is more dangerous?
Commission error: agent does the wrong thing (sends wrong email, deletes wrong records). Omission error: agent fails to complete the task. Commission errors are more dangerous because they have real-world consequences that may be irreversible. Production agents should bias toward omission: when uncertain, stop and ask, rather than proceed and risk a commission error.
Why is showing users the actual artifact (the email, the SQL query) more important than showing them the agent's description of its intended action?
Agents describe their intentions optimistically. The actual artifact may contain errors, omissions, or off-brand language that the description does not capture. "I want to send a follow-up email" sounds benign — the actual email might contain legally problematic statements. Confirmation of artifacts, not intentions, is what provides real safety.
What is prompt injection in the context of agents, and how do you defend against it?
Prompt injection: malicious content in external sources (web pages, emails, documents) that the LLM interprets as instructions rather than data. Defense: use XML tags or clear delimiters to separate trusted instructions from untrusted external data, explicitly instruct the model that web content is data not instructions, and filter outputs for signs the model followed injected instructions.
From the books
AI Engineering — Chapter 17: Agents — Chip Huyen (2024)
Huyen's key insight on agent reliability: "Agents fail in two ways: commission errors (doing the wrong thing) and omission errors (failing to complete the task). Commission errors are more dangerous because they have real-world consequences. Production agent design should bias heavily toward omission: when in doubt, do nothing and ask for clarification."
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al., Meta AI (2023)
Toolformer demonstrated that models can learn to call tools (APIs, calculators, search engines) through self-supervised training rather than explicit instruction following. The implication: as models improve, tool-use quality improves as a byproduct of general capability improvement — but the safety constraints around tool use remain an engineering responsibility, not a model capability.
Constitutional AI: Harmlessness from AI Feedback — Bai et al., Anthropic (2022)
Constitutional AI's relevance to agents: the same principle that guides model behavior (a set of principles the model uses to evaluate and revise its own outputs) can guide agent behavior. Agents can be designed to evaluate each proposed action against a set of safety principles before executing — creating a self-check step between "plan action" and "execute action."
Ready to see how this works in the cloud?
Switch to Career Paths for structured paths (e.g. Developer, DevOps) and provider-specific lessons.
View role-based pathsSign in to track your progress and mark lessons complete.
Questions? Discuss in the community or start a thread below.
Join DiscordSign in to start or join a thread.