Structured Output & Tool Calling

On this page

The problem with prose
Two mental models
The tool-calling loop
Four levels of reliability
Code: a tool definition and the loop
Code: structured output with a schema
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

Stop regex-parsing prose. You will learn to force clean JSON against a schema you define and to run the request-execute-return loop that lets the model safely call your functions, with four levels of reliability to choose from.

The problem with prose

You ask the model to "extract the customer's name, email, and order total" and it replies: "Sure! The customer is Jane Doe (jane@acme.io) and her order came to $429." Lovely for a human. Useless for your code. Now you are writing a regex to pull the email out, a second one for the dollar amount, and a third to handle the time it decided to answer in Spanish. Every prompt tweak breaks your parser.

The fix is to stop treating the model as a chatbot and start treating it as a function that returns data. Modern LLM APIs give you two tools for this: structured output (the model must answer in a JSON shape you define) and tool calling (the model can ask *your* code to run a function and use the result). Get these right and the LLM becomes a reliable component you can wire into real software.
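
To make the contrast concrete, here is a minimal sketch using only the standard library. The two prose replies and the JSON payload are made up for illustration; they stand in for what a model might return to the same request with and without a constrained output.

```python
import json
import re
from typing import Optional

# Two replies a chat model might plausibly give to the same prompt (illustrative).
reply_a = "Sure! The customer is Jane Doe (jane@acme.io) and her order came to $429."
reply_b = "Jane Doe ordered 429 USD worth of goods; contact: jane@acme.io."

# A regex written against the phrasing of reply_a.
TOTAL_RE = re.compile(r"\$(\d+(?:\.\d+)?)")

def total_from_prose(reply: str) -> Optional[float]:
    """Pull the dollar amount out of free text, or None if the phrasing changed."""
    match = TOTAL_RE.search(reply)
    return float(match.group(1)) if match else None

# The same facts returned as data: one parser, no phrasing to depend on.
structured = '{"customer_name": "Jane Doe", "email": "jane@acme.io", "total_usd": 429}'

def total_from_json(payload: str) -> float:
    """Read the total from a JSON object with a known key."""
    return float(json.loads(payload)["total_usd"])

print(total_from_prose(reply_a))    # 429.0
print(total_from_prose(reply_b))    # None
print(total_from_json(structured))  # 429.0
```

The regex is not wrong; it is just coupled to one phrasing. The JSON path only depends on the key name, which is the part you control.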

Who this is for

You have made a basic LLM API call (see Working with the LLM API) and now you need machine-readable answers, or you want the model to actually *do* things: query a database, call an API, send an email. No ML background required. We use Python, but the concepts map to any SDK.

Two mental models

Structured output is a form the model must fill in. Tool calling is a set of buttons the model is allowed to press.
The whole article in one line

Both are about constraining the model. Left unconstrained, it generates the most plausible-sounding text. That is great for an essay and terrible for an integration. A form has labelled boxes and the model has to put something valid in each one. A button does nothing until pressed, and when pressed, *your* code runs, not the model's imagination.

Everyday analogy	LLM equivalent
A paper form with labelled fields and checkboxes	JSON schema / structured output: the model must return the exact shape
Required fields you cannot leave blank	Schema 'required' array: the model cannot omit them
A row of buttons on a dashboard	Tool definitions: named functions the model may invoke
Pressing a button triggers a machine, not the operator	Your backend executes the tool; the model never runs code itself

Two ways to constrain a free-text model into something your code can trust.
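
The two analogies can be sketched in a few lines of plain Python. Everything here is hypothetical and chosen for illustration: ORDER_FORM, lookup_order, and the helper names are not part of any SDK. The point is only that a form is data you check an answer against, while a button is a name your code maps to a function.

```python
# The "form": a JSON schema with labelled boxes and required fields.
ORDER_FORM = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["customer_name", "email"],
}

def blank_boxes(form: dict, answer: dict) -> list:
    """Return the required fields that the answer left out."""
    return [name for name in form["required"] if name not in answer]

# The "buttons": named functions the model may ask for. Only this code runs them.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # fake data

BUTTONS = {"lookup_order": lookup_order}

def press(name: str, arguments: dict) -> dict:
    """Run a button by name; unknown names are refused, not guessed."""
    if name not in BUTTONS:
        return {"error": f"unknown tool {name}"}
    return BUTTONS[name](**arguments)

print(blank_boxes(ORDER_FORM, {"customer_name": "Jane Doe"}))  # ['email']
print(press("lookup_order", {"order_id": "A-17"}))
print(press("delete_everything", {}))
```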

The tool-calling loop

The single most misunderstood thing about tool calling: the model does not run your code. It only emits a request that says "please call get_weather with city=Amsterdam." Your program receives that request, runs the real function, and hands the result back. The model then continues with that fact in hand. It is a conversation that loops until the model has everything it needs to answer.

One full turn of the tool-calling loop. The model and your code take turns until a final answer falls out.

1. You send the prompt plus tool definitions. Each API call includes the user message AND a list of tools the model is allowed to use: name, description, and a JSON schema for the arguments.
2. The model decides. It either answers directly, or it returns a 'tool_use' block: the tool name and the arguments it wants, as structured JSON. No prose answer yet.
3. Your code executes the tool. You match the tool name to a real Python function, validate the arguments, and run it. The model is paused, waiting.
4. You return the result. Append the tool result to the conversation (linked to the call's id) and send the whole thing back to the model.
5. The model continues, or loops again. With the result in context it may answer, or request another tool. You repeat until it stops asking and produces a final answer.
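
The five steps can be traced offline with a scripted stand-in for the model. This is a sketch under loud assumptions: fake_model is hypothetical and simply asks for the tool once and then answers, and the message dictionaries are simplified for readability rather than matching any vendor's wire format.

```python
import json

def get_weather(city: str) -> dict:
    return {"city": city, "tempC": 14, "summary": "cloudy"}  # fake data

TOOL_FNS = {"get_weather": get_weather}

def fake_model(messages: list) -> dict:
    """Hypothetical stand-in for the API call: request the tool once, then answer."""
    last = messages[-1]
    if last["role"] == "user":
        return {"type": "tool_use", "id": "call_1",
                "name": "get_weather", "input": {"city": "Amsterdam"}}
    result = json.loads(last["content"])
    return {"type": "text",
            "text": f"It is {result['tempC']} C and {result['summary']} in {result['city']}."}

def run(prompt: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": prompt}]        # step 1: prompt (tools implied)
    for _ in range(max_turns):
        reply = fake_model(messages)                        # step 2: the model decides
        if reply["type"] == "text":
            return reply["text"]                            # no tool requested: final answer
        messages.append({"role": "assistant", "content": reply})
        output = TOOL_FNS[reply["name"]](**reply["input"])  # step 3: your code executes
        messages.append({"role": "tool", "tool_use_id": reply["id"],
                         "content": json.dumps(output)})    # step 4: return the result
    return "Stopped: hit max_turns without a final answer."  # step 5 is bounded

print(run("What's the weather in Amsterdam?"))
```

Note that the model function never touches get_weather. It only names it; the dispatch through TOOL_FNS happens in your code.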

Four levels of reliability

There is a ladder from "hope it parses" to "guaranteed valid." Climb only as high as you need: strict schemas and tools cost more tokens and more setup. Match the technique to the job.

Technique	What you get	Reliability	Use it when
Free text	Natural-language prose	Low, you parse strings yourself	Human reads the output directly
JSON mode	Valid JSON, but any shape	Medium, parses, fields not guaranteed	Quick prototypes, loose shapes
Strict JSON schema	JSON matching your exact schema	High, fields, types, enums enforced	Extraction, classification, data into a DB
Tool calling	Schema-shaped args for named functions	High, plus the model can act, not just answer	The model must DO something (query, send, fetch)

From least to most constrained. Higher rows are cheaper and looser; lower rows are reliable and machine-ready.

JSON mode is not schema enforcement

Plain JSON mode guarantees the output *parses*, not that it has the keys you expect. If you need email to always be present and a string, use a strict schema, not just JSON mode. Many a 2am bug is a missing field that "usually" appears.
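
A small illustration of that gap, using a made-up JSON-mode reply in which the email key is simply absent. Parsing succeeds; the problem only shows up when you check for the keys you rely on.

```python
import json

# Hypothetical JSON-mode output: valid JSON, but the email key is missing.
raw = '{"customer_name": "Jane Doe", "total_usd": 429}'

data = json.loads(raw)  # parses without error

EXPECTED_KEYS = ("customer_name", "email", "total_usd")
missing = [key for key in EXPECTED_KEYS if key not in data]

print(missing)            # ['email']
print(data.get("email"))  # None
```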

Code: a tool definition and the loop

Here is a complete, minimal tool-calling loop. The tool is described with a JSON schema for its arguments. The loop keeps running the model until it stops asking for tools, with a hard cap so a confused model can never spin forever.

tool_loop.py

python

import json
import anthropic

client = anthropic.Anthropic()

# 1. Describe the tool: name, when to use it, and an argument schema.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current temperature for a city, in Celsius.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Amsterdam'"},
            },
            "required": ["city"],
        },
    }
]

# 2. The real function the model is allowed to trigger.
def get_weather(city: str) -> dict:
    # In real life: call a weather API. Here we fake it.
    return {"city": city, "tempC": 14, "summary": "cloudy"}

TOOL_FNS = {"get_weather": get_weather}

def run(prompt: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": prompt}]

    for _ in range(max_turns):  # bounded loop, never unbounded!
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )

        if resp.stop_reason != "tool_use":
            # No tool requested -> this is the final answer.
            return "".join(b.text for b in resp.content if b.type == "text")

        # 3. Run every tool the model asked for, collect the results.
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            fn = TOOL_FNS.get(block.name)
            if fn is None:
                output = {"error": f"unknown tool {block.name}"}
            else:
                try:
                    output = fn(**block.input)  # validate in real code!
                except Exception as e:
                    output = {"error": str(e)}
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(output),
            })

        # 4. Feed the results back and loop.
        messages.append({"role": "user", "content": results})

    return "Stopped: hit max_turns without a final answer."

print(run("What's the weather in Amsterdam? Should I take an umbrella?"))

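Notice three things: the assistant's tool-call message is appended before the result, each result is keyed to its tool_use_id, and the loop has a max_turns ceiling. Drop either of the first two and results can no longer be matched to the calls they answer, so expect a rejected request or a confused model. Drop the third and a stuck model can keep looping until your budget runs out.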
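
The "validate in real code!" comment in the loop deserves a sketch of its own. The helper below is a hypothetical, deliberately small check of tool arguments against the tool's argument schema, covering only required keys, unexpected keys, and a few primitive types. A full JSON Schema validator library would handle much more; this only shows where the check sits.

```python
# Maps JSON-schema type names to Python types (only the few used here).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

def check_args(schema: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    props = schema.get("properties", {})
    problems = [f"missing: {k}" for k in schema.get("required", []) if k not in args]
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected: {key}")
            continue
        expected = JSON_TYPES.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"wrong type: {key}")
    return problems

print(check_args(WEATHER_SCHEMA, {"city": "Amsterdam"}))       # []
print(check_args(WEATHER_SCHEMA, {}))                          # ['missing: city']
print(check_args(WEATHER_SCHEMA, {"city": 5, "units": "F"}))   # ['wrong type: city', 'unexpected: units']
```

One reasonable design is to call this just before fn(**block.input) and, when the list is non-empty, return it as an error in the tool result so the model gets a chance to correct the call instead of crashing your function.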

Code: structured output with a schema

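When you do not need the model to *act* and just want clean data, use a structured-output schema. Define the shape, ask the model to fill it, then validate before you trust it. Pydantic can make the schema and the validation a single source of truth (Order.model_json_schema() generates the JSON schema from the model); the example below spells the schema out by hand so you can see both halves.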

extract.py

python

from typing import Literal

import anthropic
from pydantic import BaseModel, EmailStr, ValidationError

client = anthropic.Anthropic()

# The form the model must fill in.
class Order(BaseModel):
    customer_name: str
    email: EmailStr  # requires the optional email-validator package
    total_usd: float
    priority: Literal["low", "normal", "high"]  # enforced here too, matching the schema enum

SCHEMA = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "email": {"type": "string", "format": "email"},
        "total_usd": {"type": "number"},
        "priority": {"type": "string", "enum": ["low", "normal", "high"]},
    },
    "required": ["customer_name", "email", "total_usd", "priority"],
}

# Trick: expose the schema AS a tool and force the model to call it.
# The 'arguments' it produces are your structured output.
def extract(text: str) -> Order:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        tools=[{
            "name": "record_order",
            "description": "Record the extracted order details.",
            "input_schema": SCHEMA,
        }],
        tool_choice={"type": "tool", "name": "record_order"},  # force it
        messages=[{"role": "user", "content": f"Extract the order:\n{text}"}],
    )

    raw = next(b.input for b in resp.content if b.type == "tool_use")

    # NEVER skip this step. The schema guides the model; it does not
    # guarantee semantics. Validate before the data touches your DB.
    try:
        return Order(**raw)
    except ValidationError as e:
        # Chain the original error so the traceback keeps the field details.
        raise ValueError(f"Model returned invalid order: {e}") from e

order = extract("Jane Doe (jane@acme.io) placed a rush order totalling $429.")
print(order.model_dump())
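
Type checks are only half of validation. As a sketch of the "semantics" the comment in extract() warns about, here are a few business rules written in plain Python against the raw dictionary. The rules themselves (non-negative total, loose email pattern) are assumptions chosen for illustration; in a Pydantic codebase you would more likely express them as field constraints on the model.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose

def semantic_problems(order: dict) -> list:
    """Return human-readable rule violations; an empty list means the order looks sane."""
    problems = []
    if not EMAIL_RE.match(str(order.get("email", ""))):
        problems.append("email does not look like an address")
    total = order.get("total_usd")
    if not isinstance(total, (int, float)) or total < 0:
        problems.append("total_usd must be a non-negative number")
    if order.get("priority") not in {"low", "normal", "high"}:
        problems.append("priority must be low, normal, or high")
    return problems

good = {"customer_name": "Jane Doe", "email": "jane@acme.io", "total_usd": 429, "priority": "high"}
bad = {"customer_name": "Jane Doe", "email": "n/a", "total_usd": -1, "priority": "rush"}

print(semantic_problems(good))       # []
print(len(semantic_problems(bad)))   # 3
```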

Common mistakes that cost hours

1. No validation. A schema *guides* the model; it does not certify the data is sane. A total_usd of -1 or an email of "n/a" can still come back. Always re-validate (Pydantic, zod, JSON-Schema) before the value reaches your database or an irreversible action.
2. Trusting hallucinated arguments. The model can invent a plausible-looking user_id that does not exist, or a city you never support. Treat tool arguments like untrusted user input: check ranges, enums, and existence before you act on them.
3. Unbounded tool loops. Without a max_turns cap, a confused model can call the same tool forever, burning tokens and money. Always bound the loop and log when you hit the ceiling.
4. Vague tool descriptions. "Gets data" tells the model nothing about *when* to use it. Write the description like a docstring for a junior dev: what it does, what each argument means, and when NOT to call it.
5. Forgetting the result id. Each tool_result must reference the tool_use_id of the call it answers. Mismatch them and the model loses the thread of which result belongs to which request.
6. Side effects on retry. If the model calls send_email twice (it happens), you double-send. Make irreversible tools idempotent or require a human confirmation step.

Takeaways

The whole article in seven lines

Treat the LLM as a function that returns data, not a chatbot that returns prose.
Structured output = a form the model must fill in. Tool calling = buttons it may press.
JSON mode guarantees it *parses*; a strict schema guarantees the fields and types.
The model never runs your code: it emits a request, your code executes, and you feed the result back.
The loop is: prompt+tools → tool call → execute → return result → continue → final answer.
Always validate model output before it touches a database or an irreversible action.
Always bound the tool loop with a max-turns ceiling.

Where to go next

Structured output and tool calling are the two primitives that turn an LLM from a text generator into a software component. The natural next step is to chain many tool calls together with memory and a goal; that is what an agent is.

Came in cold? Start with Working with the LLM API for the request/response basics this article builds on.
Ready to go further? Building AI Agents wires this loop into a goal-driven agent with planning and memory.
Want the full track? Follow the AI Engineer path from foundations to production.


Check your understanding

1. What is the most misunderstood thing about tool calling, per the article?

2. What does the article say plain JSON mode guarantees?

Frequently asked questions

What is the difference between structured output and tool calling?

Structured output is a form the model must fill in, meaning it answers in a JSON shape you define. Tool calling is a set of buttons the model is allowed to press, where pressing one runs your code, not the model's imagination. Both work by constraining the model.

Does the model run my function during tool calling?

No, and this is the most misunderstood part. The model only emits a request such as "please call get_weather with city=Amsterdam." Your program receives that request, runs the real function, and hands the result back, then the model continues with that fact in hand.

Is plain JSON mode enough to guarantee the fields I need?

No. Plain JSON mode guarantees the output parses, not that it has the keys you expect. If you need a field like email to always be present and a string, use a strict schema rather than just JSON mode, since a missing field that usually appears is a classic 2am bug.

How do I keep a tool-calling loop from running forever?

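Add a hard cap on the number of turns, like max_turns in the example loop, so a confused model can never spin forever. Left uncapped, the loop runs until the model stops asking for tools, which a stuck model may never do. For the loop to work at all, you also append the assistant's tool-call message before the result and key each result to its call.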
