Multimodal AI Apps: Vision & Document Understanding

On this page

The messy invoice problem
What "multimodal" actually means
The picture: image in, structured data out
Where vision models earn their keep
Build it: extract fields from an invoice
Vision tokens, resolution, and cost
Common mistakes that cost hours (and dollars)
Takeaways
Where to go next

TL;DR

Send images, PDFs, and screenshots straight to a vision-language model and get structured data back, with no brittle OCR pipeline. You will learn to build that flow, prompt the model for clean fields, and control vision-token cost.

The messy invoice problem

A vendor emails you an invoice. It is a scanned PDF, slightly rotated, with a coffee-ring smudge over the line items and a handwritten note in the margin. You need five fields out of it: invoice number, date, vendor name, total, and currency. The classic answer is an OCR pipeline: run Tesseract, get a soup of characters back, then write regexes and heuristics to find the total. That pipeline works until the next vendor uses a different layout, and then it quietly breaks.
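
To make that brittleness concrete, here is a minimal sketch of the regex stage. The two OCR strings are invented for illustration, and `find_total` is a hypothetical helper, not part of any library.

```python
import re

# Hypothetical OCR output for two vendors; the text is invented for illustration.
OCR_VENDOR_A = """INVOICE INV-2041
Date: 2026-05-30
Total: 1284.50 EUR"""

OCR_VENDOR_B = """Rechnung Nr. 7731
Amount due
EUR 1.284,50"""

# A typical first-attempt rule: the word "Total", then a decimal number.
TOTAL_RE = re.compile(r"Total:?\s*(\d+\.\d{2})")

def find_total(ocr_text):
    """Return the total as a float, or None when the layout does not match."""
    match = TOTAL_RE.search(ocr_text)
    return float(match.group(1)) if match else None

total_a = find_total(OCR_VENDOR_A)  # the layout the regex was written for
total_b = find_total(OCR_VENDOR_B)  # different label, order, and number format
print(total_a, total_b)
```

The second vendor does not raise an error; the function just returns nothing, which is the "quietly breaks" failure mode.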

Vision-language models change the shape of this problem. Instead of "extract text, then parse text," you hand the model the image itself and a prompt: "Return invoice_number, date, vendor, total, and currency as JSON." The model reads the document the way a person does (layout, context, and characters together) and gives you the fields. No regex graveyard. This article shows how to build that, prompt it well, and stay honest about what vision tokens cost.

Who this is for

Developers who can already call an LLM API and want to add images, screenshots, and PDFs to the mix. If you have never made an API call to a model, read Working with the LLM API first; this article builds directly on it.

What "multimodal" actually means

A vision-language model does not run OCR and then read the text. It sees pixels and tokens in the same context window, so layout, color, and characters all inform the answer at once.

The key mental shift: the image is not a separate preprocessing step that produces text for the model. The image *is* input to the model, alongside your words. The model converts the picture into image tokens, places them next to your text tokens, and attends across all of them. That is why it can answer "which row has the highest amount?": it sees the table as a table, not as a flattened character stream.

Everyday analogy	Technical equivalent
Photocopying a page, then reading the photocopy	OCR-then-parse: lossy text extraction, layout thrown away
Handing the original page to a colleague who reads it	Vision-language model: pixels + prompt in one context
Asking that colleague a specific question about the page	Image-grounded Q&A: prompt scopes what to extract
Asking for the answer on a sticky note in a fixed format	Structured output: schema-constrained JSON response

Two ways to read a document

The picture: image in, structured data out

A document-extraction flow: the image and a schema-bearing prompt go to the model together; structured JSON comes back and is validated before use.

1. Take the source document. A PDF, a phone photo, or a screenshot. For PDFs, either send pages as images or use a model/endpoint that accepts PDFs directly.
2. Prepare the input. Resize to a sane resolution, encode as base64 (or pass a URL), and choose a detail/resolution level. This step quietly controls most of your cost.
3. Send image plus a schema-bearing prompt. The prompt names the exact fields you want and the format. Attaching a JSON schema or tool definition makes the format non-optional.
4. Validate and parse the response. Never trust raw model text. Parse the JSON, validate it against your schema, and reject or retry on a mismatch.
5. Measure against real documents. Run the whole flow over a labeled set of your actual messy docs, not three clean samples, and track field-level accuracy.
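
The "prepare the input" step can be sketched as a small helper that turns raw bytes into a base64 image block. The dict layout follows the image block used in the Anthropic example later in this article; the helper name `image_block_from_bytes` and its signature are assumptions made for this sketch, and resizing is left out.

```python
import base64
import mimetypes

# Image types assumed to be accepted; confirm against your provider's docs.
SUPPORTED = {"image/png", "image/jpeg", "image/gif", "image/webp"}

def image_block_from_bytes(data, filename):
    """Build a base64 image content block from raw bytes.

    The media type is guessed from the file extension, which is a
    simplification; it does not inspect the bytes themselves.
    """
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED:
        raise ValueError(f"unsupported image type for {filename!r}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }

# Usage with placeholder bytes (not a real PNG; enough to show the shape).
block = image_block_from_bytes(b"\x89PNG placeholder", "invoice.png")
print(block["source"]["media_type"], len(block["source"]["data"]))
```

Rejecting unknown types up front keeps a stray `.txt` or `.tiff` upload from turning into a confusing API error later.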

Where vision models earn their keep

Document extraction is the headline use case, but the same "image plus prompt in, structured answer out" pattern covers a surprising range. The approach changes mostly in the prompt and the output schema, not the plumbing.

Use case	Approach
Document extraction	Send the page image plus a field schema; ask for JSON; validate every field against the schema before use.
Screenshot QA / support	Send the UI screenshot plus the user's question; ground the answer in what is visibly on screen; cite the element.
Chart & graph reading	Send the chart image; ask for the underlying data points or the trend in a fixed shape, not free prose.
Image moderation	Send the image plus a policy rubric; ask for a category label and a confidence, not a yes/no, so you can set thresholds.

Common multimodal use cases and how to approach each
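
As one illustration of "swap the schema, keep the plumbing," here is a sketch of the moderation row. The tool name, the category list, and the threshold values are all hypothetical, and a model's self-reported confidence is only a rough score that you would need to calibrate on labeled data.

```python
# Hypothetical tool schema: the tool name, categories, and field names are
# illustrative assumptions, not a standard moderation taxonomy.
MODERATION_TOOL = {
    "name": "record_moderation",
    "description": "Record the moderation decision for an image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["safe", "violence", "adult", "other"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["category", "confidence"],
    },
}

def route(result, block_at=0.9, review_at=0.5):
    """Map a model result to an action using thresholds you control."""
    if result["category"] == "safe":
        return "allow"
    if result["confidence"] >= block_at:
        return "block"
    if result["confidence"] >= review_at:
        return "human_review"
    return "allow"

# Stand-in results shaped like the tool input the model would return.
print(route({"category": "violence", "confidence": 0.95}))
print(route({"category": "adult", "confidence": 0.6}))
print(route({"category": "safe", "confidence": 0.99}))
```

Because the model returns a label and a score rather than a yes/no, you can tighten or loosen the thresholds without changing the prompt.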

Build it: extract fields from an invoice

Here is a complete, copy-pasteable example: read an image off disk, send it with a prompt that demands a strict shape, and parse the structured result. We force the output through a tool/schema definition so the model cannot ramble; it must fill in the fields.

extract_invoice.py

python

import base64, json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 1. Read the image and base64-encode it.
with open("invoice.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# 2. Define the exact shape we want back as a tool.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date":           {"type": "string", "description": "ISO 8601 (YYYY-MM-DD)"},
            "vendor":         {"type": "string"},
            # A number cannot be an empty string, so the not-found value here is null.
            "total":          {"type": ["number", "null"], "description": "null if no total is shown"},
            "currency":       {"type": "string", "description": "ISO 4217, e.g. EUR"},
        },
        "required": ["invoice_number", "date", "vendor", "total", "currency"],
    },
}

# 3. Send image + prompt together; force the model to call our tool.
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    tools=[INVOICE_TOOL],
    tool_choice={"type": "tool", "name": "record_invoice"},
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_b64,
                },
            },
            {
                "type": "text",
                "text": (
                    "Extract the invoice fields from this image. "
                    "If a field is genuinely not present, return an empty string "
                    "rather than guessing."
                ),
            },
        ],
    }],
)

# 4. The model's answer is the tool input, already structured.
fields = next(b.input for b in resp.content if b.type == "tool_use")
print(json.dumps(fields, indent=2))
# -> {"invoice_number": "INV-2041", "date": "2026-05-30",
#     "vendor": "Northwind Supplies", "total": 1284.50, "currency": "EUR"}

Tell it how to fail

The line "return an empty string rather than guessing" is doing real work. Vision models will confidently hallucinate a plausible total if you do not give them an explicit out for missing data. Always design the not-found path into your prompt and schema.

Notice there is no OCR step, no regex, and no layout config. Swap the schema and the prompt and the same short script reads a chart, classifies a screenshot, or moderates an image. To check exact model ids and image limits before you ship, see the Anthropic API reference.
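
The script above prints the fields without checking them, so the "validate and parse" step still needs doing. Here is a minimal sketch of that check using only the standard library. `validate_invoice` is a hypothetical helper, the rules are assumptions about what a sane invoice record looks like, and an empty string (or null, if your schema allows it) is treated as "not found" rather than as an error.

```python
import re
from datetime import date

REQUIRED = ("invoice_number", "date", "vendor", "total", "currency")

def validate_invoice(fields):
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing key: {key}" for key in REQUIRED if key not in fields]

    day = fields.get("date")
    if day:  # an empty value means "not found", so skip the format check
        try:
            if not (isinstance(day, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", day)):
                raise ValueError(day)
            date.fromisoformat(day)  # also rejects impossible dates like 2026-02-30
        except ValueError:
            problems.append(f"date is not a valid YYYY-MM-DD: {day!r}")

    currency = fields.get("currency")
    if currency and not (isinstance(currency, str) and re.fullmatch(r"[A-Z]{3}", currency)):
        problems.append(f"currency is not a 3-letter code: {currency!r}")

    total = fields.get("total")
    if total not in (None, ""):
        if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
            problems.append(f"total is not a non-negative number: {total!r}")
    return problems

good = {"invoice_number": "INV-2041", "date": "2026-05-30",
        "vendor": "Northwind Supplies", "total": 1284.50, "currency": "EUR"}
bad = {"invoice_number": "INV-2041", "date": "30/05/2026",
       "vendor": "", "total": -5, "currency": "euro"}
print(validate_invoice(good))
print(validate_invoice(bad))
```

A non-empty problem list is the signal to reject the record or retry the call, rather than letting a malformed value flow into downstream systems.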

Vision tokens, resolution, and cost

Images are not free, and they are not priced like text. A model turns your image into image tokens, and you pay for those tokens just as you pay for text tokens. The token count scales with the image's resolution: a bigger picture means more tiles, more tokens, more cost, and a little more latency. Formulas differ by provider, so check the documentation for the model you use. One published rule of thumb (Anthropic's, for Claude) is tokens ≈ (width × height) / 750, so a 1500×2000 page is on the order of 4,000 tokens by that formula, before the model writes a single word of output.
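
The rule of thumb is easy to turn into a helper for back-of-the-envelope budgeting. This is a sketch: the divisor is only an approximation, and the price passed to `estimate_cost_usd` is a placeholder you would replace with your provider's current input-token rate.

```python
import math

def estimate_image_tokens(width, height, pixels_per_token=750):
    """Rough token estimate from pixel area; the divisor is a rule of thumb."""
    return math.ceil(width * height / pixels_per_token)

def estimate_cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a caller-supplied price."""
    return tokens * usd_per_million_tokens / 1_000_000

page_tokens = estimate_image_tokens(1500, 2000)
print(page_tokens)  # 4000

# 3.00 is a made-up price used only to show the arithmetic.
print(estimate_cost_usd(page_tokens, usd_per_million_tokens=3.00))
```

Treat the result as an estimate for planning; the usage figures returned by the API are the numbers that determine your bill.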

The practical lever is resolution. Send the smallest image that is still legible for your task. For a full-page invoice you usually do not need a 12-megapixel phone photo; a downscaled version of roughly 1000–1600px on the long edge reads the same text at a fraction of the tokens. Some APIs (OpenAI's image inputs, for example) also expose a detail or resolution setting ("low" vs "high"): low uses a fixed small token budget and is great for "is this a cat or a dog?"; high tiles the image for fine text and costs more. Match the setting to the task.

Choice	Effect
Send the raw 12 MP photo	Maximum tokens and latency; usually wasted, because large images get downsampled anyway.
Downscale to long edge ~1280px	Text stays legible for most documents at a large token saving.
Low / fixed-detail mode	Tiny fixed token cost; good for coarse classification, bad for fine print.
High / tiled-detail mode	More tokens; needed for dense tables, small fonts, and serial numbers.

Resolution is the cost dial; pick it deliberately
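
To see how much the downscale row saves, here is a sketch that computes target dimensions and compares rough token estimates. It is pure arithmetic and does not touch image data; in a real pipeline you would do the resize itself with an image library (Pillow's `Image.thumbnail` is one option). The function names and the 1280px default are assumptions for illustration.

```python
def fit_long_edge(width, height, max_long_edge=1280):
    """Return (width, height) scaled so the longer side is at most max_long_edge.

    Aspect ratio is preserved, and images that are already small are not upscaled.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))

def approx_tokens(width, height):
    # Rough area-based estimate (pixels / 750); not an exact bill.
    return round(width * height / 750)

raw = (4000, 3000)            # a 12 MP phone photo
small = fit_long_edge(*raw)   # (1280, 960)
print(small, approx_tokens(*raw), approx_tokens(*small))
```

By this estimate the downscaled image needs roughly a tenth of the tokens of the raw photo. The raw figure is an upper bound from the formula, since an API that downsamples large uploads may bill fewer tokens than this.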

Multi-page PDFs multiply fast

Each page is its own image with its own token bill. A 30-page contract sent at full resolution can dwarf the cost of the prompt. Send only the pages you need, or route a cheap model to find the relevant pages first.
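
A quick sketch of how the per-page bill adds up, and what trimming pages saves. The page size, the page indexes, and the area-based estimate are all assumptions chosen for illustration.

```python
def page_tokens(width, height):
    # Rough area-based estimate (pixels / 750); not an exact bill.
    return round(width * height / 750)

def document_tokens(page_sizes, keep=None):
    """Sum estimated image tokens over pages.

    page_sizes: list of (width, height) per page, in pixels.
    keep: optional iterable of zero-based page indexes to send; None sends all.
    """
    indexes = range(len(page_sizes)) if keep is None else keep
    return sum(page_tokens(*page_sizes[i]) for i in indexes)

contract = [(1500, 2000)] * 30                         # 30 pages at 1500x2000
all_pages = document_tokens(contract)                  # 120000
needed = document_tokens(contract, keep=[0, 28, 29])   # 12000
print(all_pages, needed)
```

Sending three pages instead of thirty cuts the estimated image tokens by 90 percent, which is why a cheap page-selection pass can pay for itself.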

Common mistakes that cost hours (and dollars)

1. Sending huge images. Forwarding raw camera uploads straight to the API burns tokens for resolution the model discards. Resize before you send; it is the single biggest cost lever.
2. No structured output. Asking for fields in free prose means you are back to regex-parsing the model's text, the exact pain you were escaping. Use a tool/schema and force it.
3. No not-found path. Without an explicit "return empty if absent" instruction, vision models invent plausible-looking values. A hallucinated invoice total is worse than a blank one.
4. Evaluating on clean samples. Three crisp PDFs will pass. Your eval set must be real, messy, rotated, low-light documents, scored field-by-field; that is the only number that predicts production.
5. Ignoring page count. Treating a multi-page PDF as one cheap call. Each page is a separate image bill; budget and trim accordingly.
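
Field-by-field scoring can be sketched in a few lines. This version uses exact-match comparison, which is an assumption; you may want looser matching for vendor names or numeric tolerance for totals. The sample records are invented.

```python
from collections import defaultdict

def field_accuracy(predictions, labels):
    """Per-field exact-match accuracy over paired prediction/label dicts."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, labels):
        for field, expected in gold.items():
            total[field] += 1
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Invented labels and model outputs for two documents.
labels = [
    {"invoice_number": "INV-2041", "total": 1284.50, "currency": "EUR"},
    {"invoice_number": "A-77", "total": 90.00, "currency": "USD"},
]
predictions = [
    {"invoice_number": "INV-2041", "total": 1284.50, "currency": "EUR"},
    {"invoice_number": "A-77", "total": 99.00, "currency": "USD"},
]
scores = field_accuracy(predictions, labels)
print(scores)
```

Reporting accuracy per field shows where the model struggles (here, the total) instead of hiding it inside a single document-level pass rate.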

Takeaways

The whole article in seven lines

Vision-language models read the image and your prompt in one context; skip the OCR-then-regex pipeline.
The pattern is always: image + schema-bearing prompt in, validated structured output out.
Force the output shape with a tool/JSON schema; never parse free prose.
Always give the model an explicit not-found path so it stops hallucinating fields.
Resolution is your cost dial: send the smallest legible image and match low/high detail to the task.
Multi-page PDFs bill per page; send only the pages you need.
Evaluate on your real, messy documents scored field-by-field; clean samples lie.

Where to go next

Multimodal extraction is the LLM API plus images plus structured output, so the two skills underneath it are the ones to solidify next. Then take it into the AI Engineer path to build it into a real product.

Working with the LLM API: the request/response basics this article assumes.
Structured Output & Tool Calling: how to force JSON shapes and parse them safely.
AI Engineer career path: the full track from prompting to production AI features.

Check your understanding

1. What is the core mental shift the article describes for vision-language models versus a classic OCR pipeline?

2. Why does the article insist on telling the model to 'return an empty string rather than guessing'?

Frequently asked questions

How is a vision-language model different from an OCR pipeline?

An OCR pipeline extracts characters first and then parses that text with regexes and heuristics, which breaks the moment a vendor uses a different layout. A vision-language model sees pixels and tokens in the same context window, so it reads layout, color, and characters together and returns the fields directly, with no regex graveyard.

How do I stop a vision model from hallucinating a value that is not in the document?

Design the not-found path into your prompt and schema, for example by instructing the model to return an empty string rather than guessing. Vision models will confidently invent a plausible total if you do not give them an explicit out for missing data.

How do I get clean structured data instead of a rambling answer?

Force the output through a tool or schema definition so the model must fill in specific fields rather than write free text. Pairing the image with a prompt that demands a strict shape, like returning invoice_number, date, vendor, total, and currency as JSON, keeps the result parseable.

Is document extraction the only use case for this approach?

No. The same pattern of image plus prompt in, structured answer out covers a surprising range of tasks. What changes is mostly the prompt and the output schema, not the underlying plumbing.
