Multimodal AI Apps: Vision & Document Understanding
Pass images, PDFs, and screenshots straight to a vision-language model and get structured data back, with no brittle OCR pipeline required. Here is how to build it, prompt it, and pay for it.
A vendor emails you an invoice. It is a scanned PDF, slightly rotated, with a coffee-ring smudge over the line items and a handwritten note in the margin. You need five fields out of it: invoice number, date, vendor name, total, and currency. The classic answer is an OCR pipeline: run Tesseract, get a soup of characters back, then write regexes and heuristics to find the total. That pipeline works until the next vendor uses a different layout, and then it quietly breaks.
Vision-language models change the shape of this problem. Instead of "extract text, then parse text," you hand the model the image itself and a prompt: "Return invoice_number, date, vendor, total, and currency as JSON." The model reads the document the way a person does (layout, context, and characters together) and gives you the fields. No regex graveyard. This article shows how to build that, prompt it well, and stay honest about what vision tokens cost.
Who this is for
Developers who can already call an LLM API and want to add images, screenshots, and PDFs to the mix. If you have never made an API call to a model, read [Working with the LLM API](/blog/working-with-the-llm-api) first; this article builds directly on it.
What "multimodal" actually means
A vision-language model does not run OCR and then read the text. It sees pixels and tokens in the same context window, so layout, color, and characters all inform the answer at once.
The key mental shift: the image is not a separate preprocessing step that produces text for the model. The image *is* input to the model, alongside your words. The model converts the picture into image tokens, places them next to your text tokens, and attends across all of them. That is why it can answer "which row has the highest amount?": it sees the table as a table, not as a flattened character stream.
Photocopying a page, then reading the photocopy → OCR-then-parse: lossy text extraction, layout thrown away
Handing the original page to a colleague who reads it → Vision-language model: pixels + prompt in one context
Asking that colleague a specific question about the page → Image-grounded Q&A: prompt scopes what to extract
Asking for the answer on a sticky note in a fixed format → Structured output: schema-constrained JSON response
Two ways to read a document
The picture: image in, structured data out
A document-extraction flow: the image and a schema-bearing prompt go to the model together; structured JSON comes back and is validated before use.
1. Take the source document. A PDF, a phone photo, or a screenshot. For PDFs, either send pages as images or use a model/endpoint that accepts PDFs directly.
2. Prepare the input. Resize to a sane resolution, encode as base64 (or pass a URL), and choose a detail/resolution level. This step quietly controls most of your cost.
3. Send image plus a schema-bearing prompt. The prompt names the exact fields you want and the format. Attaching a JSON schema or tool definition makes the format non-optional.
4. Validate and parse the response. Never trust raw model text. Parse the JSON, validate it against your schema, and reject or retry on a mismatch.
5. Measure against real documents. Run the whole flow over a labeled set of your actual messy docs, not three clean samples, and track field-level accuracy.
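Step 4 can be sketched with the standard library alone. This is a minimal illustration, not a full JSON Schema validator: the function name `validate_invoice`, the `EXPECTED_TYPES` table, and the specific format checks (a YYYY-MM-DD date, a three-letter uppercase currency code) are assumptions chosen to match the five invoice fields used in this article.

```python
import json
import re
from datetime import datetime
from typing import Optional

# Field name -> accepted Python type(s), mirroring the five invoice fields.
EXPECTED_TYPES = {
    "invoice_number": str,
    "date": str,
    "vendor": str,
    "total": (int, float),
    "currency": str,
}


def validate_invoice(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse model output; return (fields, errors). fields is None if anything failed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]

    errors = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            errors.append(f"wrong type for field: {name}")

    # Format checks apply only to non-empty strings; "" is the not-found marker.
    day = data.get("date")
    if isinstance(day, str) and day:
        try:
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", day):
                raise ValueError(day)
            datetime.strptime(day, "%Y-%m-%d")  # rejects impossible dates
        except ValueError:
            errors.append("date is not a valid YYYY-MM-DD date")

    currency = data.get("currency")
    if isinstance(currency, str) and currency and not re.fullmatch(r"[A-Z]{3}", currency):
        errors.append("currency is not a three-letter uppercase code")

    return (data if not errors else None), errors
```

On a non-empty error list, reject the document or re-send the request, optionally including the errors in the follow-up prompt. A dedicated schema-validation library would cover more cases; the point here is only that nothing reaches downstream code unchecked.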
Where vision models earn their keep
Document extraction is the headline use case, but the same "image plus prompt in, structured answer out" pattern covers a surprising range. The approach changes mostly in the prompt and the output schema, not the plumbing.
| Use case | Approach |
| --- | --- |
| Document extraction | Send the page image plus a field schema; ask for JSON; validate every field against the schema before use. |
| Screenshot QA / support | Send the UI screenshot plus the user's question; ground the answer in what is visibly on screen; cite the element. |
| Chart & graph reading | Send the chart image; ask for the underlying data points or the trend in a fixed shape, not free prose. |
| Image moderation | Send the image plus a policy rubric; ask for a category label and a confidence, not a yes/no, so you can set thresholds. |
Common multimodal use cases and how to approach each
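To make the moderation row concrete, here is a hedged sketch of what "a label and a confidence so you can set thresholds" might look like on the receiving side. The function `route_moderation`, the label names, and the threshold values are all hypothetical. A model's self-reported confidence is a score to calibrate against labeled data, not a guaranteed probability.

```python
def route_moderation(
    label: str,
    confidence: float,
    thresholds: dict[str, float],
    default_threshold: float = 0.5,
) -> str:
    """Map a (label, confidence) pair to an action: 'allow', 'review', or 'block'."""
    if label == "safe":
        return "allow"
    # Each policy category can carry its own bar for automatic blocking.
    threshold = thresholds.get(label, default_threshold)
    if confidence >= threshold:
        return "block"
    # Flagged, but not confidently enough to act on automatically.
    return "review"


# Hypothetical per-category thresholds, tuned on your own labeled images.
THRESHOLDS = {"violence": 0.8, "spam": 0.6}

action = route_moderation("violence", 0.65, THRESHOLDS)  # "review"
```

A plain yes/no answer would collapse "block" and "review" into one outcome; keeping the score lets you move the line per category without re-prompting.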
Build it: extract fields from an invoice
Here is a complete, copy-pasteable example: read an image off disk, send it with a prompt that demands a strict shape, and parse the structured result. We force the output through a tool/schema definition so the model cannot ramble; it must fill in the fields.
extract_invoice.py
python
import base64
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 1. Read the image and base64-encode it.
with open("invoice.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# 2. Define the exact shape we want back as a tool.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "description": "ISO 8601 (YYYY-MM-DD)"},
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string", "description": "ISO 4217, e.g. EUR"},
        },
        "required": ["invoice_number", "date", "vendor", "total", "currency"],
    },
}

# 3. Send image + prompt together; force the model to call our tool.
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    tools=[INVOICE_TOOL],
    tool_choice={"type": "tool", "name": "record_invoice"},
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_b64,
                },
            },
            {
                "type": "text",
                "text": (
                    "Extract the invoice fields from this image. "
                    "If a field is genuinely not present, return an empty string "
                    "rather than guessing."
                ),
            },
        ],
    }],
)

# 4. The model's answer is the tool input, already structured.
fields = next(b.input for b in resp.content if b.type == "tool_use")
print(json.dumps(fields, indent=2))
# -> {"invoice_number": "INV-2041", "date": "2026-05-30",
#     "vendor": "Northwind Supplies", "total": 1284.50, "currency": "EUR"}
Tell it how to fail
The line "return an empty string rather than guessing" is doing real work. Vision models will confidently hallucinate a plausible total if you do not give them an explicit out for missing data. Always design the not-found path into your prompt and schema.
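The not-found path also needs a consumer on your side. The sketch below assumes a helper named `missing_fields` (hypothetical) that lists which fields came back blank so they can be routed to a person instead of silently stored. One wrinkle is that an empty string is not a valid value for a numeric field such as `total`. One option, assumed here, is to let that field be null using standard JSON Schema's list-of-types form; whether a given API honors that form is worth checking against its documentation.

```python
# Assumption: permit null for the numeric field, since "" is not a valid number.
TOTAL_PROPERTY = {"type": ["number", "null"]}


def missing_fields(fields: dict) -> list:
    """Names of fields reported as not found: None, or an empty/whitespace-only string."""
    return [
        name
        for name, value in fields.items()
        if value is None or (isinstance(value, str) and not value.strip())
    ]


# Illustrative model output with two fields the model could not find.
extracted = {
    "invoice_number": "INV-2041",
    "date": "",
    "vendor": "Northwind Supplies",
    "total": None,
    "currency": "EUR",
}

needs_review = missing_fields(extracted)  # ["date", "total"]
```

Anything in `needs_review` goes to a human queue or a retry at higher resolution; a blank that is visible is much cheaper than a confident wrong number.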
Notice there is no OCR step, no regex, and no layout config. Swap the schema and the prompt and the same 40 lines read a chart, classify a screenshot, or moderate an image. To check exact model ids and image limits before you ship, see the Anthropic API reference.
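To illustrate the "swap the schema and the prompt" claim, here is a sketch that factors the request into a small builder and reuses it for chart reading. The names `build_request`, `CHART_TOOL`, and `record_chart_points` are hypothetical; the request shape simply mirrors the invoice example above, and no network call is made here.

```python
def build_request(
    tool: dict,
    prompt: str,
    image_b64: str,
    media_type: str = "image/png",
    model: str = "claude-sonnet-4-5",
    max_tokens: int = 512,
) -> dict:
    """Assemble keyword arguments for a messages.create call that forces one tool."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": [tool],
        "tool_choice": {"type": "tool", "name": tool["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_b64},
                },
                {"type": "text", "text": prompt},
            ],
        }],
    }


# Hypothetical schema for chart reading: a list of labeled data points.
CHART_TOOL = {
    "name": "record_chart_points",
    "description": "Record the data points read from a chart image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "points": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"label": {"type": "string"}, "value": {"type": "number"}},
                    "required": ["label", "value"],
                },
            },
        },
        "required": ["points"],
    },
}

request = build_request(
    CHART_TOOL,
    "Read the data points from this chart. Return an empty list if none are legible.",
    image_b64="<base64 image data>",  # placeholder, not real image data
)
# With a real image and client: resp = client.messages.create(**request)
```

Only the tool definition and the prompt changed; the image block, the forced tool choice, and the parsing of the `tool_use` block stay the same.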
Vision tokens, resolution, and cost
Images are not free, and they are not priced like text. A model turns your image into image tokens, and you pay for those tokens just like words. The number of tokens scales with the image's resolution: a bigger picture means more tiles, more tokens, more cost, and a little more latency. A rough rule of thumb, the one Anthropic documents for Claude models, is tokens ≈ (width × height) / 750, so a 1500×2000 page is on the order of 4,000 tokens before the model writes a single word of output. Other providers count differently, and some downscale large images before counting, so treat the figure as an estimate and check the docs for the model you use.
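The rule of thumb is easy to turn into a pre-flight estimate. This is a sketch of that arithmetic only: the function name `estimate_image_tokens` is an assumption, and the divisor of 750 is the approximation quoted above, not an exact billing formula.

```python
def estimate_image_tokens(width: int, height: int, pixels_per_token: int = 750) -> int:
    """Approximate image tokens as (width * height) / pixels_per_token."""
    return round(width * height / pixels_per_token)


page = estimate_image_tokens(1500, 2000)   # 4000
photo = estimate_image_tokens(4000, 3000)  # 16000 for a 12-megapixel photo, by this formula
```

Logging this number next to each request makes it obvious which uploads are driving the bill.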
The practical lever is resolution. Send the smallest image that is still legible for your task. For a full-page invoice you usually do not need a 12-megapixel phone photo; a downscaled version of roughly 1000–1600 px on the long edge reads the same text at a fraction of the tokens. Many APIs also expose a detail or resolution setting ("low" vs "high"): low uses a fixed small token budget and is great for "is this a cat or a dog?"; high tiles the image for fine text and costs more. Match the setting to the task.
| Choice | Effect |
| --- | --- |
| Send the raw 12 MP photo | Maximum tokens and latency; usually wasted, because the model down-samples anyway. |
| Downscale to long edge ~1280px | Text stays legible for most documents at a large token saving. |
| Low / fixed-detail mode | Tiny fixed token cost; good for coarse classification, bad for fine print. |
| High / tiled-detail mode | More tokens; needed for dense tables, small fonts, and serial numbers. |
Resolution is the cost dial; pick it deliberately
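The downscaling choice comes down to computing target dimensions that preserve the aspect ratio. Here is a small sketch of that step, assuming a helper named `fit_long_edge` (hypothetical). It only does the arithmetic; the actual pixel resampling would be done by an imaging library of your choice.

```python
def fit_long_edge(width: int, height: int, max_edge: int = 1280) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_edge.

    Images that already fit are returned unchanged; this never upscales.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


new_size = fit_long_edge(3000, 4000)  # (960, 1280)

# Using the (width * height) / 750 approximation from earlier in this section:
before = round(3000 * 4000 / 750)                # 16000
after = round(new_size[0] * new_size[1] / 750)   # 1638
```

In this example the estimate drops by roughly a factor of ten. Whether the text is still legible at the smaller size is something only your eval set can confirm.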
Multi-page PDFs multiply fast
Each page is its own image with its own token bill. A 30-page contract sent at full resolution can dwarf the cost of the prompt. Send only the pages you need, or route a cheap model to find the relevant pages first.
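The per-page multiplication is worth seeing in numbers. The sketch below assumes each PDF page is rendered to an image of known pixel size and reuses the (width × height) / 750 approximation from this section; `pdf_image_tokens` is a hypothetical helper and the page indices are illustrative.

```python
def pdf_image_tokens(page_sizes, selected=None, pixels_per_token=750) -> int:
    """Estimate image tokens for the chosen pages of a rendered PDF.

    page_sizes: list of (width, height) in pixels, one entry per page.
    selected:   zero-based page indices to send, or None for every page.
    """
    indices = range(len(page_sizes)) if selected is None else selected
    return sum(
        round(page_sizes[i][0] * page_sizes[i][1] / pixels_per_token)
        for i in indices
    )


contract = [(1500, 2000)] * 30  # a 30-page document, each page rendered at 1500x2000

all_pages = pdf_image_tokens(contract)                 # 120000
three_pages = pdf_image_tokens(contract, [0, 14, 29])  # 12000
```

Sending three relevant pages instead of thirty cuts the estimated image tokens tenfold, which is the argument for a cheap page-finding pass before the expensive extraction call.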
Common mistakes that cost hours (and dollars)
Sending huge images. Forwarding raw camera uploads straight to the API burns tokens for resolution the model discards. Resize before you send; it is the single biggest cost lever.
No structured output. Asking for fields in free prose means you are back to regex-parsing the model's text, the exact pain you were escaping. Use a tool/schema and force it.
No not-found path. Without an explicit "return empty if absent" instruction, vision models invent plausible-looking values. A hallucinated invoice total is worse than a blank one.
Evaluating on clean samples. Three crisp PDFs will pass. Your eval set must be real, messy, rotated, low-light documents, scored field by field. That is the only number that predicts production.
Ignoring page count. Treating a multi-page PDF as one cheap call. Each page is a separate image bill; budget and trim accordingly.
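Field-by-field scoring, as recommended in the list above, can be sketched in a few lines. This assumes exact-match comparison between the model's output and a hand-labeled answer for each document; `field_accuracy` is a hypothetical name, and real evaluations often normalize values (whitespace, number formatting) before comparing.

```python
from collections import defaultdict


def field_accuracy(predictions: list, labels: list) -> dict:
    """Per-field exact-match accuracy over paired (prediction, label) dicts."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, expected in zip(predictions, labels):
        for name, expected_value in expected.items():
            total[name] += 1
            # A field missing from the prediction counts as wrong.
            if predicted.get(name) == expected_value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Illustrative data: two labeled documents and the model's output for each.
labels = [
    {"invoice_number": "INV-1", "total": 10.0},
    {"invoice_number": "INV-2", "total": 20.0},
]
predictions = [
    {"invoice_number": "INV-1", "total": 10.0},
    {"invoice_number": "INV-2", "total": 25.0},
]

scores = field_accuracy(predictions, labels)
# {"invoice_number": 1.0, "total": 0.5}
```

A single overall accuracy would hide that `total` is the weak field here; per-field numbers show where to spend prompt and resolution effort.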
Takeaways
The whole article in seven lines
Vision-language models read the image and your prompt in one context; skip the OCR-then-regex pipeline.
The pattern is always: image + schema-bearing prompt in, validated structured output out.
Force the output shape with a tool/JSON schema; never parse free prose.
Always give the model an explicit not-found path so it stops hallucinating fields.
Resolution is your cost dial: send the smallest legible image and match low/high detail to the task.
Multi-page PDFs bill per page; send only the pages you need.
Evaluate on your real, messy documents scored field by field; clean samples lie.
Where to go next
Multimodal extraction is the LLM API plus images plus structured output, so the two skills underneath it are the ones to solidify next. Then take it into the AI Engineer path to build it into a real product.