Building Applications That See, Hear, and Understand
Most real-world information exists in images, PDFs, audio, and video — not clean text. A text-only AI assistant cannot read invoices, analyze charts, describe product images, or transcribe meeting recordings.
Your AI cannot process the formats your users actually work with. Users manually extract text from images and PDFs before they can use your AI — a friction point that kills adoption for document-heavy workflows.
Your AI reads invoices directly from image uploads, extracts structured data from PDF forms, describes product images for accessibility, and generates images from natural language descriptions. The AI meets users in their native format.
In early 2024, a logistics company's operations team was manually processing 3,000 freight invoices per week. Each invoice was a PDF scan — sometimes a photo taken on a phone. The invoices came from 400+ carriers, each with a different format. The company had already built an AI system for their text-based communications. But invoices? The AI could not see them.
The solution seemed obvious: use OCR first, then pass the text to the language model. The team built this pipeline. Accuracy: 74%. For invoices, 74% accuracy meant 780 invoices per week requiring manual review — almost as many as before. OCR failed on handwritten notes, low-resolution photos, tables with merged cells, and non-standard fonts.
GPT-4o with vision changed this. The model could directly analyze the invoice image, understand the layout, extract the table structure, and interpret the handwritten annotations — without an OCR preprocessing step. Accuracy jumped to 91%. The remaining 9% required review — 270 invoices per week vs. 780 before, and the failures were concentrated in consistently bad-quality images that even humans found difficult.
The insight is not that vision models replace OCR — it is that native multimodal understanding of documents is qualitatively different from text extraction followed by language model processing. The model understands the image as a whole, not as a sequence of extracted strings.
Multimodal AI is not about adding an image upload feature. It is about understanding that information has modality — documents, images, audio, and video each have structure that matters — and building systems that work with that structure natively.
The first engineering intuition about multimodal AI: "just extract the text and use your existing language model pipeline." This works for some cases and fails catastrophically for others. Understanding why tells you when native multimodal is necessary.
Cases where text extraction works fine:
- Digitally created PDFs (not scans) with clean text layers
- Structured documents with consistent formats
- Pure text content with no meaningful visual layout
Cases where native multimodal is necessary:
- Scanned documents where OCR accuracy is below 90%
- Documents where layout carries information (tables, forms, charts)
- Images where the visual content is the information (photos, diagrams, screenshots)
- Audio where tone or context is meaningful beyond the transcript
- Video where temporal visual information matters
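These criteria can be encoded as a routing decision at ingestion time. A minimal sketch, assuming hypothetical per-document signals (`has_text_layer`, `ocr_confidence`, `layout_carries_info`) and the 90% OCR threshold from the list above:

```python
from dataclasses import dataclass

@dataclass
class DocumentProfile:
    """Hypothetical per-document signals gathered at ingestion time."""
    has_text_layer: bool        # digitally created PDF vs. scan
    ocr_confidence: float       # 0.0-1.0, from a cheap first-pass OCR
    layout_carries_info: bool   # tables, forms, charts, diagrams

def choose_pipeline(doc: DocumentProfile) -> str:
    """Route a document to text extraction or native multimodal processing."""
    # Digitally created PDFs with clean text layers and no meaningful
    # layout: extraction is cheap and lossless.
    if doc.has_text_layer and not doc.layout_carries_info:
        return "text-extraction"
    # Scans below ~90% OCR accuracy, or layout that carries information:
    # native multimodal is necessary.
    if doc.ocr_confidence < 0.90 or doc.layout_carries_info:
        return "native-multimodal"
    return "text-extraction"
```

Routing like this reserves the costlier vision calls for the documents that actually need them.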
The deeper reason multimodal models outperform extraction pipelines: they were trained on image-text pairs at enormous scale, learning to understand visual content in context. When GPT-4o sees a freight invoice, it has implicitly learned the layout conventions of invoices from training data — it understands that the number in the upper right of a specific box is the invoice number, even if there is no text label next to it.
The embedding space insight: Multimodal models project images, text, and sometimes audio into a shared embedding space where semantically related content is close together regardless of modality. This is how CLIP (Contrastive Language-Image Pretraining) works — a photo of a dog and the text "a dog playing in the park" have similar embeddings. This shared space enables cross-modal search: find images that match a text description, find text that describes a given image, without explicit annotation of the relationship.
The engineering implication: if you are building a search system over a mixed corpus (documents, images, data), native multimodal embeddings often outperform any extraction-then-embed pipeline because they preserve the modality-specific context that extraction destroys.
Multimodal AI spans from simple image description to complex cross-modal reasoning and generation pipelines.
```python
import base64
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel

client = OpenAI()

# Level 1: Document understanding with vision models.
# Direct image -> structured data: no OCR preprocessing step needed.

# Pydantic model for structured output: forces the model to extract
# data into specific typed fields.
class InvoiceData(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str
    line_items: list[dict]  # [{description, qty, unit_price, total}]
    subtotal: float
    tax_amount: float
    total_amount: float
    payment_terms: str | None

def extract_invoice_data(image_path: str) -> InvoiceData:
    """Extract structured data from an invoice image."""
    # base64 encoding inlines the image data in the API call.
    # For large images, use URL references instead.
    image_bytes = Path(image_path).read_bytes()
    image_b64 = base64.b64encode(image_bytes).decode()

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}",
                            # "high" detail for documents with small text or
                            # dense tables; "low" suffices for thumbnails and charts.
                            "detail": "high",
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all invoice data from this image. Be precise with numbers.",
                    },
                ],
            }
        ],
        # Structured output: response_format=InvoiceData guarantees
        # parseable JSON with correct types.
        response_format=InvoiceData,
    )
    return response.choices[0].message.parsed
```
```python
import clip
import torch
import numpy as np
from PIL import Image

# Level 2: Cross-modal semantic search with CLIP embeddings.
# CLIP is a joint image-text embedding model: text and images share
# the same vector space, enabling bidirectional search.

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(image_path: str) -> np.ndarray:
    """Create a CLIP embedding for an image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
    return embedding.cpu().numpy().flatten()

def embed_text(text: str) -> np.ndarray:
    """Create a CLIP embedding for text, in the SAME space as images,
    which is what enables cross-modal search."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        embedding = model.encode_text(tokens)
    return embedding.cpu().numpy().flatten()

def cross_modal_search(
    query: str,      # either a text query or an image path
    index: dict,     # {image_path: embedding}
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Search images by text or by image using CLIP embeddings.
    No explicit annotation links images to text: the shared space
    enables zero-shot cross-modal search."""
    # The query can be text or an image path; the search works
    # bidirectionally without special handling.
    if query.endswith((".jpg", ".png", ".webp")):
        query_embedding = embed_image(query)
    else:
        query_embedding = embed_text(query)

    # Cosine similarity against all indexed embeddings
    similarities = []
    for path, embedding in index.items():
        cos_sim = np.dot(query_embedding, embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(embedding)
        )
        similarities.append((path, float(cos_sim)))

    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

# Example: search a product image catalog by text description
# results = cross_modal_search("red running shoes with white sole", image_index)
# Result: the top 5 images most semantically matching the description
```
```python
import json
from openai import OpenAI

client = OpenAI()

# Level 3: Multimodal generation pipeline.
# Generate images from text, edit images with text instructions, and
# QA the results: each step uses a different multimodal capability.

def generate_product_image(
    product_description: str,
    style: str = "clean white background, professional product photography",
    size: str = "1024x1024",
) -> str:
    """Generate a product image from description text."""
    prompt = f"{product_description}. Style: {style}"

    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        # "hd" costs roughly twice as much but produces significantly sharper images
        quality="hd",
        style="natural",  # "natural" vs "vivid" (more saturated/dramatic)
    )
    return response.data[0].url

def edit_image_with_instruction(
    image_path: str,
    edit_instruction: str,
    mask_path: str | None = None,
) -> str:
    """Edit an existing image based on a natural language instruction."""
    # Inpainting requires a mask specifying which pixels may change
    # (transparent regions = editable).
    response = client.images.edit(
        model="dall-e-2",  # DALL-E 2 supports edits; DALL-E 3 does not yet
        image=open(image_path, "rb"),
        mask=open(mask_path, "rb") if mask_path else None,
        prompt=edit_instruction,
        size="1024x1024",
        n=1,
    )
    return response.data[0].url

def vision_quality_check(image_url: str, quality_criteria: str) -> dict:
    """Vision QA loop: use GPT-4o to verify a generated image meets
    the product criteria before serving it to users."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {
                    "type": "text",
                    "text": (
                        f"Does this image meet these criteria? {quality_criteria} "
                        'Reply with JSON: {"passed": bool, "issues": [str]}'
                    ),
                },
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Multimodal capabilities unlock fundamentally different product possibilities. Here are three production patterns with distinct engineering challenges.
Pattern 1: Document Intelligence (Image → Structured Data) A healthcare company processes insurance prior authorization forms — 500 per day, each a scanned PDF with handwritten notes and checked boxes. OCR achieves 82% field accuracy. GPT-4o with vision achieves 94%. The critical engineering challenge: validating extracted data. Solution: extract once with GPT-4o, then use a deterministic validation layer (check that medication codes are valid ICD-10, that quantities are within reasonable ranges, that dates are in the correct format). The vision model extracts; the business logic layer validates. Human review queue handles the 6% that fail validation. This reduces manual review from 500 to 30 forms/day.
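The extract-then-validate split can be sketched as a deterministic checker that returns a list of issues; any non-empty result routes the form to the human review queue. The field names, ranges, and the simplified ICD-10 shape check below are illustrative, not a real claims schema:

```python
import re
from datetime import datetime

def validate_extraction(fields: dict) -> list[str]:
    """Deterministic validation of vision-extracted form fields.
    Returns a list of issues; an empty list means the form passes."""
    issues = []

    # Format validation: dates must be parseable
    try:
        datetime.strptime(fields.get("service_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("service_date missing or not in YYYY-MM-DD format")

    # Range check: quantities within plausible bounds
    qty = fields.get("quantity", 0)
    if not 1 <= qty <= 365:
        issues.append(f"quantity {qty} outside plausible range")

    # Code validity: simplified ICD-10 shape (letter, two digits, optional
    # decimal part); a real system would check against the actual code list
    code = fields.get("diagnosis_code", "")
    if not re.fullmatch(r"[A-Z]\d{2}(\.\d{1,4})?", code):
        issues.append(f"diagnosis_code {code!r} fails ICD-10 format check")

    return issues

# Forms with any issues go to the human review queue; clean forms flow through.
```

The point of the design: the vision model is probabilistic, so nothing it extracts reaches the business system without passing deterministic checks.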
Pattern 2: Product Image Search (Image ↔ Text Cross-Modal) An e-commerce platform has 10 million product images. Users want to search by image ("find things that look like this") or by description ("minimalist white desk lamp"). CLIP embeddings enable both: embed all images and their text descriptions into a shared vector space. Queries can be text (embed the query text, find the nearest image embeddings) or image (embed the query image, find the nearest image/text embeddings). The engineering challenge: index freshness — 50,000 new products per day require re-embedding. Solution: a streaming embedding pipeline with Kafka, so products are embedded within 15 minutes of upload. Image search goes from non-existent to driving a 32% conversion-rate improvement for visual product categories.
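A streaming embedding worker along those lines might look like the following sketch. The event fields are hypothetical, and the message source is typed as any iterable so that in production it can be a Kafka consumer while the core loop stays testable:

```python
import json
from typing import Callable, Iterable

def embedding_worker(
    messages: Iterable[bytes],
    embed_fn: Callable[[str], list[float]],
    upsert_fn: Callable[[str, list[float]], None],
) -> int:
    """Consume product-upload events and upsert fresh embeddings.
    In production the iterable would be a Kafka consumer subscribed to
    a product-uploads topic (hypothetical name)."""
    processed = 0
    for raw in messages:
        event = json.loads(raw)
        # Embed the new product image so it becomes searchable within
        # minutes of upload rather than on a nightly batch.
        vector = embed_fn(event["image_url"])
        upsert_fn(event["product_id"], vector)
        processed += 1
    return processed
```

Upserting keyed by `product_id` makes message re-delivery safe under Kafka's at-least-once semantics: a duplicate event just rewrites the same vector.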
Pattern 3: Audio Transcription with Context (Audio → Enriched Text) A legal firm transcribes depositions — 4-8 hours of audio, multiple speakers, legal terminology. Whisper (OpenAI's speech model) achieves 95% word accuracy on clean audio. The challenge: speaker diarization (who said what?) and domain vocabulary (legal terms, case names, acronyms). Production solution: Whisper for transcription + GPT-4o for post-processing with the case file as context. GPT-4o corrects domain vocabulary using case file context, adds speaker labels using deposition structure knowledge, and flags legally significant statements for attorney review. Turnaround time drops from 3 days (human transcription) to 2 hours (automated) with equivalent accuracy.
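A two-stage sketch of that pipeline using OpenAI's Whisper and chat APIs; the prompt wording and helper names are illustrative, not the firm's actual system:

```python
def build_correction_prompt(transcript: str, case_context: str) -> str:
    """Assemble the post-processing prompt: supplying the case file as
    context is what lets the language model correct domain vocabulary
    (legal terms, case names, acronyms) and label speakers."""
    return (
        "You are post-processing a deposition transcript.\n"
        f"Case file context:\n{case_context}\n\n"
        "Correct legal terminology, case names, and acronyms using the context; "
        "add speaker labels based on deposition structure; flag legally "
        "significant statements for attorney review.\n\n"
        f"Transcript:\n{transcript}"
    )

def transcribe_with_context(audio_path: str, case_context: str) -> str:
    """Stage 1: Whisper transcription. Stage 2: GPT-4o context-aware cleanup."""
    from openai import OpenAI  # imported here so the pure helper above stays importable
    client = OpenAI()
    with open(audio_path, "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        ).text
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": build_correction_prompt(transcript, case_context)}],
    )
    return response.choices[0].message.content
```

For multi-hour depositions, the second stage would run per chunk with overlapping context rather than on the full transcript at once.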
Three multimodal failure patterns that catch teams who treat vision as just another input type.
Senior engineers ask: "How do we add image support?" Principal engineers ask: "What modalities does the information we work with actually exist in, and how do we build a system that works natively with each?"
The principal insight is that modality is a first-class concern in information architecture. A document processing system that treats PDFs as text-extractable containers will fail on 20-30% of real-world documents. A product catalog that cannot be searched by image will miss the 60% of purchase intents that start with a visual reference. Modality is not an edge case — it is how information exists in the world.
The frontier in 2025: video understanding is the next major multimodal frontier. GPT-4o with video understanding can analyze meeting recordings, product demos, and surveillance footage. The engineering challenge is temporal reasoning — understanding what happened across a 30-minute recording, not just analyzing individual frames. Models that can reason across temporal sequences of images will unlock entirely new categories of applications in content moderation, security, and enterprise workflow automation.
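Until long-video input is routine, a common workaround is frame sampling: pick evenly spaced timestamps across the recording, extract stills, and send them to the vision model as an ordered sequence. A sketch; the sampling helper is pure arithmetic, and frame extraction assumes OpenCV is available:

```python
def sample_timestamps(duration_s: float, max_frames: int = 20) -> list[float]:
    """Pick evenly spaced timestamps (midpoint of each interval) so a
    long recording fits into the model's context as a frame sequence."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * i + step / 2, 2) for i in range(max_frames)]

def extract_frames(video_path: str, timestamps: list[float]) -> list[bytes]:
    """Grab JPEG-encoded frames at the given timestamps."""
    import cv2  # lazy import: only needed when actually decoding video
    frames = []
    cap = cv2.VideoCapture(video_path)
    for t in timestamps:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if ok:
            encoded, buf = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buf.tobytes())
    cap.release()
    return frames
```

Uniform sampling is the crude baseline; scene-change detection or audio-activity-driven sampling concentrates frames where something actually happens.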
The five-year arc: multimodal models will become the default rather than the exception. The distinction between "AI that can see" and "AI" will dissolve as vision capability becomes as standard as text understanding. The engineering challenge will shift from "can we use vision models" to "how do we build reliable extraction, validation, and generation pipelines for the modalities our business requires."
Common questions:
Strong answer:
- Immediately asks about document quality distribution and production vs. test set representativeness
- Understands CLIP's shared embedding space and can design cross-modal search architecture
- Knows the cost structure of vision API calls (tile-based pricing, detail levels) and can estimate production costs
- Designs extraction + validation pipelines where the vision model extracts and deterministic logic validates
Red flags:
- Treats multimodal as "just add an image to the prompt" without understanding cost, accuracy, and validation requirements
- Assumes vision model accuracy is equivalent to text model accuracy without understanding quality sensitivity
- No mention of validation layers for vision extraction — trusts raw model output for business-critical data
- Uses academic image classification accuracy (ImageNet) as a proxy for document extraction accuracy
Scenario · An accounts payable automation startup
Your company processes 10,000 supplier invoices per month. 80% are PDF scans of varying quality. You need to extract: vendor name, invoice number, date, line items, and total. Current process: manual entry by 5 AP clerks taking 15 minutes per invoice.
You need to build the extraction pipeline. You have three options for the core extraction: traditional OCR + text NLP, GPT-4o vision, or a specialized invoice extraction API (e.g., AWS Textract). Each has tradeoffs. Your priority is accuracy (>95% field-level accuracy) with acceptable cost (<$0.05/invoice).
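For the GPT-4o option, the cost side of that constraint can be sanity-checked with the tile-based token formula OpenAI has documented for vision input (85 base tokens plus 170 per 512 px tile after rescaling). The sketch below is an estimate, and the per-token price is deliberately a parameter; check current pricing rather than hard-coding it:

```python
import math

def vision_input_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens: rescale to fit 2048x2048, shrink the
    shortest side to 768 if larger, then charge 85 base tokens plus 170
    per 512x512 tile. Treat the result as an estimate, not billing truth."""
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

def cost_per_invoice(width: int, height: int,
                     usd_per_million_input_tokens: float) -> float:
    """Back-of-envelope image cost for one invoice page
    (prompt and output tokens excluded)."""
    return vision_input_tokens(width, height) * usd_per_million_input_tokens / 1e6
```

A typical 1700x2200 px scan works out to a few hundred image tokens at high detail, which at recent GPT-4o input prices sits well inside the $0.05/invoice budget; the accuracy question, not cost, is usually the deciding factor.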
Which extraction approach do you choose?
Key takeaways
Why does native multimodal extraction outperform OCR + NLP for document understanding?
Vision models understand document layout, visual structure, and context holistically. OCR extracts text sequentially, destroying spatial relationships (which column a number is in, which label corresponds to which field). Vision models trained on document-text pairs learn layout conventions implicitly.
How does CLIP enable searching images with text without explicit image annotation?
CLIP trains on 400M (image, text) pairs from the internet, creating a joint embedding space where semantically related images and text have similar embeddings. Text query "red shoes" and an image of red shoes map to nearby vectors. No explicit annotation needed — similarity is learned from the training distribution.
What validation layers are required for production vision extraction of financial data?
Range checks (is the extracted amount within a plausible range for the document type?), cross-field consistency (does total = subtotal + tax?), format validation (are dates parseable, are codes valid?), and confidence thresholding (route low-confidence extractions to human review). Never trust raw vision extraction for financial data without all four.
From the books
Learning Multimodal AI — Chip Huyen (2024)
Huyen's key insight on multimodal systems: "Modality is not just an input format — it is an information structure. The visual layout of a document, the spatial relationship of objects in an image, the temporal sequence of audio — these are all information that exists in the modality and is destroyed by extraction to text. Multimodal models are valuable precisely because they preserve this structure."
Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al., OpenAI (2021)
CLIP's key contribution: training on 400 million (image, text) pairs from the internet creates a joint embedding space where image-text similarity is semantically meaningful without task-specific fine-tuning. The zero-shot transfer capability — searching images with text without explicit annotation — emerges from the scale and diversity of the training distribution.
GPT-4 Technical Report — OpenAI (2023)
The GPT-4 multimodal section demonstrates that visual understanding emerges from the same training paradigm as language understanding — not from specialized vision architectures. The implication: as language models scale, their visual understanding improves as a byproduct, suggesting that the gap between text-only and multimodal capability will narrow as foundation models improve.