AI Engineering · 13 min read · Jun 2026

LLM Cost & Latency Optimization

Your demo worked, but the bill and the p95 latency did not. A practical playbook for making LLM apps cheap and fast enough to ship: budgeting, caching, model routing, and streaming.

AI · LLM · Cost · Latency

Sri Balaji

Founder · TheSimplifiedTech

Your demo worked. The bill and the p95 did not.

The demo was magic. You wired up a model, pasted in a fat system prompt, stuffed the whole knowledge base into the context, and it answered beautifully. Then you shipped it to real traffic. Two things broke at once: the monthly bill climbed past what the feature is worth, and the p95 latency, the slow-tail experience your impatient users actually feel, crept toward six, eight, ten seconds. Suddenly the magic feels like a liability.

Here is the good news: almost none of this requires a smarter model. It requires *engineering*. The same disciplines you already apply to databases and APIs (caching, routing, trimming, batching) map cleanly onto LLM calls. This article is the playbook: the levers that make an LLM app cheap and fast enough to keep in production, in the order you should reach for them.

Who this is for

Engineers who have an LLM feature *working* and now need it to be **affordable and responsive** at scale. You should be comfortable reading Python and thinking about percentiles and per-request cost. No ML background is required; this is systems work, not model training. New to how these models tick? Start with [How LLMs Actually Work](/blog/how-llms-actually-work).

The principle: don't send the token in the first place

The cheapest, fastest token is the one you never send.
The whole article in one line

Every optimization here is a variation on that idea. You pay, in dollars *and* in milliseconds, roughly in proportion to the tokens that flow in and out of the model. Input tokens cost money to send and add to the time-to-first-token. Output tokens cost more per unit *and* are generated one at a time, so they dominate latency. So the entire game is: send fewer tokens, generate fewer tokens, and avoid the call entirely when you can.
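To make that proportionality concrete, here is a back-of-the-envelope estimator. It is a sketch under loud assumptions: the `ModelProfile` fields and every number in `HYPOTHETICAL` are invented for illustration (not any provider's real rates), and the model ignores network and queueing time entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Illustrative numbers only; substitute your provider's real rates."""
    usd_per_1m_input: float      # price per million input tokens
    usd_per_1m_output: float     # price per million output tokens
    prefill_tokens_per_s: float  # how fast the prompt is processed
    decode_tokens_per_s: float   # how fast output tokens are generated


def estimate(profile: ModelProfile, input_tokens: int, output_tokens: int):
    """Return (cost in USD, time-to-first-token in s, total time in s)."""
    cost = (input_tokens * profile.usd_per_1m_input
            + output_tokens * profile.usd_per_1m_output) / 1_000_000
    # Input tokens are processed before the first output token appears.
    ttft = input_tokens / profile.prefill_tokens_per_s
    # Output tokens are generated one at a time after that.
    total = ttft + output_tokens / profile.decode_tokens_per_s
    return cost, ttft, total


# Made-up profile: output priced higher than input and generated far more slowly.
HYPOTHETICAL = ModelProfile(
    usd_per_1m_input=2.0,
    usd_per_1m_output=8.0,
    prefill_tokens_per_s=5000.0,
    decode_tokens_per_s=50.0,
)

fat = estimate(HYPOTHETICAL, input_tokens=20_000, output_tokens=500)
lean = estimate(HYPOTHETICAL, input_tokens=2_000, output_tokens=150)
```

With these placeholder numbers, the fat request comes to about $0.044 and 14 seconds, while the lean one comes to about $0.005 and 3.4 seconds. The absolute values mean nothing; the point is that both axes move with token counts, and that the output side carries most of the wall-clock time.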

Crucially, optimizing cost and optimizing latency are not always the same move. Some levers buy you both (a cache hit is free *and* instant). Some buy one at the expense of the other (batching is cheaper but slower per request). Knowing which is which is what lets you tune deliberately instead of flailing.

| Analogy | LLM equivalent |
|---|---|
| A pricey consultant who bills by the word | The LLM: every input and output token costs money and time |
| Checking your notes before booking the meeting | A cache lookup before calling the model at all |
| Asking the junior analyst the easy questions first | Routing simple requests to a small, cheap model |
| Sending a one-page brief instead of the whole filing cabinet | Trimming the prompt and context to what's relevant |
| The consultant talking while they think, not after | Streaming tokens so the user sees output immediately |

Treat the model like an expensive specialist consultant, not a search box you spam.

The picture: a request's journey to the cheapest answer

Before reaching for any single model, a well-built LLM service runs each request through a funnel. Every stage is a chance to answer *cheaper* or *not at all*. Only the requests that survive every filter reach the expensive large model.

Figure: the request funnel. Client (user request) → Semantic Cache (embedding similarity). On a cache hit → Response. On a miss → Router (classify difficulty) → easy: Small Model (cheap & fast); hard: Large Model (escalation only). Either model streams tokens to the Response, and the answer is written back to the cache.

A request flows through a semantic cache and a router before ever touching a model; streaming carries the answer back token-by-token.

  1. **Check the semantic cache.** Embed the incoming request and look for a near-identical question answered before. On a hit, return the stored answer immediately: no generation tokens, and latency bounded by the embedding and lookup rather than a model call.
  2. **Route on a cache miss.** A lightweight classifier (rules, a tiny model, or a heuristic) decides whether this is an easy request or a hard one. In many products, most traffic is easy.
  3. **Try the small model first.** Send easy requests to a small, cheap, fast model. It handles the long tail of simple lookups, classifications, and short rewrites for a fraction of the cost.
  4. **Escalate only when needed.** Hard requests, or small-model answers that fail a confidence or validation check, escalate to the large model. You pay the premium only for the requests that truly need it.
  5. **Stream the answer back.** Whichever model answers, stream tokens to the client as they generate. Total time is unchanged, but perceived latency collapses because the user sees words immediately.
  6. **Write back to the cache.** Store the answer keyed by the request embedding so the next similar question skips the whole funnel.
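The cache lookup and write-back steps above can be sketched in a few lines. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the `0.8` threshold is arbitrary and would need tuning on your own traffic.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8, embed=toy_embed):
        self.threshold = threshold  # below this, treat the lookup as a miss
        self.embed = embed
        self.entries: list[tuple[Counter, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the stored answer for the closest question, if close enough."""
        query = self.embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:  # linear scan; use an index at scale
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))


def cached_answer(question: str, cache: SemanticCache,
                  call_model: Callable[[str], str]) -> str:
    """Check the cache first; on a miss, call the model and write back."""
    hit = cache.get(question)
    if hit is not None:
        return hit
    answer = call_model(question)
    cache.put(question, answer)
    return answer
```

The threshold is the whole risk surface: set it too low and users get a confident answer to a different question, set it too high and the cache never hits. Log the similarity score of every hit so you can audit the borderline ones.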

The levers, and what each one actually buys you

These are the seven moves worth knowing, ranked roughly by impact-per-effort. The honest framing matters: each lever has a trade-off, and stacking them blindly can hurt. Read the table as a menu, not a checklist.

| Lever | Saves cost? | Saves latency? | Trade-off |
|---|---|---|---|
| Exact / prompt caching | Yes, cached input is heavily discounted | Yes, skips reprocessing | Only helps on repeated prefixes; needs stable prompt ordering |
| Semantic caching | Yes, full call avoided on hit | Yes, instant on hit | Risk of serving a stale or subtly-wrong match; tune the threshold |
| Model routing (small first) | Yes, most traffic on cheap model | Yes, small models respond faster | Routing mistakes send hard queries to a weak model; needs a fallback |
| Prompt / context trimming | Yes, fewer input tokens | Yes, less to process | Trim too aggressively and you cut the context the answer needed |
| Batching requests | Yes, better throughput per dollar | No, adds queueing delay | Wrong for interactive UX; great for offline / bulk jobs |
| Smaller model outright | Yes, lower per-token price | Yes, faster generation | Quality drop on complex tasks; verify on your eval set |
| Streaming | No, same tokens billed | Perceived only, same total time | More complex client code; partial output can mislead if it errors mid-stream |
Pick levers by what you're optimizing (cost, latency, or both), and budget for the trade-off.
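The "stable prompt ordering" caveat on prompt caching is easiest to see side by side. This sketch assumes a prefix-matched cache, where only the leading portion shared with an earlier request can be reused; the prompt strings and helper names are hypothetical, and the overlap is measured in characters as a rough proxy for tokens.

```python
import os

# Hypothetical static content that is identical on every request.
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer briefly."
POLICY_DOC = "Refunds are accepted within 30 days of purchase."


def build_prompt_stable(question: str, user_name: str) -> str:
    """Static content first, per-request content last."""
    return (f"{SYSTEM_PROMPT}\n\n{POLICY_DOC}\n\n"
            f"User: {user_name}\nQuestion: {question}")


def build_prompt_unstable(question: str, user_name: str) -> str:
    """Per-request content first: the shared prefix ends almost immediately."""
    return (f"User: {user_name}\nQuestion: {question}\n\n"
            f"{SYSTEM_PROMPT}\n\n{POLICY_DOC}")


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the leading run of characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


stable_overlap = shared_prefix_chars(
    build_prompt_stable("Where is my order?", "Ana"),
    build_prompt_stable("Can I get a refund?", "Bo"),
)
unstable_overlap = shared_prefix_chars(
    build_prompt_unstable("Where is my order?", "Ana"),
    build_prompt_unstable("Can I get a refund?", "Bo"),
)
```

Same content, same token count, very different reuse: the stable layout shares the entire system prompt and policy text, while the unstable one shares only the literal `User: ` label. Check your provider's documentation for how its cache actually matches and what minimum prefix length it requires.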

Pro tip

Reach for the levers that buy **both** axes first: caching, routing, trimming, a smaller model. Streaming and batching are special-purpose: streaming fixes *perceived* latency for interactive apps; batching trades latency for cost in offline pipelines.
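To see why streaming changes only *perceived* latency, measure time-to-first-token separately from total time. This is a sketch, not a client: `fake_token_stream` is a hypothetical stand-in for a provider's streaming response, and the `clock` and `sleep` parameters are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(tokens: Iterable[str], delay_s: float,
                      sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response: one token per delay."""
    for token in tokens:
        sleep(delay_s)
        yield token


def consume_stream(stream: Iterable[str],
                   on_token: Callable[[str], None],
                   clock: Callable[[], float] = time.perf_counter):
    """Forward tokens as they arrive; return (text, time-to-first-token, total)."""
    start = clock()
    first_token_s = None
    parts = []
    for token in stream:
        if first_token_s is None:
            first_token_s = clock() - start  # what the user actually feels
        on_token(token)  # e.g. write to the socket or UI right away
        parts.append(token)
    return "".join(parts), first_token_s, clock() - start
```

A typical call would be `consume_stream(fake_token_stream(["Hel", "lo"], 0.05), print)`. Total time equals the non-streaming case; the difference is that the user starts reading after the first delay instead of after the last. Track time-to-first-token as its own metric, because total latency will not show the improvement.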

A minimal router and token-budget trimmer

Here is a stripped-down version of the funnel as code: a token-budget trimmer that keeps context under a hard cap, and a router that tries a small model first and escalates only when the answer looks weak. The point is the *shape*; swap in your own client, models, and validation.

router.py

```python
import tiktoken

# Any tiktoken encoding works here; pick the one that matches your model.
enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    """Number of tokens in `text` under the chosen encoding."""
    return len(enc.encode(text))

def trim_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep the most relevant chunks first; drop the rest once the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive pre-sorted by relevance
        cost = token_len(chunk)
        if used + cost > budget:
            break  # the cheapest token is the one you never send
        kept.append(chunk)
        used += cost
    return kept

# Placeholder model names; substitute the ones your provider offers.
SMALL, LARGE = "small-fast-model", "large-strong-model"

def looks_weak(answer: str) -> bool:
    """Cheap heuristics to decide if we must escalate."""
    flags = ("i'm not sure", "i cannot", "as an ai")
    text = answer.strip().lower()  # ignore padding whitespace and casing
    return len(text) < 20 or any(f in text for f in flags)

def answer(question: str, context_chunks: list[str], client) -> str:
    # `client` stands in for your own SDK wrapper: any object exposing
    # complete(model=..., prompt=..., max_tokens=...) -> str.
    context = "\n\n".join(trim_to_budget(context_chunks, budget=1500))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 1) try the small model first
    draft = client.complete(model=SMALL, prompt=prompt, max_tokens=300)

    # 2) escalate only if the cheap answer is weak
    if looks_weak(draft):
        return client.complete(model=LARGE, prompt=prompt, max_tokens=300)
    return draft
```

Two details earn their keep. First, `trim_to_budget` enforces a hard input ceiling. Without it, a few oversized retrieved chunks can silently double your bill on every call. Second, `looks_weak` is deliberately crude; in production you'd validate against the task (did it return valid JSON? did it cite a source?) rather than sniffing for phrases. The architecture has the same shape as what you'd build for an AI agent: a cheap default path with a deliberate escalation rule.
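Task-based validation of the kind mentioned above might look like the following. It is a minimal sketch with invented conventions: the required JSON keys, the `[n]` citation format, and the helper names are all assumptions for illustration, not part of any library or of the router shown earlier.

```python
import json
import re
from typing import Callable, Iterable


def valid_json_with_keys(answer: str, required_keys: set[str]) -> bool:
    """True if the answer parses as a JSON object containing every required key."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


# Assumed citation convention: bracketed 1-based source numbers such as [2].
CITATION = re.compile(r"\[(\d+)\]")


def cites_known_source(answer: str, num_sources: int) -> bool:
    """True if the answer cites at least one source and every cited number exists."""
    cited = [int(n) for n in CITATION.findall(answer)]
    return bool(cited) and all(1 <= n <= num_sources for n in cited)


def needs_escalation(answer: str,
                     validators: Iterable[Callable[[str], bool]]) -> bool:
    """Escalate when any task-specific check fails."""
    return not all(check(answer) for check in validators)
```

Wiring it in means replacing the phrase-sniffing check with something like `needs_escalation(draft, [lambda a: cites_known_source(a, num_sources=3)])`. The design choice is that each check is tied to what the task requires, so a confident but malformed answer still escalates.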

Common mistakes that quietly drain the budget

  1. **Always using the biggest model.** The flagship is the most expensive and the slowest. Much production traffic is easy and a small model handles it fine; you just never measured the split. Route first; reserve the big model for the requests that earn it.
  2. **No caching of any kind.** Real traffic is repetitive: the same questions, the same system prompt, the same retrieved chunks. Without exact, prompt, or semantic caching you re-pay for identical work all day. Caching is usually the single highest-leverage change you can ship.
  3. **Ignoring output tokens.** Teams obsess over shrinking the prompt and forget that output tokens cost more *and* dominate latency. A `max_tokens` cap and a prompt that says "answer in two sentences" often save more than any input trimming.
  4. **Letting context grow unbounded.** Stuffing the whole knowledge base "just in case" inflates every single call. Retrieve and trim to a token budget; relevance beats volume.
  5. **Optimizing latency you can't feel.** Chasing total generation time when the real fix is *perceived* latency. Stream the response and the same model feels far faster, even though total time is unchanged.
  6. **Tuning blind.** Shipping levers without per-request cost and p50/p95 latency dashboards. If you can't see the bill and the tail, you can't tell whether a change helped or hurt.
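The last mistake is the cheapest to fix. Here is a minimal in-memory sketch of the numbers worth recording per request; the `RequestLog` class, its field names, and the route labels (`"cache"`, `"small"`, `"large"`) are assumptions for illustration, and a real service would ship these to its metrics system instead of keeping lists.

```python
import math
from dataclasses import dataclass, field


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class RequestLog:
    latencies_s: list[float] = field(default_factory=list)
    costs_usd: list[float] = field(default_factory=list)
    routes: list[str] = field(default_factory=list)  # "cache", "small", or "large"

    def record(self, latency_s: float, cost_usd: float, route: str) -> None:
        self.latencies_s.append(latency_s)
        self.costs_usd.append(cost_usd)
        self.routes.append(route)

    def summary(self) -> dict:
        n = len(self.routes)
        if n == 0:
            raise ValueError("no requests recorded")
        return {
            "requests": n,
            "p50_s": percentile(self.latencies_s, 50),
            "p95_s": percentile(self.latencies_s, 95),
            "avg_cost_usd": sum(self.costs_usd) / n,
            "cache_hit_rate": self.routes.count("cache") / n,
            "large_share": self.routes.count("large") / n,
        }
```

Call `record` once per request at the point where the funnel returns, then compare `summary()` before and after each lever you ship. Watching `p95_s` alongside `p50_s` matters because a lever can improve the median while leaving the tail untouched.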

Takeaways

The whole article in seven lines

  • The cheapest, fastest token is the one you never send; that single idea drives every lever.
  • You pay in dollars and milliseconds roughly in proportion to input and output tokens.
  • Run requests through a funnel: semantic cache, then router, then small model, then large model only on escalation.
  • Caching, routing, trimming, and a smaller model save **both** cost and latency; reach for these first.
  • Batching trades latency for cost (offline only); streaming bills the same tokens but transforms *perceived* latency.
  • Output tokens cost more and dominate latency; cap them and ask for shorter answers.
  • You cannot optimize what you cannot see: instrument per-request cost and p50/p95 before tuning.

Where to go next

Cost and latency optimization is the final discipline that turns an LLM demo into a shippable product. The strongest next step is to wire a real dashboard that tracks per-request cost, cache hit rate, and the small-vs-large routing split, so every lever you add is measured, not guessed.
