Context Engineering & Prompt Caching

On this page

The trap of the giant prompt
The principle: context is a budget
The picture: prefix, retrieval, task
What belongs where
Structuring a prompt for cache reuse
Context rot and lost-in-the-middle
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

With million-token windows, the skill is curating what goes in, not clever wording. Treat context as a budget, order it so prompt caching can deliver up to 5–10x wins on cost and latency, and design around context rot.

The trap of the giant prompt

The context window grew, so you fed it: the whole knowledge base, the full chat history, every tool result, three example documents "just in case." The window said 1,000,000 tokens, so surely more is better. Then the bill arrived, the latency doubled, and (the part nobody warns you about) the answers got worse. The model started ignoring the one paragraph that mattered and confidently citing a stale fact you stuffed in at position 400,000.

This is the counterintuitive truth of modern LLM work: stuffing everything into a huge prompt makes it slower, pricier, and dumber. A bigger window is not a license to dump; it is a bigger surface area for the model to get lost in. The skill that used to be "prompt engineering" (finding the magic phrasing) has been quietly replaced by context engineering: deciding what to put in front of the model, in what order, and what to deliberately leave out.

Who this is for

Engineers building RAG pipelines, agents, or any LLM feature where you assemble the prompt yourself. If you have ever wondered why your app is slow and expensive despite a "big enough" context window, or why quality dropped when you added more docs, this is for you. No ML background required; if you can make an API call, you can do this.

The principle: context is a budget

Context is a budget, not a bucket. Every token you spend competes for the model's attention: curate it, don't fill it.

The window size is the *limit*, not the *target*. Treat the prompt like a carry-on bag, not a moving truck: it has a size, but the goal is to pack only what the trip needs. Two costs scale with what you pack: money (you pay per input token) and attention (the model spreads a fixed budget of focus across everything you include). Add a low-value document and you have not just wasted cents; you have diluted the model's attention on the high-value sentence next to it.

Everyday analogy	Context engineering equivalent
Packing a carry-on for a trip	Choosing which docs/history/tools enter the prompt
A desk with only the files for today's task on it	A curated context: signal, no clutter
A briefing memo: summary on top, appendix at the back	Ordering: put the critical instructions where attention is highest
A reusable cover letter you tweak per job	A stable prefix that gets cached and reused across calls

Context engineering maps onto things you already manage every day.
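To make the budget idea concrete, here is a minimal sketch of packing a prompt under a token cap. Everything in it is an assumption for illustration: the Candidate class, the relevance score (which would come from your own retriever), and the rough four-characters-per-token estimate, which you would replace with a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical score in [0, 1] supplied by your retriever


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def pack_within_budget(candidates: list[Candidate], budget_tokens: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit the token budget."""
    chosen: list[Candidate] = []
    spent = 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if spent + cost > budget_tokens:
            continue  # skip anything that would overspend the budget
        chosen.append(cand)
        spent += cost
    return chosen
```

The point of the sketch is the shape of the decision: the budget is fixed first, and documents compete for it by relevance, rather than the prompt growing to fit whatever is available.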

The picture: prefix, retrieval, task

A well-engineered prompt has three layers assembled in a fixed order. The stable prefix (system instructions, tool definitions, few-shot examples) never changes between calls, so the provider can cache it. The retrieved context (the handful of documents relevant to *this* question) changes per request. The task (the user's actual question) comes last. On a cache hit, the prefix is read from the provider's cache instead of being reprocessed, which is where the 5–10x latency and cost win comes from.

Stable prefix is cached and reused; only retrieval + task are fresh each call.

1. Pin the stable prefix. System prompt, tool schemas, and few-shot examples go first and stay byte-for-byte identical across requests.
2. Mark the cache breakpoint. Tell the provider where the reusable prefix ends so it can store and reuse everything up to that point.
3. Retrieve, don't dump. Fetch only the top-k documents relevant to this question and append them after the prefix.
4. Place the task last. End with the user's actual question so the model's freshest attention lands on what you want answered.
5. Reuse on the next call. Same prefix → cache hit. You pay full price only for the retrieval + task tail.
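The steps above can be sketched as two small builders, one for the stable prefix and one for the per-request tail. This is a provider-agnostic simplification, not any vendor's API: the function names and the plain-text rendering of tools are assumptions (real SDKs usually take tool definitions as a separate parameter), and sorting tools by name is just one way to keep their order from drifting.

```python
import json


def build_prefix(system_text: str, tools: list[dict], examples: list[str]) -> str:
    """Layer 1: the stable prefix. It must render identically on every call."""
    # Sort tools by name and serialize with sorted keys so ordering never drifts.
    tool_block = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    example_block = "\n".join(examples)
    return f"{system_text}\n\nTOOLS:\n{tool_block}\n\nEXAMPLES:\n{example_block}"


def build_tail(docs: list[str], question: str) -> str:
    """Layers 2 and 3: retrieved context first, then the task last."""
    retrieved = "\n\n".join(f"<doc>{d}</doc>" for d in docs)
    return f"{retrieved}\n\nQuestion: {question}"
```

Keeping the two builders separate makes the rule easy to enforce in review: nothing request-specific is ever passed to build_prefix.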

What belongs where

The hardest decision is triage: for every piece of information, decide whether it lives *in* the context, gets *retrieved* on demand, or stays *out* entirely. The default instinct is to include everything; fight it. Here is a working rule of thumb.

Bucket	What goes here	Why
In context (stable prefix)	System instructions, role, output format, tool definitions, 2–3 gold-standard examples	Needed on every call and unchanging; ideal for caching
Retrieved per request	The top-k documents, the relevant rows, the specific file the user named, recent turns that matter	Relevance changes per question; pull only what this query needs
Leave out	The full knowledge base, entire chat history, low-relevance "just in case" docs, redundant boilerplate	Costs tokens + attention, dilutes signal, invites context rot

Triage every piece of information into one of three buckets.

Pro tip

When in doubt, leave it out and measure. It is far easier to add a document back after you see quality dip than to notice that one of fifty docs is quietly poisoning every answer.
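As a rough illustration of the "retrieved per request" bucket, the sketch below keeps only the top-k documents above a relevance floor. The word-overlap score is a deliberately toy stand-in for a real retriever (embeddings, BM25, or similar), and the function names, k, and min_score are all assumptions you would tune and measure.

```python
def overlap_score(question: str, doc: str) -> float:
    """Toy relevance: fraction of question words that also appear in the doc."""
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & d_words) / len(q_words)


def retrieve_top_k(question: str, docs: list[str], k: int = 3, min_score: float = 0.2) -> list[str]:
    """Return at most k docs, best first, dropping anything below min_score."""
    scored = [(overlap_score(question, d), d) for d in docs]
    kept = [(s, d) for s, d in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in kept[:k]]
```

The relevance floor matters as much as k: if only one document clears it, the prompt gets one document, not three padded out with "just in case" material.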

Structuring a prompt for cache reuse

Caching only works if the prefix is identical byte-for-byte across calls. Anything that changes early (a timestamp, a per-user greeting, a reordered tool list) busts the cache for everything after it. The pattern below keeps the volatile parts at the tail so the expensive prefix stays cacheable.

prompt.py

python

from anthropic import Anthropic

client = Anthropic()

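# STABLE PREFIX: identical on every call, so it can be cached.
# Note: providers enforce a minimum cacheable prompt length, so a prefix this
# short shows the structure but would need to be longer to actually be cached.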
SYSTEM = [
    {
        "type": "text",
        "text": (
            "You are a support assistant for Acme Cloud.\n"
            "Answer only from the provided context. If unsure, say so.\n"
            "Format: a one-line answer, then a short why."
        ),
        # Mark the end of the reusable prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

def answer(question: str, docs: list[str]) -> str:
    # RETRIEVED CONTEXT changes per request, append AFTER the stable prefix.
    retrieved = "\n\n".join(f"<doc>{d}</doc>" for d in docs)
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=SYSTEM,                      # cached prefix
        messages=[
            # Docs first, then the TASK last so attention lands on it.
            {"role": "user", "content": f"{retrieved}\n\nQuestion: {question}"},
        ],
    ).content[0].text

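# Placeholder retrieval result; in a real app this comes from your retriever.
retrieved_docs = ["<text of a doc about rotating API keys>"]

# A second call with the same SYSTEM can hit the cache on the prefix.
print(answer("How do I rotate an API key?", retrieved_docs))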

Two things make this work. First, the system prompt carries a cache_control breakpoint, so the provider stores everything up to it. Second, nothing volatile (no "Today is " + date, no per-user name) sits inside the cached block. For deeper cost mechanics, see LLM Cost & Latency Optimization.
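One way to catch accidental cache-busting is to fingerprint the prefix and log or assert on it. The helper below is a hypothetical sketch, not part of any SDK: it hashes a canonical JSON rendering of the system blocks, so any change to their text shows up as a new fingerprint. Normalizing key order is an assumption about what should count as "the same prefix".

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Short, stable hash of the prefix; a changed value means the prefix drifted."""
    # Canonical rendering: sorted keys so dict ordering alone never changes the hash.
    canonical = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Logging this value next to each request makes drift visible: if the fingerprint changes between calls that were supposed to share a prefix, something volatile has crept into the cached block.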

Context rot and lost-in-the-middle

Models do not read context evenly. They attend strongly to the beginning and the end of the window, and weakly to the middle: the well-documented "lost-in-the-middle" effect. Bury a critical instruction at token 300,000 of a million-token prompt and the model may simply not act on it, even though it is technically "in context."
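If you want to check whether position is hurting your own prompt, a small harness like the following can help. It is only a sketch: the function name is made up, and it just builds three variants of the same prompt with the key instruction at the start, in the middle, and at the end. You would send each variant to your model and compare answers with whatever evaluation you already use.

```python
def position_variants(instruction: str, filler_docs: list[str], question: str) -> dict[str, str]:
    """Build the same prompt three ways, moving only the key instruction."""
    mid = len(filler_docs) // 2

    def render(parts: list[str]) -> str:
        # The question always comes last so only the instruction position varies.
        return "\n\n".join(parts + [f"Question: {question}"])

    return {
        "start": render([instruction] + filler_docs),
        "middle": render(filler_docs[:mid] + [instruction] + filler_docs[mid:]),
        "end": render(filler_docs + [instruction]),
    }
```

Because everything except the instruction's position is held constant, a quality gap between the "middle" variant and the other two points at placement rather than content.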

Context rot is the slower cousin. As you keep appending (every tool result, every prior turn, every retrieved chunk), the ratio of signal to noise falls. Old, stale, or contradictory snippets accumulate, and the model starts blending them into the answer. The window is not full, but it is *polluted*. The fix is the opposite of the cause: curate. Trim history, summarize old turns, retrieve fresh, and put the must-not-miss instruction at the top or the bottom, never the middle.
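Trimming and summarizing history can be sketched in a few lines. The names here are hypothetical, and the default summarizer is a placeholder that only records how many turns were dropped; a real system would likely ask a model to write the summary. The turn format (dicts with role and content) is also an assumption.

```python
from typing import Callable, Optional


def placeholder_summary(old_turns: list[dict]) -> str:
    # Stand-in only: a real system would ask a model to summarize these turns.
    return f"[{len(old_turns)} earlier turns omitted]"


def compact_history(
    turns: list[dict],
    keep_recent: int = 4,
    summarize: Callable[[list[dict]], str] = placeholder_summary,
) -> dict:
    """Keep the last keep_recent turns verbatim and collapse the rest into a summary."""
    split = max(len(turns) - keep_recent, 0)
    old, recent = turns[:split], turns[split:]
    summary: Optional[str] = summarize(old) if old else None
    return {"summary": summary, "recent": recent}
```

Returning the summary separately from the recent turns leaves placement up to the caller, so the summary can go in the per-request tail without touching the cached prefix.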

Watch out

"It is in the context" does not mean "the model used it." Position matters as much as presence. Test by moving a key instruction to the end; if quality jumps, you had a lost-in-the-middle problem.

Common mistakes that cost hours

1. Everything in context. Dumping the whole KB or full history because the window is big. You pay for every token and dilute attention; retrieve top-k instead.
2. Cache-busting prefixes. Putting a timestamp, per-user greeting, or reordered tool list at the start. It changes the prefix every call, so nothing is ever a cache hit. Keep volatile data at the tail.
3. No ordering. Scattering instructions through the middle of a long prompt. Put the must-follow rules at the top and the task at the very end where attention is highest.
4. Forgetting the cache TTL. Provider caches expire (often within minutes). A prefix only stays cached if you keep using it; spaced-out calls re-pay the write cost.
5. Retrieving too much. Top-50 instead of top-5. More chunks means more noise and a higher chance the relevant one gets lost in the middle.
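The TTL mistake in the list above is easier to see with a little arithmetic. The sketch below compares the cost of the prefix across several calls, in units of one uncached input token. The multipliers are assumptions for illustration (a cache write priced at 1.25x a normal input token and a cache read at 0.1x); check your provider's current pricing before relying on them.

```python
def uncached_prefix_cost(n_calls: int, prefix_tokens: int) -> float:
    """Prefix cost with no caching: every call pays full price."""
    return float(n_calls * prefix_tokens)


def cached_prefix_cost(
    n_calls: int,
    prefix_tokens: int,
    write_mult: float = 1.25,  # assumed premium for writing the cache
    read_mult: float = 0.10,   # assumed discount for reading from the cache
    cache_alive: bool = True,
) -> float:
    """Prefix cost with caching, in units of one uncached input token."""
    if n_calls <= 0:
        return 0.0
    if not cache_alive:
        # Calls spaced beyond the TTL: each one re-pays the write premium.
        return n_calls * prefix_tokens * write_mult
    # One write on the first call, then discounted reads on the rest.
    return prefix_tokens * write_mult + (n_calls - 1) * prefix_tokens * read_mult
```

Under these assumed multipliers, the saving on the prefix approaches 10x as the number of calls within the TTL grows, while calls spaced past the TTL cost more than not caching at all. The saving on the whole request is smaller, because the retrieval + task tail is always full price.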

Takeaways

The whole article in seven lines

A 1M-token window is a limit, not a target. Context is a budget; curate it.
Three layers, fixed order: stable prefix → retrieved context → task (last).
Triage every fact: in context, retrieve on demand, or leave out.
Prompt caching reuses an identical prefix for 5–10x cost/latency wins.
Keep the cached prefix byte-for-byte identical; push volatile data to the tail.
Models lose the middle; put critical instructions at the top or bottom.
Context rot is creeping noise; trim, summarize, and retrieve fresh.

Where to go next

Context engineering sits between the basics of writing a good prompt and the discipline of running LLMs cheaply at scale. Read these next to round it out.

Prompt Engineering Fundamentals: the wording and structuring skills context engineering builds on.
LLM Cost & Latency Optimization: caching economics, batching, and model selection in depth.
Follow the full AI Engineer career path to see where context engineering fits among retrieval, evals, and agents.

Check your understanding

1. According to the article, what is the main reason that stuffing a huge prompt full of everything makes answers worse?

2. Why does the article place the stable prefix (system instructions, tool definitions, few-shot examples) first in the prompt?

Frequently asked questions

If my context window is huge, why not just put everything in it?

Stuffing everything into a giant prompt makes it slower, pricier, and dumber. A bigger window is a bigger surface area for the model to get lost in, and adding a low-value document dilutes the model's attention on the high-value sentence next to it.

What is context engineering?

It is deciding what to put in front of the model, in what order, and what to deliberately leave out. It has quietly replaced "prompt engineering," which focused on finding the magic phrasing, because context is a budget, not a bucket.

How does prompt caching deliver a 5–10x win?

A well-engineered prompt puts a stable prefix first, such as system instructions, tool definitions, and few-shot examples, which never changes between calls. On a cache hit that prefix is read from the provider's cache instead of being reprocessed, which is where the latency and cost win comes from.

In what order should I assemble a prompt?

Use three layers in a fixed order: the stable prefix first, then the retrieved context relevant to this specific question, then the user's actual task last. The fixed order is what lets the provider cache the unchanging prefix.
