Context windows hit 1M+ tokens, so the skill stopped being clever wording and became curating what goes in. Learn context engineering, ordering, prompt caching for 5–10x cost/latency wins, and how to dodge context rot.
The context window grew, so you fed it. The whole knowledge base, the full chat history, every tool result, three example documents "just in case." The window said 1,000,000 tokens, so surely more is better. Then the bill arrived, the latency doubled, and, the part nobody warns you about, the answers got worse. The model started ignoring the one paragraph that mattered and confidently citing a stale fact you stuffed in at position 400,000.
This is the counterintuitive truth of modern LLM work: stuffing everything into a huge prompt makes it slower, pricier, and dumber. A bigger window is not a license to dump; it is a bigger surface area for the model to get lost in. The skill that used to be "prompt engineering" (finding the magic phrasing) has been quietly replaced by context engineering: deciding what to put in front of the model, in what order, and what to deliberately leave out.
Who this is for
Engineers building RAG pipelines, agents, or any LLM feature where you assemble the prompt yourself. If you have ever wondered why your app is slow and expensive despite a "big enough" context window, or why quality dropped when you added more docs, this is for you. No ML background required; if you can build an API call, you can do this.
The principle: context is a budget
Context is a budget, not a bucket. Every token you spend competes for the model's attention, so curate it; don't fill it.
The window size is the *limit*, not the *target*. Treat the prompt like a carry-on bag, not a moving truck: it has a size, but the goal is to pack only what the trip needs. Two costs scale with what you pack: money (you pay per input token) and attention (the model spreads a fixed budget of focus across everything you include). Add a low-value document and you have not just wasted cents; you have diluted the model's attention on the high-value sentence next to it.
Packing a carry-on for a trip | Choosing which docs/history/tools enter the prompt
A desk with only the files for today's task on it | A curated context: signal, no clutter
A briefing memo: summary on top, appendix at the back | Ordering: put the critical instructions where attention is highest
A reusable cover letter you tweak per job | A stable prefix that gets cached and reused across calls
Context engineering maps onto things you already manage every day.
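To make the budget idea concrete, here is a minimal sketch of packing a prompt under a token budget. The `Candidate` type, the `relevance` score, and the four-characters-per-token estimate are all assumptions for illustration; a real pipeline would use its retriever's scores and the provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical score from your retriever, higher is better


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Swap in a real tokenizer for anything beyond a sketch.
    return max(1, len(text) // 4)


def pack_context(candidates: list[Candidate], budget_tokens: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit the budget."""
    chosen: list[Candidate] = []
    spent = 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if spent + cost <= budget_tokens:
            chosen.append(cand)
            spent += cost
    return chosen


candidates = [
    Candidate("x" * 400, relevance=0.9),  # about 100 estimated tokens
    Candidate("y" * 800, relevance=0.2),  # about 200 estimated tokens
    Candidate("z" * 200, relevance=0.7),  # about 50 estimated tokens
]
packed = pack_context(candidates, budget_tokens=160)
print([c.relevance for c in packed])
```

The low-relevance document is the one that gets left behind, which is the point: the budget forces a ranking instead of a dump.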
The picture: prefix, retrieval, task
A well-engineered prompt has three layers assembled in a fixed order. The stable prefix (system instructions, tool definitions, few-shot examples) never changes between calls, so the provider can cache it. The retrieved context (the handful of documents relevant to *this* question) changes per request. The task (the user's actual question) comes last. On a cache hit, the prefix is read from the provider's cache instead of being reprocessed, which is where the 5–10x latency and cost win comes from.
Stable prefix is cached and reused; only retrieval + task are fresh each call.
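How large the win is depends on how much of each prompt is prefix and how often it is reused. The sketch below works through the arithmetic. The write and read multipliers are assumed, illustrative values rather than any provider's quoted pricing, and the token counts are made up.

```python
def relative_input_cost(
    prefix_tokens: int,
    tail_tokens: int,
    calls: int,
    write_mult: float = 1.25,  # assumed surcharge for writing the prefix to cache
    read_mult: float = 0.10,   # assumed discount for reading a cached prefix
) -> float:
    """Input cost with caching divided by input cost without it."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    uncached = calls * (prefix_tokens + tail_tokens)
    cached = (
        prefix_tokens * write_mult                  # first call writes the prefix
        + (calls - 1) * prefix_tokens * read_mult   # later calls read it from cache
        + calls * tail_tokens                       # the tail is always full price
    )
    return cached / uncached


ratio = relative_input_cost(prefix_tokens=10_000, tail_tokens=500, calls=100)
print(f"cached input cost is {ratio:.0%} of uncached")
```

Under these assumptions the input bill drops to roughly a sixth. A short prefix, a long tail, or calls spaced far enough apart to miss the cache all shrink the saving.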
1. Pin the stable prefix. System prompt, tool schemas, and few-shot examples go first and stay byte-for-byte identical across requests.
2. Mark the cache breakpoint. Tell the provider where the reusable prefix ends so it can store and reuse everything up to that point.
3. Retrieve, don't dump. Fetch only the top-k documents relevant to this question and append them after the prefix.
4. Place the task last. End with the user's actual question so the model's freshest attention lands on what you want answered.
5. Reuse on the next call. Same prefix → cache hit. You pay full price only for the retrieval + task tail.
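The assembly order in the steps above can be sketched without any provider SDK. The `build_request` helper and its returned dictionary shape are hypothetical, chosen only to show the layering; the documents are assumed to arrive already ranked by relevance.

```python
def build_request(prefix: str, ranked_docs: list[str], question: str, k: int = 3) -> dict:
    """Assemble the three layers in a fixed order: prefix, retrieval, task."""
    top_docs = ranked_docs[:k]  # retrieve, don't dump
    retrieved = "\n\n".join(f"<doc>{d}</doc>" for d in top_docs)
    return {
        "system": prefix,  # stable layer, identical on every call
        "user": f"{retrieved}\n\nQuestion: {question}",  # fresh tail, task last
    }


PREFIX = "You are a support assistant. Answer only from the provided context."
docs = ["doc A", "doc B", "doc C", "doc D"]  # assumed already ranked by relevance

first = build_request(PREFIX, docs, "How do I rotate an API key?", k=2)
second = build_request(PREFIX, docs[::-1], "How do I reset my password?", k=2)

print(first["system"] == second["system"])  # the prefix layer does not change
print(first["user"])
```

Only the `user` part differs between the two requests, so only that part would need to be processed fresh on a cache hit.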
What belongs where
The hardest decision is triage: for every piece of information, decide whether it lives *in* the context, gets *retrieved* on demand, or stays *out* entirely. The default instinct is to include everything; fight it. Here is a working rule of thumb.
Bucket | What goes here | Why
In context (stable prefix) | System instructions, role, output format, tool definitions, 2–3 gold-standard examples | Needed on every call and unchanging, so perfect for caching
Retrieved per request | The top-k documents, the relevant rows, the specific file the user named, recent turns that matter | Relevance changes per question; pull only what this query needs
Leave out | The full knowledge base, entire chat history, low-relevance "just in case" docs, redundant boilerplate | You pay for every token and it dilutes attention
Triage every piece of information into one of three buckets.
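The three buckets can be expressed as a small decision function. This is a sketch of the rule of thumb, not a tested policy: the inputs and the `min_relevance` threshold are hypothetical and would need tuning against your own evals.

```python
from enum import Enum


class Bucket(Enum):
    IN_CONTEXT = "in context (stable prefix)"
    RETRIEVE = "retrieved per request"
    LEAVE_OUT = "leave out"


def triage(
    needed_every_call: bool,
    changes_between_calls: bool,
    relevance: float,
    min_relevance: float = 0.5,  # hypothetical cutoff, tune it with measurements
) -> Bucket:
    # Unchanging and always needed: it belongs in the cacheable prefix.
    if needed_every_call and not changes_between_calls:
        return Bucket.IN_CONTEXT
    # Anything volatile is decided per request, on relevance to this query.
    if relevance >= min_relevance:
        return Bucket.RETRIEVE
    # When in doubt, leave it out.
    return Bucket.LEAVE_OUT


print(triage(needed_every_call=True, changes_between_calls=False, relevance=0.0))
print(triage(needed_every_call=False, changes_between_calls=True, relevance=0.8))
print(triage(needed_every_call=False, changes_between_calls=True, relevance=0.1))
```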
Pro tip
When in doubt, leave it out and measure. It is far easier to add a document back after you see quality dip than to notice that one of fifty docs is quietly poisoning every answer.
Structuring a prompt for cache reuse
Caching only works if the prefix is identical byte-for-byte across calls. Anything that changes early (a timestamp, a per-user greeting, a reordered tool list) busts the cache for everything after it. The pattern below keeps the volatile parts at the tail so the expensive prefix stays cacheable.
prompt.py
python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# STABLE PREFIX: identical on every call, so it can be cached.
# Providers typically enforce a minimum cacheable length, so treat this short
# prompt as illustrative; a real prefix (instructions, tools, examples) is longer.
SYSTEM = [
    {
        "type": "text",
        "text": (
            "You are a support assistant for Acme Cloud.\n"
            "Answer only from the provided context. If unsure, say so.\n"
            "Format: a one-line answer, then a short why."
        ),
        # Mark the end of the reusable prefix.
        "cache_control": {"type": "ephemeral"},
    },
]


def answer(question: str, docs: list[str]) -> str:
    # RETRIEVED CONTEXT changes per request; append it AFTER the stable prefix.
    retrieved = "\n\n".join(f"<doc>{d}</doc>" for d in docs)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=SYSTEM,  # cached prefix
        messages=[
            # Docs first, then the TASK last so attention lands on it.
            {"role": "user", "content": f"{retrieved}\n\nQuestion: {question}"},
        ],
    )
    return response.content[0].text


# Placeholder: replace with the top-k chunks your retriever returns.
retrieved_docs = ["<retrieved chunk 1>", "<retrieved chunk 2>"]

# A second call with the same SYSTEM is a cache hit on the prefix.
print(answer("How do I rotate an API key?", retrieved_docs))
Two things make this work. First, the system prompt carries a cache_control breakpoint, so the provider stores everything up to it. Second, nothing volatile (no "Today is " + date, no per-user name) sits inside the cached block. For deeper cost mechanics, see LLM Cost & Latency Optimization.
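One way to catch cache-busting changes before they reach production is to fingerprint the prefix and compare it across builds or requests. The sketch below is an assumption-laden proxy: `prefix_fingerprint` is a hypothetical helper that hashes a local JSON serialization, which is not necessarily the exact byte sequence the provider caches, and the `lookup_order` tool is invented for the demo.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict], tools: list[dict]) -> str:
    """Hash the prefix so any change, including reordering, is visible."""
    payload = json.dumps(
        {"system": system_blocks, "tools": tools},
        ensure_ascii=False,  # keys are deliberately not sorted: order changes count
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


system = [{"type": "text", "text": "You are a support assistant for Acme Cloud."}]
tools = [{"name": "search_docs"}, {"name": "lookup_order"}]  # hypothetical tools

baseline = prefix_fingerprint(system, tools)

# Same inputs produce the same fingerprint: safe for cache reuse.
print(prefix_fingerprint(system, tools) == baseline)

# A volatile value at the front changes the fingerprint: the cache is busted.
stamped = [{"type": "text", "text": "Today is 2025-01-01. " + system[0]["text"]}]
print(prefix_fingerprint(stamped, tools) == baseline)

# Reordering the tool list also changes it.
print(prefix_fingerprint(system, tools[::-1]) == baseline)
```

Logging this fingerprint alongside each request makes an unexpected drop in cache hits easy to trace back to the deploy that changed the prefix.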
Context rot and lost-in-the-middle
Models do not read context evenly. They attend strongly to the beginning and the end of the window and weakly to the middle: the well-documented "lost-in-the-middle" effect. Bury a critical instruction at token 300,000 of a million-token prompt and the model may simply not act on it, even though it is technically "in context."
Context rot is the slower cousin. As you keep appending (every tool result, every prior turn, every retrieved chunk), the ratio of signal to noise falls. Old, stale, or contradictory snippets accumulate, and the model starts blending them into the answer. The window is not full, but it is *polluted*. The fix is the opposite of the cause: curate. Trim history, summarize old turns, retrieve fresh, and put the must-not-miss instruction at the top or the bottom, never the middle.
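Trimming and summarizing history can be sketched as a small compaction step run before each call. Everything here is illustrative: `compact_history` is a hypothetical helper, the message dictionaries use a generic role/content shape, and the `summarize` callable stands in for what would normally be a separate, cheaper model call.

```python
from typing import Callable


def compact_history(
    turns: list[dict],
    keep_last: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Keep the most recent turns verbatim and fold older ones into a summary."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary_turn = {
        "role": "user",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return [summary_turn, *recent]


def placeholder_summarizer(old_turns: list[dict]) -> str:
    # Stand-in only: a real summarizer would condense the actual content.
    return f"{len(old_turns)} earlier turns omitted"


history = [{"role": "user", "content": f"message {i}"} for i in range(6)]
compacted = compact_history(history, keep_last=2, summarize=placeholder_summarizer)

for turn in compacted:
    print(turn["content"])
```

Six turns become three: one summary plus the two most recent messages, so stale detail stops accumulating while the live thread stays intact.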
Watch out
"It is in the context" does not mean "the model used it." Position matters as much as presence. Test by moving a key instruction to the end: if quality jumps, you had a lost-in-the-middle problem.
Common mistakes that cost hours
Everything in context. Dumping the whole KB or full history because the window is big. You pay for every token and dilute attention; retrieve top-k instead.
Cache-busting prefixes. Putting a timestamp, per-user greeting, or reordered tool list at the start. It changes the prefix every call, so nothing is ever a cache hit. Keep volatile data at the tail.
No ordering. Scattering instructions through the middle of a long prompt. Put the must-follow rules at the top and the task at the very end where attention is highest.
Forgetting the cache TTL. Provider caches expire (often minutes). A prefix only stays cached if you keep using it; spaced-out calls re-pay the write cost.
Retrieving too much. Top-50 instead of top-5. More chunks means more noise and a higher chance the relevant one gets lost in the middle.
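The TTL mistake is easy to reason about with a small simulation. This sketch counts how many times a prefix would have to be re-written to the cache for a given call pattern. It assumes each use restarts the expiry clock, in line with the note above that a prefix stays cached only while you keep using it, and the 300-second TTL is an illustrative value, not a quoted provider setting.

```python
def count_cache_writes(call_times_s: list[float], ttl_s: float) -> int:
    """Count calls that find the cache cold or expired and must pay the write cost."""
    writes = 0
    expires_at = None
    for t in sorted(call_times_s):
        if expires_at is None or t >= expires_at:
            writes += 1  # cold or expired: the prefix is written again
        # Assumption: every use, write or hit, restarts the TTL.
        expires_at = t + ttl_s
    return writes


TTL_S = 300.0  # illustrative value only

steady = [0, 60, 120, 600, 650]  # one long gap between 120s and 600s
sparse = [0, 400, 800]           # every gap is longer than the TTL

print(count_cache_writes(steady, TTL_S))
print(count_cache_writes(sparse, TTL_S))
```

The sparse pattern pays the write cost on every call, which is worse than not caching at all if writes carry a surcharge.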
Takeaways
The whole article in six lines
A 1M-token window is a limit, not a target. Context is a budget: curate it.
Triage every fact: in context, retrieve on demand, or leave out.
Prompt caching reuses an identical prefix for 5–10x cost/latency wins.
Keep the cached prefix byte-for-byte identical; push volatile data to the tail.
Models lose the middle, so put critical instructions at the top or bottom.
Context rot is creeping noise; trim, summarize, and retrieve fresh.
Where to go next
Context engineering sits between the basics of writing a good prompt and the discipline of running LLMs cheaply at scale. Read these next to round it out.
Follow the full AI Engineer career path to see where context engineering fits among retrieval, evals, and agents.