Context Engineering: The Skill That Replaced Prompt Engineering in 2026
Andrej Karpathy was right — 'prompt engineering' was always the wrong frame. Context engineering is the practice of constructing the optimal input for a language model at runtime. Here's what that actually means in production, and why it explains most of the gap between AI demos and AI products.
Jordan Reeves
Developer Experience Lead
Andrej Karpathy put it plainly: "Prompt engineering" is a misleading term. It implies that the craft of working with language models is about wording instructions carefully — writing the magic sentence that unlocks the right output. That framing has always understated the real work, and in 2026 it no longer describes what serious AI teams are actually doing.
What they are doing is context engineering: the deliberate, systematic practice of constructing the right information at the right time in the right format to occupy an LLM's context window at inference. It is part retrieval, part memory management, part system design, and part UX. It is not a prompt. And teams that treat it like one are leaving most of their model's capability on the table.
This post draws on the growing body of practitioner writing — from DSPy researchers at Stanford, to engineering posts from Cohere and Mistral, to developer threads on r/MachineLearning and Hacker News — to map what context engineering actually involves, why it matters more than model size, and what to build first.
Why "Prompt Engineering" Was Always the Wrong Frame
The phrase prompt engineering came from GPT-3-era intuitions: models were opaque, the right phrasing unlocked otherwise hidden capabilities, and the skill was essentially incantatory. Write the spell correctly and the model performed.
That mental model broke down for three reasons:
- Context windows expanded. When models had 4K tokens, prompts were short. At 128K, 200K, and 1M+ tokens, what you put in the window is a system design problem, not a writing problem.
- Systems became dynamic. Modern AI applications don't use a static prompt. They retrieve documents, call tools, maintain memory across sessions, and process intermediate results. The "prompt" is assembled at runtime from many sources.
- The signal moved. Empirically, the biggest performance gaps between teams come from context quality, not from prompt wording. The same model with better context consistently outperforms a larger model with poor context.
The practical consequence: if you are still optimising the wording of a static system prompt while your context architecture is unexamined, you are polishing the wrong thing.
The Six Layers of Context
Context engineering means thinking deliberately about six distinct components that can occupy a model's context window at inference time:
- Layer 1 — System prompt and instructions. Role definition, task constraints, output format specification, tool schemas. This is fixed and relatively small — typically 5–10% of the budget. Most over-engineering happens here while the other layers are ignored.
- Layer 2 — Few-shot examples. Input/output demonstrations that anchor format, tone, and edge case handling. Good few-shots do more reliable work than elaborate instruction paragraphs. They are also dynamic — you can retrieve relevant examples from a library rather than hard-coding them.
- Layer 3 — Working memory and state. The agent's scratchpad: intermediate reasoning, task progress, structured output schema, decisions made so far. This layer grows during a task and needs active management to avoid filling the window.
- Layer 4 — Retrieved documents (RAG). External knowledge fetched at inference time: vector search results, database lookups, codebase snippets, real-time data. This is where most of the context budget goes — and where most of the retrieval quality problems live.
- Layer 5 — Tool and action results. The output of API calls, code execution, search, and other tool invocations. These are often verbose and need truncation, summarisation, or selective inclusion.
- Layer 6 — Conversation history. Prior turns, summarised memory, and persisted user preferences. Long conversations require compression strategies to avoid quadratic cost growth.
The key insight from this decomposition: a "prompt" is layer 1 at best. Context engineering is the practice of managing all six layers simultaneously within a finite token budget.
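As a minimal sketch of what "managing all six layers within a finite token budget" looks like in code (all names and the token counter are illustrative, not a real API):

```python
# Hypothetical sketch: assemble the six context layers under a token budget.
# Layer names, contents, and the token counter are illustrative stand-ins.

def count_tokens(text: str) -> int:
    # Crude proxy: ~4 characters per token. Real systems use the model's tokenizer.
    return max(1, len(text) // 4)

def assemble_context(layers: list[tuple[str, str]], budget: int) -> str:
    """Include layers in priority order; drop whatever no longer fits."""
    parts, used = [], 0
    for name, content in layers:  # earlier layers = higher priority
        cost = count_tokens(content)
        if used + cost > budget:
            continue  # skip layers that would blow the budget
        parts.append(f"## {name}\n{content}")
        used += cost
    return "\n\n".join(parts)

layers = [
    ("system", "You are a support agent. Answer from the documents."),
    ("few_shots", "Q: ... A: ..."),
    ("working_memory", "Step 1 done: ticket classified as billing."),
    ("retrieved_docs", "Refund policy: refunds within 30 days..."),
    ("tool_results", "{\"ticket_id\": 812, \"status\": \"open\"}"),
    ("history", "User: my refund hasn't arrived."),
]
window = assemble_context(layers, budget=200)
```

A production assembler would prioritise and truncate per layer rather than dropping layers wholesale, but the shape is the same: one function owns the window, and every layer competes for the same budget.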
Context Quality Beats Model Size
The most practically important finding in the context engineering literature is also the most counterintuitive: a smaller model with well-engineered context routinely outperforms a larger model with poor context.
This has been demonstrated across multiple benchmarks. In code generation tasks, GPT-4o with carefully curated few-shots and relevant codebase context outperforms o3 given a generic system prompt. In enterprise RAG deployments, switching from naive full-document retrieval to semantic chunking with re-ranking produces larger accuracy gains than upgrading the model.
The implication for budget allocation is direct: before spending on a larger model, audit your context. In most production systems, the ceiling is context quality, not model capability.
Developer accounts from the community make this concrete:
"We spent three months trying different models. One sprint on context quality got us the same gains we'd been chasing for a quarter." — Developer on Hacker News
The Context Budget: Treating Tokens as a Finite Resource
The single most useful shift in mental model for context engineering is to treat tokens as a budget. The context window is a scarce resource. Everything that goes in displaces something else.
A 128K token window sounds large until you have:
- A system prompt with tool schemas: ~2K tokens
- 10 retrieved document chunks at 500 tokens each: 5K tokens
- Tool results from 3 API calls: ~3K tokens
- 20 turns of conversation history: ~8K tokens
- Working memory and scratchpad: ~2K tokens
That's 20K tokens already, and the model hasn't produced output yet. In a multi-agent system where context accumulates across handoffs, hitting the ceiling happens faster than most teams expect.
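The arithmetic above is worth automating. A hypothetical budget ledger, mirroring the breakdown, that flags when the inputs leave too little headroom for generation:

```python
# Hypothetical context-budget ledger mirroring the breakdown above.
WINDOW = 128_000

usage = {
    "system_prompt": 2_000,
    "retrieved_chunks": 10 * 500,
    "tool_results": 3_000,
    "history": 8_000,
    "working_memory": 2_000,
}

consumed = sum(usage.values())   # 20_000 before any output is produced
headroom = WINDOW - consumed     # what's left for further context + generation

def over_budget(usage: dict[str, int], window: int, output_reserve: int) -> bool:
    """True if the inputs leave less than `output_reserve` tokens for output."""
    return window - sum(usage.values()) < output_reserve
```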
Budget management techniques that appear repeatedly in production systems:
Tiered retrieval
Don't retrieve full documents. Use a two-stage pipeline: a fast, cheap retrieval model selects candidates; a cross-encoder re-ranker selects the highest-relevance passages. A top-5 re-ranked result typically contains more signal than a top-20 naive retrieval, at a fraction of the token cost.
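The two-stage pipeline can be sketched as follows. Both scorers here are toy stand-ins so the example is self-contained: stage 1 would normally be a vector index, and stage 2 a cross-encoder model.

```python
# Two-stage (tiered) retrieval sketch. Both scorers are toy stand-ins: a real
# system uses a vector index for stage 1 and a cross-encoder for stage 2.

def cheap_score(query: str, doc: str) -> float:
    # Stage 1 stand-in: fast lexical overlap over the whole corpus.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(1, len(q))

def rerank_score(query: str, doc: str) -> float:
    # Stage 2 stand-in: "expensive" scorer run only on the candidates.
    # Here: overlap weighted by how early query terms appear in the doc.
    words = doc.lower().split()
    hits = [words.index(t) for t in set(query.lower().split()) if t in words]
    return len(hits) + sum(1.0 / (1 + i) for i in hits)

def tiered_retrieve(query: str, docs: list[str], k1: int = 20, k2: int = 5) -> list[str]:
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k2]
```

The structure is the point: the expensive scorer only ever sees `k1` candidates, so re-ranking cost stays flat as the corpus grows.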
Progressive summarisation
Conversation history and intermediate state grow without bound if unmanaged. The production pattern is to summarise after a threshold — typically after every N turns or when the history exceeds X tokens. The summary replaces the raw history. This is lossy, but the loss is controlled and predictable.
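A minimal sketch of the summarise-after-threshold pattern. The `summarise` function here is a stand-in for an LLM call (it just keeps the first sentence of each turn), and the thresholds are illustrative:

```python
# Progressive summarisation sketch. `summarise` stands in for an LLM call;
# here it keeps the first sentence of each turn so the example is runnable.

MAX_HISTORY_TOKENS = 60  # illustrative threshold
KEEP_RECENT = 2          # always keep the last N turns verbatim

def tokens(text: str) -> int:
    return len(text.split())  # crude word-count proxy for tokens

def summarise(turns: list[str]) -> str:
    return "Summary: " + " ".join(t.split(".")[0] + "." for t in turns)

def compact_history(history: list[str]) -> list[str]:
    if sum(tokens(t) for t in history) <= MAX_HISTORY_TOKENS:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarise(old)] + recent  # the summary replaces the raw turns
```

Keeping the most recent turns verbatim is what makes the loss controlled: the model always sees the immediate exchange exactly, and only older material is compressed.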
Selective tool output inclusion
API responses and code execution output are often verbose. A common pattern is to extract structured fields rather than including raw responses. If a web search returns 10 pages of HTML, extract the relevant paragraph, not the page. If a database query returns 1,000 rows, include the aggregation, not the table.
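For the 1,000-row case, the pattern looks like this (field names are illustrative): aggregate the rows and put the aggregation in context, never the raw table.

```python
# Selective inclusion sketch: summarise a verbose tool result into the
# structured aggregation that actually goes into context. Field names are
# illustrative, not a real schema.

def summarise_rows(rows: list[dict], group_key: str, value_key: str) -> dict:
    """Aggregate rows to per-group counts and totals; include this, not the rows."""
    out: dict[str, dict[str, float]] = {}
    for row in rows:
        g = out.setdefault(row[group_key], {"count": 0, "total": 0.0})
        g["count"] += 1
        g["total"] += row[value_key]
    return out

rows = [{"region": "EU", "revenue": 120.0},
        {"region": "EU", "revenue": 80.0},
        {"region": "US", "revenue": 200.0}]
context_fragment = summarise_rows(rows, "region", "revenue")
```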
Dynamic few-shots
Hard-coded few-shot examples are a missed opportunity. If you have a library of 200 examples, retrieve the 3–5 most semantically similar to the current input. This technique — retrieval-augmented few-shot selection — is one of the highest-leverage context optimisations available and remains underused in practice.
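Retrieval-augmented few-shot selection can be sketched with a toy similarity function standing in for real embeddings:

```python
# Dynamic few-shot sketch. Jaccard word overlap stands in for embedding
# similarity so the example is self-contained; example data is illustrative.

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def select_few_shots(query: str, library: list[dict], k: int = 3) -> list[dict]:
    """Pick the k library examples whose inputs look most like the current input."""
    return sorted(library, key=lambda ex: similarity(query, ex["input"]), reverse=True)[:k]

library = [
    {"input": "cancel my subscription", "output": "route: billing"},
    {"input": "app crashes on launch", "output": "route: bugs"},
    {"input": "refund my last payment", "output": "route: billing"},
    {"input": "how do I export data", "output": "route: howto"},
]
shots = select_few_shots("please cancel subscription renewal", library, k=2)
```

The selected examples are then formatted into layer 2 of the window, so the few-shots the model sees are always the ones closest to the task at hand.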
The Retrieval Quality Problem
RAG is where most production context engineering fails. The retrieval step is often treated as solved — embed documents, run a cosine similarity search, return top-K — but the naive implementation has well-documented failure modes:
Chunk boundary errors
Fixed-size chunking often splits semantically coherent passages across chunks, returning half-answers. Document-aware chunking (splitting at section boundaries, paragraphs, or semantic units) consistently outperforms fixed-size chunking on retrieval benchmarks. The cost is higher indexing complexity; the benefit is fewer hallucinations from incomplete context.
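A minimal version of document-aware chunking, assuming plain-text documents with blank-line paragraph boundaries: paragraphs are packed into chunks up to a size cap, and no paragraph is ever split.

```python
# Document-aware chunking sketch: split at blank-line paragraph boundaries and
# pack whole paragraphs into chunks up to a size cap. The cap is illustrative.

def chunk_by_paragraph(doc: str, max_chars: int = 500) -> list[str]:
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Real pipelines add section-heading awareness and overlap windows on top of this, but the invariant is the same: chunk boundaries coincide with semantic boundaries.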
Missing context for retrieved chunks
A chunk retrieved in isolation often lacks the context that makes it meaningful. Anthropic's "Contextual Retrieval" technique prepends a chunk-specific context summary — generated once at index time, not at query time — to each chunk before embedding. In their benchmarks, this reduced retrieval failures by 49% in combination with re-ranking.
Query-document mismatch
User queries and document content are in different linguistic registers. Queries are short and conversational; documents are long and formal. HyDE (Hypothetical Document Embeddings) addresses this by generating a hypothetical answer to the query and embedding that, rather than embedding the raw query. The generated answer is in the same linguistic space as the documents, improving retrieval alignment.
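The HyDE flow, sketched with stand-ins: in the real technique, `generate_hypothetical_answer` is an LLM call and `embed` is a dense embedding model; here both are toys so the example runs.

```python
# HyDE sketch. `generate_hypothetical_answer` stands in for an LLM call and
# `embed` for a real embedding model; both are toys so the example executes.

def generate_hypothetical_answer(query: str) -> str:
    # In HyDE this is an LLM writing a plausible (possibly wrong)
    # document-style answer; its factual accuracy doesn't matter, its
    # linguistic register does.
    return f"{query}. Our policy states that {query.lower()} is handled within 30 days."

def embed(text: str) -> set[str]:
    # Toy "embedding": bag of words. Real systems use dense vectors.
    return set(text.lower().replace(".", "").split())

def hyde_retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    hypo = embed(generate_hypothetical_answer(query))

    def score(d: str) -> float:
        dv = embed(d)
        return len(hypo & dv) / max(1, len(dv))

    return sorted(docs, key=score, reverse=True)[:k]
```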
Memory Architecture Across Sessions
Single-session context engineering is hard. Multi-session memory architecture is harder. Three patterns have emerged as production-ready:
Semantic memory stores
User preferences, past decisions, and learned patterns stored as vector-indexed records. At session start, retrieve the top-K memories most relevant to the current task. This gives agents a form of long-term memory without bloating the context window with complete history.
Entity-centric memory
Maintain a knowledge graph of entities the agent has encountered (users, projects, concepts) with attributes and relationships. At inference time, retrieve the subgraph relevant to the current context. This is the approach behind MemGPT and several enterprise agent frameworks — it scales better than flat memory stores for agents operating over long time horizons.
Episodic memory with recency decay
Store complete past interactions as episodes with recency-weighted retrieval scores. Recent episodes are retrieved more readily; older ones fade unless reinforced by retrieval. This mirrors human memory and avoids the pathological case where early session context permanently occupies the window.
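Recency-weighted scoring can be as simple as exponential decay with a reinforcement offset. A sketch with illustrative parameter values:

```python
# Episodic memory sketch: relevance damped by exponential recency decay, with
# retrieval "reinforcement" partially offsetting the decay. Parameter values
# are illustrative, not tuned.

HALF_LIFE_DAYS = 7.0  # an episode's weight halves every week unless reinforced

def recency_weight(age_days: float) -> float:
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def episode_score(relevance: float, age_days: float, reinforcements: int) -> float:
    # Each past retrieval makes the episode behave as if it were 2 days younger.
    return relevance * recency_weight(max(0.0, age_days - 2.0 * reinforcements))
```

Episodes are ranked by `episode_score` at retrieval time; frequently retrieved episodes stay competitive while untouched ones fade, which is exactly the reinforcement dynamic described above.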
Context Engineering in Agentic Systems
Everything above becomes more complex in multi-agent systems. Each agent has its own context window. Handoffs between agents carry accumulated context. And the orchestrator needs to make decisions about what state to propagate and what to discard.
Two patterns from the previous post on agent orchestration apply directly here:
State externalisation. Don't accumulate state in agent context windows. Write intermediate results to a shared external store (a database, a structured file, a vector store). At each step, retrieve only the state the agent needs for the current sub-task. This prevents the quadratic cost growth that kills multi-agent system budgets.
Handoff schemas. When one agent hands off to another, it should not pass its entire context window. It should produce a structured handoff document: a summary of what was done, what was decided, what the next agent needs to know. This is context engineering at the system design level — the interface between agents is a context design problem.
From the Anthropic multi-agent research paper:
"The quality of the orchestrator's context determines the quality of every downstream agent. If the orchestrator's working state is polluted with irrelevant history, every handoff carries that pollution forward."
Measuring Context Quality
You can't improve what you can't measure. Context quality measurement is an emerging practice, but several signals have proven useful in production:
- Retrieval precision@K. Of the K chunks retrieved, how many were cited or used in the final output? Low precision means irrelevant material is polluting the context.
- Context utilisation rate. What fraction of the tokens in the context window contributed to the final output? LLM-based judges can evaluate this. Low utilisation means you're paying for context that isn't helping.
- Faithfulness scores. In RAG systems, does the output faithfully reflect what was in the retrieved documents? High faithfulness with poor retrieval indicates the model is ignoring context. Low faithfulness indicates hallucination.
- Token cost per successful task completion. The most important production metric. If context engineering improvements reduce token cost while maintaining success rate, the investment was worth it.
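Two of these metrics are simple enough to compute directly from logs. A sketch, assuming you record which retrieved chunks the output cited and the token cost and outcome of each run:

```python
# Sketch of two metrics from the list above: retrieval precision@K and token
# cost per successful task completion. Log schema is illustrative.

def precision_at_k(retrieved_ids: list[str], cited_ids: set[str]) -> float:
    """Fraction of the K retrieved chunks actually cited in the final output."""
    if not retrieved_ids:
        return 0.0
    return sum(1 for i in retrieved_ids if i in cited_ids) / len(retrieved_ids)

def cost_per_success(runs: list[dict]) -> float:
    """runs: [{'tokens': int, 'success': bool}, ...] from production logs."""
    successes = sum(1 for r in runs if r["success"])
    total_tokens = sum(r["tokens"] for r in runs)
    return float("inf") if successes == 0 else total_tokens / successes
```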
The DSPy Approach: Systematic Context Optimisation
DSPy, developed at Stanford, takes the most systematic approach to context engineering: rather than manually crafting prompts and few-shots, it treats the entire context construction pipeline as a program with learnable parameters. Given a training set and a metric, DSPy's optimisers automatically discover the best instructions, few-shots, and retrieval strategies for a given task.
The results are striking. On several benchmarks, DSPy-optimised pipelines outperform manually engineered prompts by 10–25% on accuracy metrics while using fewer tokens. The insight underlying DSPy is exactly the context engineering thesis: the bottleneck is the context construction pipeline, and that pipeline should be optimised systematically, not manually.
For teams without the capacity to implement DSPy, the manual equivalent is:
- Log every context window alongside the task outcome (success/failure, quality score)
- Analyse failures: which layer was the problem? Missing retrieval? Irrelevant few-shots? Overloaded history?
- Iterate on the layer with the highest failure rate first
- A/B test context changes with held-out evaluation sets before deploying
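The logging step in that loop can start as a few lines. A minimal sketch, with an illustrative JSON-lines schema, that records each context window alongside its outcome so failures can later be sliced by layer:

```python
# Minimal context-window logging sketch: one JSON line per run, recording each
# layer's content and token weight plus the outcome. Schema is illustrative.

import json
import time

def log_context(path: str, layers: dict[str, str], outcome: dict) -> None:
    record = {
        "ts": time.time(),
        "layer_tokens": {k: len(v.split()) for k, v in layers.items()},
        "layers": layers,    # full text; sample or truncate this at scale
        "outcome": outcome,  # e.g. {"success": False, "score": 0.4}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```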
What to Build First
The priority order for context engineering investments, based on observed ROI in production systems:
- Audit your retrieval pipeline. Switch from fixed-size chunks to semantic or document-aware chunks. Add a re-ranking step. This single change typically produces the largest quality improvement relative to effort.
- Implement conversation summarisation. Cap conversation history at a token limit and summarise when it's exceeded. This is low effort, prevents runaway costs, and has no quality downside if the summarisation is faithful.
- Add dynamic few-shot retrieval. Build a small library of high-quality examples. At inference time, retrieve the top-3 most similar to the current input. This consistently beats hard-coded examples.
- Externalise agent state. If you're building multi-agent systems, move intermediate state out of context windows into structured external stores from the start. Retrofitting this is expensive.
- Instrument context quality. Add logging for context windows and outcomes. Without this, you are guessing about what is and isn't working.
The Deeper Point
Context engineering is not a technique. It is a perspective shift: from treating language models as text generators that respond to instructions, to treating them as inference engines whose performance is bounded by the quality of the information you give them at runtime.
The models are already capable enough for most of what teams are trying to build. The gap between what teams are achieving and what is possible is almost always a context gap — the model didn't have the right information, in the right format, at the right time.
That gap is an engineering problem. And like all engineering problems, it responds to systematic investment. The teams that understand this — that the model is not the product, and the context pipeline is — will keep pulling away from the teams that are still trying to find the magic prompt.
The spell was never the words. It was always the information behind them.