Context Management: Working Within the Token Budget#
Executive Summary#
The context window is Claude’s working memory – everything the model can reference when generating a response. In Claude Code, it fills with the system prompt, conversation history, tool results, and file contents. Managing this space is the single most important factor in maintaining effective sessions as they grow longer.
| Model | Standard Window | Extended (Beta) | Long Context Pricing |
|---|---|---|---|
| Claude Opus 4.6 | 200K tokens | 1M tokens | 2x input, 1.5x output above 200K |
| Claude Sonnet 4.6 | 200K tokens | 1M tokens | 2x input, 1.5x output above 200K |
| Claude Sonnet 4.5 | 200K tokens | – | – |
| Claude Sonnet 4 | 200K tokens | – | – |
| Claude Haiku 4.5 | 200K tokens | – | – |
The 1M token context window is currently available in beta on the API only. Standard claude.ai and Claude Code users access the 200K window unless the beta header is explicitly enabled.
Key insight: Prompt caching reduces the cost of repeated content, but every token still occupies context window space. You can afford a 20,000-token system prompt financially, but those 20,000 tokens are unavailable for conversation content regardless.
What the Context Window Is#
The context window is the total space available for all input and output in a single API call. It includes everything Claude can “see” when generating a response: the system prompt, the full conversation history, tool results, and the response itself.
Think of it as a fixed-size desk. The system prompt is a permanent stack of papers that never leaves. Every message you send and every response Claude gives adds more papers. Every file read and tool result adds more. When the desk fills up, something has to go.
How It Fills Up#
Every API call in a Claude Code session sends:
Context Window (200K tokens)
┌──────────────────────────────────────────────────┐
│ System prompt (fixed) ~15,000-20,000 │
│ ┌──────────────────────────────────────────────┐ │
│ │ Conversation history (grows) │ │
│ │ ├── User message 1 │ │
│ │ ├── Assistant response 1 (+ tool calls) │ │
│ │ ├── Tool results 1 (file contents, etc.) │ │
│ │ ├── User message 2 │ │
│ │ ├── Assistant response 2 │ │
│ │ ├── ... │ │
│ │ └── Latest user message │ │
│ └──────────────────────────────────────────────┘ │
│ Response output (current turn) │
└──────────────────────────────────────────────────┘
The system prompt is constant. The conversation history grows with every turn. The response output needs room too. The usable space for conversation shrinks over time.
What Consumes Context#
Not all content is equal. Some things consume far more tokens than expected:
| Content Type | Typical Size | Notes |
|---|---|---|
| System prompt | 12,000-20,000 tokens | Fixed overhead every message (see system prompt article) |
| User message | 10-200 tokens | Your typed input |
| Assistant response | 100-2,000 tokens | Explanations, reasoning |
| Tool call + result | Varies widely | A Read of a 500-line file can be 5,000+ tokens |
| File read (@-mention) | 100-10,000+ tokens | Entire file contents injected |
| Grep/Glob results | 100-5,000 tokens | Depends on match count |
| Web search results | 500-3,000 tokens | Search snippets |
| System reminders | 50-500 tokens | Hook outputs, plugin status |
The biggest consumers are tool results – especially file reads. Reading a large file dumps its entire contents into the context. In an active coding session, multiple file reads and their associated tool call metadata can consume the majority of your context budget.
A Typical Session’s Context Budget#
For a 200K context window with a 15,000-token system prompt:
Available for conversation: 200,000 - 15,000 = 185,000 tokens
At ~100 tokens per typical message exchange:
Theoretical maximum: ~1,850 simple back-and-forth turns
In practice with tool use (file reads, edits, searches):
~200-500 tokens per tool call + result
~50-100 meaningful tool-using turns before approaching limits
Real sessions hit limits faster than expected because file reads and search results are token-expensive. A single Read of a large file can consume as much context as 50 simple messages.
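The arithmetic above can be reproduced with a small sketch. The 2,500-token figure for a tool-heavy turn is an assumption chosen for illustration (one or two file reads plus a reply); the function name is hypothetical and not part of any Claude Code API.

```python
CONTEXT_WINDOW = 200_000   # standard window from the table above
SYSTEM_PROMPT = 15_000     # example system prompt size used in this article


def max_turns(tokens_per_turn: int,
              window: int = CONTEXT_WINDOW,
              system_prompt: int = SYSTEM_PROMPT) -> int:
    """Number of whole turns that fit in the space left after the system prompt."""
    available = window - system_prompt
    return available // tokens_per_turn


print(max_turns(100))     # simple back-and-forth turns: 1850
print(max_turns(2_500))   # assumed tool-heavy average: 74
```

Under that assumed average, the result lands inside the 50-100 turn range quoted above; a session dominated by large file reads would run out sooner.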
Context Awareness#
Claude Sonnet 4.5 and Haiku 4.5 have a built-in feature called context awareness – the model tracks its remaining context budget throughout a conversation.
At session start, Claude receives its total budget:
<budget:token_budget>200000</budget:token_budget>
After each tool call, Claude receives an update:
<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>
This allows Claude to make informed decisions about how to use remaining context – whether to read a large file, delegate to a subagent, or wrap up work before running out of space.
Note: Opus 4.6 does not currently have context awareness – it does not receive token budget updates. This means Opus may be less efficient at self-managing context in very long sessions.
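These tags are consumed by the model rather than by user code, but their format is simple enough to illustrate what information they carry. The following is a minimal sketch that parses the two example strings shown above; the function name and the returned dictionary keys are hypothetical.

```python
import re
from typing import Optional

BUDGET_RE = re.compile(r"<budget:token_budget>(\d+)</budget:token_budget>")
USAGE_RE = re.compile(
    r"<system_warning>Token usage: (\d+)/(\d+); (\d+) remaining</system_warning>"
)


def parse_usage(text: str) -> Optional[dict]:
    """Extract used/total/remaining from a usage update, or None if absent."""
    match = USAGE_RE.search(text)
    if match is None:
        return None
    used, total, remaining = (int(group) for group in match.groups())
    return {"used": used, "total": total, "remaining": remaining}


update = "<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>"
usage = parse_usage(update)
print(usage)
# Fraction of the budget already spent in this example
print(usage["used"] / usage["total"])
```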
When Context Runs Out: Compaction#
How Compaction Works#
When the conversation approaches the context window limit, Claude Code automatically summarizes older parts of the conversation to free up space. This is called compaction.
The process:
- Input tokens exceed a trigger threshold (default: ~150K tokens)
- Claude generates a summary of the conversation so far
- The summary replaces the original conversation history
- The session continues with the compressed context
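The steps above can be sketched as a toy model that tracks only token counts. This is not Claude Code's implementation: the real summary is written by the model, whereas here it is a fixed size, and `keep_recent` and `summary_tokens` are assumed parameters picked to match the diagram that follows.

```python
TRIGGER_TOKENS = 150_000  # default trigger threshold quoted above


def compact(turn_tokens: list, system_tokens: int,
            keep_recent: int = 3, summary_tokens: int = 3_000) -> list:
    """Replace all but the most recent turns with one summary entry.

    Returns the token sizes of the new history. The history is returned
    unchanged while total input stays at or below the trigger threshold.
    """
    total = system_tokens + sum(turn_tokens)
    if total <= TRIGGER_TOKENS or len(turn_tokens) <= keep_recent:
        return list(turn_tokens)
    # One summary entry stands in for every older turn.
    return [summary_tokens] + turn_tokens[-keep_recent:]


# 45 older turns (assumed 3,400 tokens each) plus three recent turns
history = [3_400] * 45 + [6_000, 5_000, 3_000]
after = compact(history, system_tokens=15_000)
print(15_000 + sum(history))  # 182000 before compaction
print(15_000 + sum(after))    # 32000 after compaction
```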
Before compaction:
┌──────────────────────────────────────────────────┐
│ System prompt 15,000 │
│ Turn 1: user + assistant + tools 12,000 │
│ Turn 2: user + assistant + tools 8,000 │
│ ... │
│ Turn 47: user + assistant + tools 5,000 │
│ Turn 48: current 3,000 │
│ ────────── │
│ Total: ~180,000 │ ← Approaching limit
└──────────────────────────────────────────────────┘
After compaction:
┌──────────────────────────────────────────────────┐
│ System prompt 15,000 │
│ [Summary of turns 1-45] 3,000 │
│ Turn 46: user + assistant + tools 6,000 │
│ Turn 47: user + assistant + tools 5,000 │
│ Turn 48: current 3,000 │
│ ────────── │
│ Total: ~32,000 │ ← Lots of room again
└──────────────────────────────────────────────────┘
Auto-Compact in Claude Code#
Claude Code handles compaction automatically. You don’t need to configure anything. When the context usage reaches roughly 75-92% (depending on internal heuristics), auto-compact triggers and summarizes the conversation.
You’ll see a message like:
Auto-compact: Summarizing conversation to free up context...
This is normal and expected in long sessions. The session continues without interruption – you don’t lose your place.
Manual Compaction#
You can also trigger compaction manually at any time using the /compact command:
/compact # Default summarization
/compact Focus on the API changes # Custom instructions
Manual compaction is available when context usage is above ~70%. Use it proactively when:
- You’re about to start a large task and want maximum context space
- The conversation has accumulated lots of irrelevant history
- You want to control what gets preserved in the summary
You can also influence compaction behavior through CLAUDE.md instructions to ensure critical context survives summarization.
What Compaction Preserves and Loses#
Compaction is a lossy process. The summary captures the gist of the conversation but loses:
| Preserved | Lost |
|---|---|
| Key decisions and their reasoning | Exact wording of earlier exchanges |
| Current state of the task | Detailed file contents from earlier reads |
| Recent turns (kept verbatim) | Intermediate debugging steps |
| Important technical details | Tool call metadata |
| Next steps and pending work | Exploration paths that were abandoned |
This is the fundamental trade-off: compaction lets sessions run indefinitely, but at the cost of detail from earlier in the conversation. The more compactions occur, the more historical detail is lost.
Strategies for Context Efficiency#
Reduce System Prompt Overhead#
The system prompt is a fixed cost on every message. Reducing it frees space for conversation:
- Disable unused plugins – Each plugin adds skills and subagent descriptions (see token optimization article)
- Keep CLAUDE.md concise – Every line is re-sent every message
- Remove unused MCP servers – Each adds tool definitions
A 20K system prompt leaves 180K for conversation. A 12K system prompt leaves 188K. That’s 8K more tokens – roughly two or three reads of a 300-400 line file.
Use Subagents to Offload Work#
Subagents (the Task tool) run in their own isolated context windows. This means their work doesn’t consume your main context:
Main context (200K) Subagent context (200K)
┌────────────────────┐ ┌─────────────────────┐
│ System prompt │ │ Subagent prompt │
│ Conversation │ ──→ │ 40 turns of │
│ "Delegate task" │ │ investigation │
│ ← Summary result │ ←── │ Detailed findings │
│ Continue working │ └─────────────────────┘
└────────────────────┘
A 40-turn investigation that would consume ~20K tokens in your main context becomes a single summary result of ~500 tokens. See subagents as context management below.
Be Selective with File Reads#
File reads are the largest variable context consumer. Strategies:
- Read specific line ranges instead of entire files when you only need a section
- Use Grep first to find relevant lines, then read only those areas
- Avoid re-reading files you’ve already seen unless they’ve changed
- Use Glob to find files before reading – don’t read files speculatively
A 1,000-line file consumes roughly 8,000-10,000 tokens. Reading 5 such files uses ~40,000-50,000 tokens – 20-25% of a 200K context window.
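To make the difference between a full read and a targeted read concrete, here is a rough sketch. The 9 tokens-per-line constant is an assumed midpoint of the figures above, and `read_range` is a hypothetical helper that only mirrors the idea of reading a line range; it does not reproduce the Read tool's exact semantics.

```python
from itertools import islice

TOKENS_PER_LINE = 9  # assumed midpoint of the 8-10 tokens/line implied above


def estimate_tokens(line_count: int, tokens_per_line: int = TOKENS_PER_LINE) -> int:
    """Rough token estimate for reading `line_count` lines of source code."""
    return line_count * tokens_per_line


def read_range(path: str, offset: int, limit: int) -> list:
    """Return up to `limit` lines starting at 0-based line index `offset`."""
    with open(path, encoding="utf-8") as handle:
        return list(islice(handle, offset, offset + limit))


print(estimate_tokens(1_000))  # whole 1,000-line file: 9000
print(estimate_tokens(40))     # a 40-line section of it: 360
```

Under these assumptions, reading only the relevant 40 lines costs about 4% of what the whole file would.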
Keep Conversations Focused#
Context accumulates faster in unfocused sessions:
- Start new sessions for unrelated tasks – Don’t reuse a session for completely different work
- Commit and start fresh when switching between distinct features
- Use /compact proactively before starting a new phase of work
- Be specific in requests – Vague requests lead to more exploratory tool calls that consume context
Use the 1M Context Window#
For sessions that will be particularly long or context-heavy, the 1M token context window provides 5x the standard capacity. This is available in beta for Opus 4.6 and Sonnet 4.6, accessible via the API with the beta header enabled.
Trade-off: Tokens above 200K are billed at 2x input and 1.5x output pricing. For very long sessions, the extra cost may be worth avoiding compaction and its associated information loss. Note that the 1M window is API-only for now – standard claude.ai and Claude Code users remain on the 200K window.
How Context Flows in Claude Code#
A Single Message Round-Trip#
When you send a message, here’s what happens to the context:
You type: "Read the main.go file and add error handling to the HTTP handler"
API call sent:
System prompt: 15,000 tokens (cached)
Previous conversation history: 45,000 tokens (partially cached)
Your new message: 30 tokens
──────────
Total input: 60,030 tokens
Claude responds with tool calls:
[Read main.go] → file contents returned: 3,000 tokens
[Edit main.go] → edit confirmation: 200 tokens
[Read main.go again for verification] → file contents: 3,200 tokens
Text explanation: 500 tokens
New context size after this turn:
Previous: 60,030 tokens
+ Tool calls and results: 6,900 tokens (6,400 in results, leaving ~500 for the tool calls themselves)
+ Response text: 500 tokens
──────────
Total: 67,430 tokens
Each turn adds the user message, all tool calls and their results, and the assistant’s response to the conversation history. This accumulates quickly.
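The round-trip accounting above reduces to a single sum. A minimal sketch using the same figures (the function name is hypothetical):

```python
def context_after_turn(input_tokens: int, tool_tokens: int, response_tokens: int) -> int:
    """Context size once a turn's tool traffic and reply join the history."""
    return input_tokens + tool_tokens + response_tokens


input_tokens = 15_000 + 45_000 + 30   # system prompt + history + new message
total = context_after_turn(input_tokens, tool_tokens=6_900, response_tokens=500)
print(total)                  # 67430
print(total - input_tokens)   # growth from this single turn: 7400
```

A 30-token request grew the context by 7,400 tokens, almost all of it tool traffic.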
Growth Over a Session#
Turn 1: 15,000 (system) + 1,000 (conversation) = ~16,000 tokens
Turn 10: 15,000 (system) + 25,000 (conversation) = ~40,000 tokens
Turn 30: 15,000 (system) + 80,000 (conversation) = ~95,000 tokens
Turn 50: 15,000 (system) + 150,000 (conversation) = ~165,000 tokens
Turn 55: Auto-compact triggers → summarizes to ~30,000 conversation tokens
Turn 56: 15,000 (system) + 32,000 (conversation) = ~47,000 tokens
...cycle repeats
The sawtooth pattern: context grows linearly, compaction drops it, then it grows again. Each compaction cycle loses some historical detail.
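The growth-and-drop cycle can be simulated in a few lines. This is a simplified sketch, not the actual auto-compact logic: it assumes a constant 3,000 tokens per turn, a 180,000-token trigger on total context (chosen so that compaction fires at turn 55, as in the example above), and a fixed 30,000-token conversation after each compaction.

```python
def simulate(turns: int, system: int = 15_000, per_turn: int = 3_000,
             trigger: int = 180_000, compacted_to: int = 30_000) -> list:
    """Total context size after each turn, compacting when the trigger is reached."""
    conversation = 0
    sizes = []
    for _ in range(turns):
        conversation += per_turn
        if system + conversation >= trigger:
            # Summary plus recent turns replace the older history.
            conversation = compacted_to
        sizes.append(system + conversation)
    return sizes


sizes = simulate(100)
print(sizes[53], sizes[54], sizes[55])  # turns 54-56: 177000 45000 48000
print(max(sizes))                       # peak stays below the trigger: 177000
```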
The Compaction Cycle#
Context
Usage
    ▲
200K│─────────────────────────────── Limit
    │      ╱╲        ╱╲        ╱╲
    │    ╱   ╲     ╱   ╲     ╱   ╲
    │  ╱      ╲  ╱      ╲  ╱      ╲
    │╱         ╲╱        ╲╱        ╲
    └──────────────────────────────→ Time
          Compact   Compact   Compact
Each cycle preserves the system prompt (unchanged) and a summary of previous work. Recent turns are kept verbatim. The effective “memory depth” shrinks with each compaction.
Subagents as Context Management#
How Subagents Help#
Subagents run in their own isolated context windows. By delegating work to a subagent, you keep the results without paying the context cost of the investigation – about 97.5% less main-context usage in the example below. Specifically, they:
- Prevent context pollution – A 40-turn investigation doesn’t bloat your main context
- Provide fresh context – The subagent starts with maximum available space
- Return concise summaries – Only the final result enters your main context
- Are resumable – Can continue work across multiple invocations without re-consuming main context
The math:
| Approach | Main Context Cost |
|---|---|
| 40-turn investigation in main context | ~20,000 tokens |
| Same investigation via subagent | ~500 tokens (summary only) |
| Savings | ~19,500 tokens (97.5%) |
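The savings row in the table follows directly from the other two rows. A short sketch of the calculation (the function name is hypothetical):

```python
def delegation_savings(inline_tokens: int, summary_tokens: int) -> tuple:
    """Tokens saved in the main context, and the saving as a percentage."""
    saved = inline_tokens - summary_tokens
    return saved, 100 * saved / inline_tokens


saved, pct = delegation_savings(inline_tokens=20_000, summary_tokens=500)
print(saved, pct)  # 19500 97.5
```

The percentage depends entirely on how long the summary is relative to the investigation, so a more detailed summary trades some of the saving for fidelity.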
When to Delegate vs Stay in Main Context#
| Scenario | Recommendation | Why |
|---|---|---|
| Quick file read + small edit | Main context | Subagent overhead not worth it |
| Multi-file exploration | Subagent | Exploration consumes lots of context |
| Complex debugging (10+ turns) | Subagent | Keeps investigation isolated |
| Simple question | Main context | Fast, low cost |
| Code review | Subagent | Reads many files, produces structured output |
| Architecture analysis | Subagent | Deep reasoning, many file reads |
Rule of thumb: If the task will involve reading 3+ files or take more than 10 turns, delegate to a subagent.
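The rule of thumb can be written down as a one-line predicate. This is only a restatement of the heuristic above with hypothetical names; actual delegation decisions are made by you or by Claude, not by a function like this.

```python
def should_delegate(files_to_read: int, expected_turns: int) -> bool:
    """Rule of thumb: delegate at 3+ file reads or more than 10 turns."""
    return files_to_read >= 3 or expected_turns > 10


print(should_delegate(files_to_read=1, expected_turns=2))   # quick edit: False
print(should_delegate(files_to_read=6, expected_turns=4))   # multi-file exploration: True
```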
Extended Thinking and Context#
When extended thinking is enabled, thinking tokens count toward the context window during the current turn but are automatically stripped from subsequent turns. This means:
- Thinking tokens don’t accumulate in your conversation history
- They only consume context during the turn they’re generated
- The API handles stripping automatically – you don’t need to manage this
This design prevents thinking tokens from eating into your context budget over time. A turn with 10,000 thinking tokens will briefly use that space, but it’s freed for the next turn.
Exception: During tool use, thinking blocks must be preserved until the tool use cycle completes. They’re stripped after the cycle ends.
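The effect of stripping can be shown with a small accounting sketch. It is a simplified model under stated assumptions: the system prompt is ignored, each turn is treated as a single step (so the tool-use exception above is not modeled), and the dictionary keys `visible` and `thinking` are hypothetical.

```python
def context_per_turn(turns: list) -> list:
    """Context in use during each turn when earlier thinking is stripped.

    Each turn is a dict with "visible" (message, tool, and response tokens
    that stay in history) and "thinking" (thinking tokens for that turn).
    """
    history = 0
    sizes = []
    for turn in turns:
        # Thinking counts only while its own turn is being generated.
        sizes.append(history + turn["visible"] + turn["thinking"])
        # Only the visible part carries over to later turns.
        history += turn["visible"]
    return sizes


turns = [
    {"visible": 2_000, "thinking": 10_000},
    {"visible": 1_500, "thinking": 0},
    {"visible": 3_000, "thinking": 4_000},
]
print(context_per_turn(turns))  # [12000, 3500, 10500]
```

The 10,000 thinking tokens from the first turn do not appear in the second turn's total, which is the behavior described above.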
Why Claude Code Seems Inconsistent#
A common complaint: Claude Code seems to “get dumber” over a long session, or forgets things it should obviously know – like that it already created a branch, or what approach it tried twenty minutes ago. The natural reaction is frustration: you just did this, why are you doing it again?
The explanation is entirely mechanical. There’s no degradation in the model’s reasoning ability – what changes is the information available to it.
Between Sessions: Total Amnesia#
Every new session starts with zero conversation history. The model has no memory of previous sessions except what’s written to persistent files (CLAUDE.md, auto memory, journal entries). If you spent an hour debugging a tricky issue yesterday, Claude knows nothing about it today unless you or it wrote something down.
This is unlike working with a human colleague who retains subconscious patterns and context even when they can’t recall specifics. Claude has literally nothing between sessions unless something wrote it to disk. The memory system – CLAUDE.md, auto memory, episodic memory – exists specifically to compensate for this, but it’s opt-in and lossy.
Within Sessions: Death by Compaction#
Even within a single session, context loss is continuous. Compaction summarizes away details that seemed important at the time. After two or three compaction cycles, Claude may have lost:
- The specific file paths it explored earlier
- Which approaches it tried and rejected
- Exact error messages from failed attempts
- The reasoning behind a choice it made 40 turns ago
It’s not that the model is getting worse – it’s that the desk is being cleared and only a summary remains. The model works with whatever context it has, and after compaction, that context is thinner.
The Humanization Trap#
We instinctively attribute human-like memory to things that converse fluently. When Claude writes articulate code and explanations, it feels like it “knows” things in the way a person does. When it then forgets something obvious, the gap between expectation and reality feels like stupidity.
But the model doesn’t have a bad memory – it has no memory at all beyond its current context window. Every response is generated from scratch using only what’s visible in that window right now. Understanding this reframes the problem from “Claude is unreliable” to “I need to manage what’s in the context window.”
What You Can Do About It#
- Write important decisions to files – CLAUDE.md instructions survive compaction and session boundaries
- Use auto memory and journals – These persist across sessions and give future sessions a head start
- Front-load context in new sessions – Reference relevant files and state your goals explicitly at session start
- Compact strategically – Use /compact with custom instructions to preserve specific details
- Keep sessions focused – One task per session means less context pressure and fewer compactions
The model’s reasoning capability is constant. What varies is how much relevant information it can see. Managing that visibility is the primary skill that separates effective Claude Code usage from frustrating sessions.
Practical Tips#
- Watch the context indicator – Claude Code shows context usage. Pay attention to it approaching limits.
- Use /compact before big tasks – If you’re about to start something that will read many files, compact first to maximize available space.
- Delegate exploration – When you say “find all the places where X is used”, that’s an exploration task. Use the Explore subagent to avoid dumping search results into your main context.
- Read files strategically – Use offset and limit parameters on the Read tool to read only the sections you need.
- Start new sessions for new tasks – Don’t try to do everything in one session. Context accumulation across unrelated tasks wastes space.
- Don’t fight compaction – Auto-compact is designed to keep sessions running. If you need to preserve specific context, use manual /compact with custom instructions.
- Use the system prompt wisely – Instructions in CLAUDE.md persist across compactions. If there’s something Claude must always know during a session, put it in CLAUDE.md rather than repeating it in messages (which can be compacted away).
- Save state to memory before long sessions – If you’re approaching what might be a compaction, save important decisions and state to your memory files. Memory files survive compaction because they’re re-read from disk, not from conversation history.
References#
- Context Windows (Anthropic Docs) – Context window sizes, long context, context awareness
- Compaction (Anthropic Docs) – Server-side compaction API
- Context Editing (Anthropic Docs) – Tool result clearing, thinking block clearing
- System Prompt Article – What occupies the fixed portion of your context
- Token Optimization Article – Reducing system prompt overhead
- Prompt Caching Article – Cost reduction (distinct from context space)
- Extension Mechanisms Article – Subagents for context isolation