Context Management: Working Within the Token Budget#

Executive Summary#

The context window is Claude’s working memory – everything the model can reference when generating a response. In Claude Code, it fills with the system prompt, conversation history, tool results, and file contents. Managing this space is the single most important factor in maintaining effective sessions as they grow longer.

| Model | Standard Window | Extended (Beta) | Long Context Pricing |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 200K tokens | 1M tokens | 2x input, 1.5x output above 200K |
| Claude Sonnet 4.6 | 200K tokens | 1M tokens | 2x input, 1.5x output above 200K |
| Claude Sonnet 4.5 | 200K tokens | – | – |
| Claude Sonnet 4 | 200K tokens | – | – |
| Claude Haiku 4.5 | 200K tokens | – | – |

The 1M token context window is currently available in beta on the API only, gated behind an opt-in beta header. Standard claude.ai and Claude Code sessions use the 200K window.

Key insight: Prompt caching reduces the cost of repeated content, but every token still occupies context window space. You can afford a 20,000-token system prompt financially, but those 20,000 tokens are unavailable for conversation content regardless.
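
To see this concretely, here is a minimal sketch using the Anthropic Python SDK: the large system prompt is cached, which cuts its price on later calls, but the usage numbers show it still occupies the window every time. The model ID and prompt contents are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LARGE_SYSTEM_PROMPT = "..."  # stand-in for ~20,000 tokens of instructions

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Caching discounts the price of re-sending these tokens...
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the coding guidelines."}],
)

# ...but cached tokens still count toward the context window exactly
# like uncached input tokens.
u = response.usage
print(u.input_tokens, u.cache_creation_input_tokens, u.cache_read_input_tokens)
```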

What the Context Window Is#

The context window is the total space available for all input and output in a single API call. It includes everything Claude can “see” when generating a response: the system prompt, the full conversation history, tool results, and the response itself.

Think of it as a fixed-size desk. The system prompt is a permanent stack of papers that never leaves. Every message you send and every response Claude gives adds more papers. Every file read and tool result adds more. When the desk fills up, something has to go.

How It Fills Up#

Every API call in a Claude Code session sends:

Context Window (200K tokens)
┌──────────────────────────────────────────────────┐
│ System prompt (fixed)             ~15,000-20,000 │
│ ┌──────────────────────────────────────────────┐ │
│ │ Conversation history (grows)                 │ │
│ │ ├── User message 1                           │ │
│ │ ├── Assistant response 1 (+ tool calls)      │ │
│ │ ├── Tool results 1 (file contents, etc.)     │ │
│ │ ├── User message 2                           │ │
│ │ ├── Assistant response 2                     │ │
│ │ ├── ...                                      │ │
│ │ └── Latest user message                      │ │
│ └──────────────────────────────────────────────┘ │
│ Response output (current turn)                   │
└──────────────────────────────────────────────────┘

The system prompt is constant. The conversation history grows with every turn. The response output needs room too. The usable space for conversation shrinks over time.

What Consumes Context#

Not all content is equal. Some things consume far more tokens than expected:

| Content Type | Typical Size | Notes |
| --- | --- | --- |
| System prompt | 12,000-20,000 tokens | Fixed overhead every message (see system prompt article) |
| User message | 10-200 tokens | Your typed input |
| Assistant response | 100-2,000 tokens | Explanations, reasoning |
| Tool call + result | Varies widely | A Read of a 500-line file can be 5,000+ tokens |
| File read (@-mention) | 100-10,000+ tokens | Entire file contents injected |
| Grep/Glob results | 100-5,000 tokens | Depends on match count |
| Web search results | 500-3,000 tokens | Search snippets |
| System reminders | 50-500 tokens | Hook outputs, plugin status |

The biggest consumers are tool results – especially file reads. Reading a large file dumps its entire contents into the context. In an active coding session, multiple file reads and their associated tool call metadata can consume the majority of your context budget.

A Typical Session’s Context Budget#

For a 200K context window with a 15,000-token system prompt:

Available for conversation: 200,000 - 15,000 = 185,000 tokens

At ~100 tokens per typical message exchange:
  Theoretical maximum: ~1,850 simple back-and-forth turns

In practice with tool use (file reads, edits, searches):
  ~200-500 tokens per tool call + result
  ~50-100 meaningful tool-using turns before approaching limits

Real sessions hit limits faster than expected because file reads and search results are token-expensive. A single Read of a large file can consume as much context as 50 simple messages.
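
The arithmetic above generalizes to a one-line estimate. A minimal sketch in Python, using this article's illustrative figures as defaults:

```python
def turns_remaining(window: int = 200_000,
                    system_prompt: int = 15_000,
                    used_conversation: int = 0,
                    tokens_per_turn: int = 2_000) -> int:
    """Rough count of tool-using turns left before compaction territory.

    tokens_per_turn is an assumed per-turn average (message + tool calls
    + results); real turns vary wildly, so treat this as an
    order-of-magnitude guide.
    """
    available = window - system_prompt - used_conversation
    return max(available // tokens_per_turn, 0)

print(turns_remaining())                            # 92 on a fresh session
print(turns_remaining(used_conversation=150_000))   # 17 late in a session
```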

Context Awareness#

Claude Sonnet 4.5 and Haiku 4.5 have a built-in feature called context awareness – the model tracks its remaining context budget throughout a conversation.

At session start, Claude receives its total budget:

<budget:token_budget>200000</budget:token_budget>

After each tool call, Claude receives an update:

<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>

This allows Claude to make informed decisions about how to use remaining context – whether to read a large file, delegate to a subagent, or wrap up work before running out of space.
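
The tags are plain text in the transcript. If you were post-processing session logs yourself, pulling out the remaining budget is a one-line regex; a minimal sketch, assuming the tag format shown above:

```python
import re

WARNING_RE = re.compile(r"Token usage: \d+/\d+; (\d+) remaining")

def remaining_tokens(transcript: str) -> int | None:
    """Extract the remaining budget from the last <system_warning> tag."""
    matches = WARNING_RE.findall(transcript)
    return int(matches[-1]) if matches else None

warning = "<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>"
print(remaining_tokens(warning))  # 165000
```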

Note: Opus 4.6 does not currently have context awareness – it does not receive token budget updates. This means Opus may be less efficient at self-managing context in very long sessions.

When Context Runs Out: Compaction#

How Compaction Works#

When the conversation approaches the context window limit, Claude Code automatically summarizes older parts of the conversation to free up space. This is called compaction.

The process:

  1. Input tokens exceed a trigger threshold (default: ~150K tokens)
  2. Claude generates a summary of the conversation so far
  3. The summary replaces the original conversation history
  4. The session continues with the compressed context

Before compaction:
┌──────────────────────────────────────────────────┐
│ System prompt                          15,000    │
│ Turn 1: user + assistant + tools       12,000    │
│ Turn 2: user + assistant + tools        8,000    │
│ ...                                              │
│ Turn 47: user + assistant + tools       5,000    │
│ Turn 48: current                        3,000    │
│                                    ──────────    │
│ Total:                               ~180,000    │ ← Approaching limit
└──────────────────────────────────────────────────┘

After compaction:
┌──────────────────────────────────────────────────┐
│ System prompt                          15,000    │
│ [Summary of turns 1-45]                 3,000    │
│ Turn 46: user + assistant + tools       6,000    │
│ Turn 47: user + assistant + tools       5,000    │
│ Turn 48: current                        3,000    │
│                                    ──────────    │
│ Total:                                ~32,000    │ ← Lots of room again
└──────────────────────────────────────────────────┘
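
Mechanically, this reduces to a threshold check followed by a summarize-and-replace step. A minimal sketch; the trigger value and the keep-recent-turns-verbatim behavior mirror the description above, but the constants and the summarize() placeholder are assumptions, not Claude Code internals:

```python
def should_compact(input_tokens: int, trigger: int = 150_000) -> bool:
    # Compaction fires when accumulated input crosses the trigger
    # threshold, well before the hard 200K window limit.
    return input_tokens >= trigger

def summarize(turns: list[str]) -> str:
    # Stand-in for the model-generated summary pass.
    return f"[Summary of {len(turns)} earlier turns]"

def compact(history: list[str], keep_recent: int = 3) -> list[str]:
    # Replace everything except the most recent turns with one summary.
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

print(compact([f"turn {i}" for i in range(1, 49)]))
# ['[Summary of 45 earlier turns]', 'turn 46', 'turn 47', 'turn 48']
```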

Auto-Compact in Claude Code#

Claude Code handles compaction automatically. You don’t need to configure anything. When the context usage reaches roughly 75-92% (depending on internal heuristics), auto-compact triggers and summarizes the conversation.

You’ll see a message like:

Auto-compact: Summarizing conversation to free up context...

This is normal and expected in long sessions. The session continues without interruption – you don’t lose your place.

Manual Compaction#

You can also trigger compaction manually at any time using the /compact command:

/compact                              # Default summarization
/compact Focus on the API changes     # Custom instructions

Manual compaction is available when context usage is above ~70%. Use it proactively when:

  • You’re about to start a large task and want maximum context space
  • The conversation has accumulated lots of irrelevant history
  • You want to control what gets preserved in the summary

You can also influence compaction behavior through CLAUDE.md instructions to ensure critical context survives summarization.
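
For example, a CLAUDE.md entry along these lines (illustrative wording – there is no special syntax) tells the summarizer what must survive:

```markdown
## Compaction guidance

When summarizing this conversation, always preserve:
- The current git branch and any uncommitted changes
- The exact commands used to run the test suite
- API design decisions and the reasoning behind them
```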

What Compaction Preserves and Loses#

Compaction is a lossy process. The summary captures the gist of the conversation but loses:

| Preserved | Lost |
| --- | --- |
| Key decisions and their reasoning | Exact wording of earlier exchanges |
| Current state of the task | Detailed file contents from earlier reads |
| Recent turns (kept verbatim) | Intermediate debugging steps |
| Important technical details | Tool call metadata |
| Next steps and pending work | Exploration paths that were abandoned |

This is the fundamental trade-off: compaction lets sessions run indefinitely, but at the cost of detail from earlier in the conversation. The more compactions occur, the more historical detail is lost.

Strategies for Context Efficiency#

Reduce System Prompt Overhead#

The system prompt is a fixed cost on every message. Reducing it frees space for conversation:

  • Disable unused plugins – Each plugin adds skills and subagent descriptions (see token optimization article)
  • Keep CLAUDE.md concise – Every line is re-sent every message
  • Remove unused MCP servers – Each adds tool definitions

A 20K system prompt leaves 180K for conversation. A 12K system prompt leaves 188K. That’s 8K more tokens – roughly 2-3 more file reads.

Use Subagents to Offload Work#

Subagents (the Task tool) run in their own isolated context windows. This means their work doesn’t consume your main context:

Main context (200K)          Subagent context (200K)
┌────────────────────┐       ┌─────────────────────┐
│ System prompt      │       │ Subagent prompt     │
│ Conversation       │  ──→  │ 40 turns of         │
│ "Delegate task"    │       │ investigation       │
│ ← Summary result   │  ←──  │ Detailed findings   │
│ Continue working   │       └─────────────────────┘
└────────────────────┘

A 40-turn investigation that would consume ~20K tokens in your main context becomes a single summary result of ~500 tokens. See subagents as context management below.

Be Selective with File Reads#

File reads are the largest variable context consumer. Strategies:

  • Read specific line ranges instead of entire files when you only need a section
  • Use Grep first to find relevant lines, then read only those areas
  • Avoid re-reading files you’ve already seen unless they’ve changed
  • Use Glob to find files before reading – don’t read files speculatively

A 1,000-line file consumes roughly 8,000-10,000 tokens. Reading 5 such files uses ~40,000-50,000 tokens – a quarter of a 200K context window.
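
A quick way to apply these strategies is to size a file before reading it. A minimal sketch; the ~4-characters-per-token heuristic and the 2,000-token cutoff are assumptions, not Claude Code internals:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4   # rough heuristic for prose and code
READ_BUDGET = 2_000   # assumed per-read budget

def estimated_tokens(path: str) -> int:
    return len(Path(path).read_text(errors="replace")) // CHARS_PER_TOKEN

def read_plan(path: str) -> str:
    est = estimated_tokens(path)
    if est <= READ_BUDGET:
        return f"{path}: ~{est} tokens, read the whole file"
    return f"{path}: ~{est} tokens, grep first and read a line range"

print(read_plan("main.go"))  # hypothetical file
```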

Keep Conversations Focused#

Context accumulates faster in unfocused sessions:

  • Start new sessions for unrelated tasks – Don’t reuse a session for completely different work
  • Commit and start fresh when switching between distinct features
  • Use /compact proactively before starting a new phase of work
  • Be specific in requests – Vague requests lead to more exploratory tool calls that consume context

Use the 1M Context Window#

For sessions that will be particularly long or context-heavy, the 1M token context window provides 5x the standard capacity. This is available in beta for Opus 4.6 and Sonnet 4.6, accessible via the API with the beta header enabled.

Trade-off: Tokens above 200K are billed at 2x input and 1.5x output pricing. For very long sessions, the extra cost may be worth avoiding compaction and its associated information loss. Note that the 1M window is API-only for now – standard claude.ai and Claude Code users remain on the 200K window.
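
On the API, opting in looks like the following with the Anthropic Python SDK. The beta flag shown is the published long-context header for Sonnet 4; the model ID is illustrative, and the exact flag for newer models should be checked against the current docs:

```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5",        # illustrative model ID
    betas=["context-1m-2025-08-07"],  # long-context beta header
    max_tokens=2048,
    messages=[{"role": "user", "content": "..."}],
)

# Tokens above 200K in this request are billed at the long-context rate.
print(response.usage.input_tokens)
```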

How Context Flows in Claude Code#

A Single Message Round-Trip#

When you send a message, here’s what happens to the context:

You type: "Read the main.go file and add error handling to the HTTP handler"

API call sent:
  System prompt:                    15,000 tokens (cached)
  Previous conversation history:    45,000 tokens (partially cached)
  Your new message:                     30 tokens
                                   ──────────
  Total input:                      60,030 tokens

Claude responds with tool calls:
  [Read main.go]                       → file contents returned: 3,000 tokens
  [Edit main.go]                       → edit confirmation: 200 tokens
  [Read main.go again for verification] → file contents: 3,200 tokens
  Text explanation:                     500 tokens

New context size after this turn:
  Previous:                         60,030 tokens
  + Tool calls and results:          6,400 tokens
  + Response text:                     500 tokens
                                   ──────────
  Total:                            66,930 tokens

Each turn adds the user message, all tool calls and their results, and the assistant’s response to the conversation history. This accumulates quickly.

Growth Over a Session#

Turn  1: 15,000 (system) +  1,000 (conversation) = ~16,000 tokens
Turn 10: 15,000 (system) + 25,000 (conversation) = ~40,000 tokens
Turn 30: 15,000 (system) + 80,000 (conversation) = ~95,000 tokens
Turn 50: 15,000 (system) + 150,000 (conversation) = ~165,000 tokens
Turn 55: Auto-compact triggers → summarizes to ~30,000 conversation tokens
Turn 56: 15,000 (system) + 32,000 (conversation) = ~47,000 tokens
...cycle repeats

The sawtooth pattern: context grows linearly, compaction drops it, then it grows again. Each compaction cycle loses some historical detail.

The Compaction Cycle#

Context
Usage
200K ┤─────────────────────────────── Limit
     │    ╱╲        ╱╲        ╱╲
     │   ╱  ╲      ╱  ╲      ╱  ╲
     │  ╱    ╲    ╱    ╲    ╱    ╲
     │ ╱      ╲  ╱      ╲  ╱      ╲
     │╱        ╲╱        ╲╱        ╲
     └──────────────────────────────→ Time
            Compact   Compact

Each cycle preserves the system prompt (unchanged) and a summary of previous work. Recent turns are kept verbatim. The effective “memory depth” shrinks with each compaction.
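
A toy simulation reproduces the sawtooth. All constants are the article's illustrative figures, not Claude Code internals:

```python
# Linear growth per turn; compaction when the trigger is crossed.
SYSTEM, TRIGGER, PER_TURN, SUMMARY = 15_000, 165_000, 3_000, 30_000

conversation = 0
for turn in range(1, 121):
    conversation += PER_TURN
    if SYSTEM + conversation >= TRIGGER:
        print(f"turn {turn:3}: compact {conversation:,} -> {SUMMARY:,}")
        conversation = SUMMARY
print(f"final context: {SYSTEM + conversation:,} tokens")
# turn  50: compact 150,000 -> 30,000
# turn  90: compact 150,000 -> 30,000
# final context: 135,000 tokens
```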

Subagents as Context Management#

How Subagents Help#

Subagents run in their own isolated context windows. By delegating work to a subagent, you keep the results without paying the context cost of the investigation – in the example below, a ~97.5% savings. Specifically, subagents:

  1. Prevent context pollution – A 40-turn investigation doesn’t bloat your main context
  2. Provide fresh context – The subagent starts with maximum available space
  3. Return concise summaries – Only the final result enters your main context
  4. Are resumable – Can continue work across multiple invocations without re-consuming main context

The math:

| Approach | Main Context Cost |
| --- | --- |
| 40-turn investigation in main context | ~20,000 tokens |
| Same investigation via subagent | ~500 tokens (summary only) |
| Savings | ~19,500 tokens (97.5%) |

When to Delegate vs Stay in Main Context#

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Quick file read + small edit | Main context | Subagent overhead not worth it |
| Multi-file exploration | Subagent | Exploration consumes lots of context |
| Complex debugging (10+ turns) | Subagent | Keeps investigation isolated |
| Simple question | Main context | Fast, low cost |
| Code review | Subagent | Reads many files, produces structured output |
| Architecture analysis | Subagent | Deep reasoning, many file reads |

Rule of thumb: If the task will involve reading 3+ files or take more than 10 turns, delegate to a subagent.
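
The rule of thumb is easy to encode. A trivial sketch:

```python
def should_delegate(files_to_read: int, estimated_turns: int) -> bool:
    # Encodes the rule of thumb above: 3+ file reads or more than
    # ~10 turns of work belongs in a subagent's isolated context.
    return files_to_read >= 3 or estimated_turns > 10

print(should_delegate(files_to_read=1, estimated_turns=2))  # False: stay in main context
print(should_delegate(files_to_read=5, estimated_turns=4))  # True: delegate
```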

Extended Thinking and Context#

When extended thinking is enabled, thinking tokens count toward the context window during the current turn but are automatically stripped from subsequent turns. This means:

  • Thinking tokens don’t accumulate in your conversation history
  • They only consume context during the turn they’re generated
  • The API handles stripping automatically – you don’t need to manage this

This design prevents thinking tokens from eating into your context budget over time. A turn with 10,000 thinking tokens will briefly use that space, but it’s freed for the next turn.

Exception: During tool use, thinking blocks must be preserved until the tool use cycle completes. They’re stripped after the cycle ends.
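
On the API, extended thinking is enabled per request with a token budget. A minimal sketch with the Anthropic Python SDK; the model ID and budget are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model ID
    max_tokens=16_000,           # must exceed the thinking budget
    # Thinking tokens count toward this turn's context, then are
    # stripped from history on subsequent turns.
    thinking={"type": "enabled", "budget_tokens": 10_000},
    messages=[{"role": "user", "content": "Plan the refactor step by step."}],
)

for block in response.content:
    if block.type == "thinking":
        pass  # reasoning for this turn only; not carried forward
    elif block.type == "text":
        print(block.text)
```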

Why Claude Code Seems Inconsistent#

A common complaint: Claude Code seems to “get dumber” over a long session, or forgets things it should obviously know – like that it already created a branch, or what approach it tried twenty minutes ago. The natural reaction is frustration: you just did this, why are you doing it again?

The explanation is entirely mechanical. There’s no degradation in the model’s reasoning ability – what changes is the information available to it.

Between Sessions: Total Amnesia#

Every new session starts with zero conversation history. The model has no memory of previous sessions except what’s written to persistent files (CLAUDE.md, auto memory, journal entries). If you spent an hour debugging a tricky issue yesterday, Claude knows nothing about it today unless you or it wrote something down.

This is unlike working with a human colleague who retains subconscious patterns and context even when they can’t recall specifics. Claude has literally nothing between sessions unless something wrote it to disk. The memory system – CLAUDE.md, auto memory, episodic memory – exists specifically to compensate for this, but it’s opt-in and lossy.

Within Sessions: Death by Compaction#

Even within a single session, context loss is continuous. Compaction summarizes away details that seemed important at the time. After two or three compaction cycles, Claude may have lost:

  • The specific file paths it explored earlier
  • Which approaches it tried and rejected
  • Exact error messages from failed attempts
  • The reasoning behind a choice it made 40 turns ago

It’s not that the model is getting worse – it’s that the desk is being cleared and only a summary remains. The model works with whatever context it has, and after compaction, that context is thinner.

The Humanization Trap#

We instinctively attribute human-like memory to things that converse fluently. When Claude writes articulate code and explanations, it feels like it “knows” things in the way a person does. When it then forgets something obvious, the gap between expectation and reality feels like stupidity.

But the model doesn’t have a bad memory – it has no memory at all beyond its current context window. Every response is generated from scratch using only what’s visible in that window right now. Understanding this reframes the problem from “Claude is unreliable” to “I need to manage what’s in the context window.”

What You Can Do About It#

  • Write important decisions to files – CLAUDE.md instructions survive compaction and session boundaries
  • Use auto memory and journals – These persist across sessions and give future sessions a head start
  • Front-load context in new sessions – Reference relevant files and state your goals explicitly at session start
  • Compact strategically – Use /compact with custom instructions to preserve specific details
  • Keep sessions focused – One task per session means less context pressure and fewer compactions

The model’s reasoning capability is constant. What varies is how much relevant information it can see. Managing that visibility is the primary skill that separates effective Claude Code usage from frustrating sessions.

Practical Tips#

  1. Watch the context indicator – Claude Code shows context usage. Pay attention to it approaching limits.

  2. Use /compact before big tasks – If you’re about to start something that will read many files, compact first to maximize available space.

  3. Delegate exploration – When you say “find all the places where X is used”, that’s an exploration task. Use the Explore subagent to avoid dumping search results into your main context.

  4. Read files strategically – Use offset and limit parameters on the Read tool to read only the sections you need.

  5. Start new sessions for new tasks – Don’t try to do everything in one session. Context accumulation across unrelated tasks wastes space.

  6. Don’t fight compaction – Auto-compact is designed to keep sessions running. If you need to preserve specific context, use manual /compact with custom instructions.

  7. Use the system prompt wisely – Instructions in CLAUDE.md persist across compactions. If there’s something Claude must always know during a session, put it in CLAUDE.md rather than repeating it in messages (which can be compacted away).

  8. Save state to memory before long sessions – If you’re approaching what might be a compaction, save important decisions and state to your memory files. Memory files survive compaction because they’re re-read from disk, not from conversation history.
