Extended Thinking: How Claude Reasons Through Complex Problems#

Executive Summary#

Extended thinking gives Claude additional tokens to reason before responding. On Opus 4.6 and Sonnet 4.6, thinking is adaptive: Claude decides how much to think based on task complexity. Thinking tokens are billed as output tokens ($25/MTok on Opus 4.6), making thinking depth one of the two biggest cost levers, alongside model selection. On Opus 4.6, effort levels (low/medium/high/max) control how much Claude thinks; Sonnet 4.6 supports low/medium/high.

| Aspect | Details |
| --- | --- |
| Default state | Enabled by default in Claude Code |
| Opus 4.6 / Sonnet 4.6 mode | Adaptive (dynamic depth based on complexity) |
| Other models | Manual (fixed budget via `budget_tokens`) |
| Default budget | 31,999 tokens (configurable via `MAX_THINKING_TOKENS`) |
| Billing | Thinking tokens billed as output tokens |
| Visibility | Summarized view; Ctrl+O for verbose thinking text |

How Extended Thinking Works#

Why Intermediate Tokens Help#

Token generation is autoregressive: each token is predicted from all prior tokens in the context window. When Claude generates intermediate reasoning tokens, those tokens become context for the tokens that follow. The model is using its own output as working memory.

A direct prompt-to-answer jump gives the final answer a short context to condition on. A chain of intermediate steps – exploring an approach, identifying a constraint, reconsidering – gives the final answer more to work from. This is why thinking improves quality on problems that require multiple dependent steps, and why it has no effect on problems where the answer is immediate.

There is no separate reasoning system running in parallel. The thinking phase and the response phase use the same next-token prediction mechanism. The difference is that thinking tokens are generated first, accumulate in context, and are then available when the response tokens are generated.

The Thinking Process#

When thinking is enabled, Claude generates internal reasoning before crafting its response:

```
User prompt arrives
         │
         ▼
┌─────────────────────┐
│  Thinking Phase     │  Claude generates intermediate reasoning:
│  (thinking tokens)  │  - Explores approaches
│                     │  - Identifies constraints and edge cases
│                     │  - Backtracks from dead ends
│                     │  - Commits to an approach
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Summary Phase      │  Full thinking summarized
│  (no extra charge)  │  for user visibility
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Response Phase     │  Final answer conditioned
│  (output tokens)    │  on all prior thinking tokens
└─────────────────────┘
```

The thinking phase is where quality differences appear. On problems that require multiple dependent steps, thinking tokens give the response tokens more context to condition on. On problems that don’t – simple lookups, mechanical edits – thinking tokens add cost without improving output.
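
Outside Claude Code, these phases are visible in the raw Messages API response as typed content blocks. A minimal sketch with curl and jq, assuming a manual-mode model (the model ID and prompt here are placeholders; adjust to your model list):

```bash
# Enable extended thinking explicitly on a manual-budget model
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 8000,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
    "messages": [{"role": "user", "content": "Plan a zero-downtime database migration."}]
  }' \
| jq '[.content[].type]'   # thinking block(s) first, then "text"
```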

Adaptive Thinking#

Opus 4.6 and Sonnet 4.6 use adaptive thinking by default. Instead of a fixed budget, Claude decides how much to think based on the complexity of each request:

  • Simple requests (rename a variable, fix a typo): minimal or no thinking
  • Moderate requests (implement a function, fix a bug): moderate thinking
  • Complex requests (architect a system, debug a race condition): deep thinking

This replaces the manual budget_tokens approach used on earlier models. You don’t need to estimate how many thinking tokens a task needs – Claude adjusts automatically.

Combined with effort levels, adaptive thinking gives you a spectrum from fast/cheap to thorough/expensive without manual tuning.

Summarized Thinking#

Claude 4 models return a summarized version of the thinking, not the raw thinking output:

  • You see a summary of the key reasoning steps
  • You are billed for the full thinking tokens, not the summary
  • The summary is generated by a separate model at no extra charge
  • The thinking model does not see the summarized output

The billed output token count will be higher than the visible token count. This is expected.

In Claude Code, toggle verbose mode (Ctrl+O) to see the thinking text as gray italic text in the transcript.
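
To see the gap directly, compare the billed usage.output_tokens against the length of the text blocks you actually received. A rough sketch reusing the request shape above (field names follow the standard Messages API response; exact numbers will vary):

```bash
# Capture a response that used extended thinking
response=$(curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-5", "max_tokens": 8000,
       "thinking": {"type": "enabled", "budget_tokens": 4000},
       "messages": [{"role": "user", "content": "Plan a zero-downtime database migration."}]}')

# Billed output tokens (full thinking + response) vs. the text you can see
echo "$response" | jq '.usage.output_tokens'
echo "$response" | jq -r '.content[] | select(.type == "text") | .text' | wc -c
```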

Interleaved Thinking#

With tool use, Claude can think between tool calls, including between each tool result:

Think → Call tool → Read result → Think again → Call another tool → Think → Respond

On Opus 4.6 with adaptive thinking, interleaved thinking is automatic. On earlier models, it requires a beta header. In Claude Code, this is handled transparently – you don’t need to configure anything.

Interleaved thinking is useful for multi-step tasks where each tool result changes what to do next. Claude can reconsider its approach after seeing actual file contents or command output rather than committing to a plan upfront.
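
On manual-mode models, the API opt-in is a beta header. A sketch with a single illustrative tool (the tool definition is made up for this example; the beta name below is the one documented for Claude 4 models at the time of writing):

```bash
# Opt in to interleaved thinking on a manual-mode model (automatic on Opus 4.6)
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: interleaved-thinking-2025-05-14" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": "Compare the weather in Paris and Tokyo."}]
  }'
```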

Effort Levels#

Available Levels#

| Level | Thinking Behavior | Speed | Cost | Availability |
| --- | --- | --- | --- | --- |
| `max` | No depth limit, absolute maximum | Slowest | Highest | Opus 4.6 only |
| `high` | Almost always thinks deeply (default) | Slow | High | All models |
| `medium` | May skip thinking for simple queries | Medium | Medium | All models |
| `low` | Minimizes or skips thinking | Fast | Low | All models |

`max` is exclusive to Opus 4.6 and errors on other models.

Effort is a behavioral signal, not a strict token budget. At lower effort, Claude still thinks on genuinely difficult problems – it just thinks less than it would at higher effort for the same problem.

What Effort Controls#

Effort affects all tokens in the response, including non-thinking output:

| Aspect | Low Effort | High Effort |
| --- | --- | --- |
| Thinking depth | Minimal or skipped for simple tasks | Deep reasoning on most tasks |
| Tool calls | Fewer, combined operations | More, thorough exploration |
| Explanations | Terse confirmations | Detailed plans and summaries |
| Code comments | Minimal | Comprehensive |
| Action style | Proceeds directly | Explains approach before acting |

Setting Effort in Claude Code#

Three methods, in order of priority:

  • /model command: Use left/right arrow keys to adjust the effort slider when selecting a model.
  • Environment variable: CLAUDE_CODE_EFFORT_LEVEL=low|medium|high
  • Settings file: Set effortLevel in your settings JSON.
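
For example, a minimal settings entry (typically ~/.claude/settings.json or a project's .claude/settings.json; this is the lowest-priority of the three methods):

```json
{
  "effortLevel": "medium"
}
```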

Configuration#

Toggle Thinking On/Off#

| Method | Scope | Details |
| --- | --- | --- |
| Option+T / Alt+T | Current session | Toggles thinking for this session only |
| /config | Permanent | Saved as `alwaysThinkingEnabled` in settings |
| Ctrl+O | Display only | Shows thinking text in verbose mode |

MAX_THINKING_TOKENS#

Controls the thinking token budget for manual-mode models:

| Setting | Value |
| --- | --- |
| Default | 31,999 tokens |
| Maximum | 63,999 tokens |
| Minimum | 1,024 tokens |
| Disable thinking | Set to 0 |

```bash
# Temporary (session only)
MAX_THINKING_TOKENS=63999 claude

# Permanent
export MAX_THINKING_TOKENS=63999  # in ~/.zshrc or ~/.bashrc
```

On Opus 4.6, MAX_THINKING_TOKENS is ignored because adaptive thinking controls depth dynamically. Exception: setting it to 0 still disables thinking entirely.

Thinking Budget vs Output Budget#

  • budget_tokens must be less than max_tokens (standard mode)
  • With interleaved thinking (tool use), budget_tokens can exceed max_tokens
  • Opus 4.5 and 4.6: up to 128K output tokens
  • Other models: up to 64K output tokens

The budget is a target, not a strict limit – actual usage varies by task.
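
Concretely, in standard mode the thinking budget must leave room for the response inside max_tokens. A sketch with illustrative numbers:

```bash
# Standard mode: budget_tokens (10,000) < max_tokens (16,000),
# leaving up to ~6,000 tokens for the visible response
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [{"role": "user", "content": "Design a rate limiter for a public API."}]
  }'
```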

Context Window Interaction#

Thinking tokens interact with the context window differently depending on the model:

Opus 4.5 and later (including 4.6): Thinking blocks from previous turns are preserved in context. Thinking tokens consume context window space across turns.

Earlier models: Thinking blocks are stripped from context between turns. Only the final response carries forward.

With tool use (all models):

context = input tokens + previous thinking tokens + tool tokens + new thinking + response

Without tool use:

context = input tokens + new thinking + response   (previous thinking tokens are stripped)
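
A worked example with round numbers (10K-token prompt, 8K of thinking on the previous turn, 2K of tool output, 6K of new thinking, 1.5K response):

```
With tool use:     10,000 + 8,000 + 2,000 + 6,000 + 1,500 = 27,500 tokens of context
Without tool use:  10,000 + 6,000 + 1,500 = 17,500 tokens
                   (the 8,000 previous thinking tokens are stripped)
```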

Thinking in Subagents#

Each subagent has its own context window and thinking budget. Considerations:

  • Set model: haiku or model: sonnet on subagents that don’t need deep reasoning
  • CLAUDE_CODE_SUBAGENT_MODEL overrides all subagent model settings
  • Low effort is recommended for subagents doing simple tasks (research, file searching)
  • Thinking adds latency – for parallel subagent work, lower effort means faster results
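
For example, a search-focused subagent pinned to a cheaper model, using Claude Code's agent-file frontmatter (the agent name, description, and prompt are made up for illustration; such files typically live in .claude/agents/):

```markdown
---
name: repo-searcher
description: Finds files and symbols relevant to a query. Use for lookups, not deep analysis.
model: haiku
---

Search the repository for the requested files or symbols and return
matching paths with one-line summaries. Do not attempt deep analysis.
```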

When to Use Extended Thinking#

The deciding factor is whether the task requires multiple dependent reasoning steps. If the answer requires working through constraints, dependencies, or tradeoffs that build on each other, thinking tokens improve output. If the answer is deterministic or immediate, they don’t.

Tasks That Benefit#

| Task Type | Why Thinking Helps |
| --- | --- |
| Architecture decisions | Dependencies between components require multi-step analysis |
| Complex debugging | Hypothesis → test → revise cycles benefit from working memory |
| Implementation planning | Sequencing work correctly requires tracking many constraints |
| Algorithm design | Edge cases interact; reasoning through them compounds |
| Security review | Data flows require tracing across many steps |
| Multi-file refactoring | Cross-file dependencies need to be tracked simultaneously |

Tasks Where It’s Overkill#

| Task Type | Why Thinking Adds No Value |
| --- | --- |
| Simple file edits | No ambiguity to reason about |
| Formatting changes | Mechanical transformation, no judgment needed |
| Find-and-replace | Deterministic operation |
| File reads and searches | No decision-making involved |
| Quick lookups | Answer is immediate, no reasoning chain needed |

On Opus 4.6 with adaptive thinking, Claude already allocates less thinking to simple tasks. For explicit control, use low effort for routine work and high or max for complex tasks.

Cost Management#

How Thinking Tokens Are Billed#

Thinking tokens are billed as output tokens at the model’s output rate:

| Model | Output Rate (including thinking) | Cache Read |
| --- | --- | --- |
| Opus 4.6 | $25/MTok | $0.50/MTok |
| Sonnet 4.5 | $15/MTok | $0.30/MTok |
| Haiku 4.5 | $5/MTok | $0.10/MTok |

A request that generates 10,000 thinking tokens + 2,000 response tokens costs the same as 12,000 output tokens.
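
As a quick check at the Opus 4.6 output rate:

```
10,000 thinking + 2,000 response = 12,000 billed output tokens
12,000 × ($25 / 1,000,000)       ≈ $0.30 for the request
```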

Cost Control Levers#

| Lever | Effect | How to Set |
| --- | --- | --- |
| Effort level | Controls thinking depth (biggest lever) | /model slider, env var, settings |
| Model selection | Sonnet at $15/MTok vs Opus at $25/MTok | /model command |
| Disable thinking | Zero thinking tokens | Option+T / Alt+T, /config |
| MAX_THINKING_TOKENS | Cap thinking budget (non-Opus-4.6 models) | Environment variable |
| Subagent model choice | Use cheaper models for simple delegated tasks | Per-agent `model:` field |

Cost Estimates#

Rough estimates for a 100-message Opus 4.6 session:

| Effort Level | Avg Thinking Tokens/Msg | Thinking Cost (100 msgs) | Total Session Cost |
| --- | --- | --- | --- |
| max | ~20,000 | ~$50 | ~$60-75 |
| high | ~10,000 | ~$25 | ~$35-50 |
| medium | ~5,000 | ~$12.50 | ~$20-30 |
| low | ~1,000 | ~$2.50 | ~$10-15 |

Actual costs vary based on task complexity, response length, and tool call volume.

Model Support#

| Model | Thinking Mode | Effort Levels | Interleaved | Max Output |
| --- | --- | --- | --- | --- |
| Opus 4.6 | Adaptive | low, medium, high, max | Automatic | 128K |
| Sonnet 4.6 | Adaptive | low, medium, high | Automatic | 64K |
| Opus 4.5 | Manual (budget) | low, medium, high | Beta header | 128K |
| Sonnet 4.5 | Manual (budget) | low, medium, high | Beta header | 64K |
| Haiku 4.5 | Manual (budget) | low, medium, high | Beta header | 64K |

Adaptive thinking is available on Opus 4.6 and Sonnet 4.6; manual budget_tokens mode is deprecated on these models but remains the only mode on earlier ones. max effort remains exclusive to Opus 4.6.

Feature Compatibility#

Extended thinking is not compatible with:

  • temperature modifications (must be default)
  • top_k modifications
  • Forced tool use (tool_choice: "any" or tool_choice: "tool")
  • Response pre-filling

top_p can be set between 0.95 and 1.0 when thinking is enabled.

Changing thinking parameters invalidates prompt cache for messages, though system prompts and tool definitions remain cached.
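
A request shape that stays inside these constraints, with thinking on, temperature and top_k left at their defaults, and top_p inside the allowed band (values are illustrative):

```bash
# Compatible: thinking enabled, default temperature/top_k, top_p in [0.95, 1.0],
# no forced tool_choice, no pre-filled assistant turn
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 8000,
    "top_p": 0.98,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
    "messages": [{"role": "user", "content": "Summarize the tradeoffs discussed above."}]
  }'
```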

Best Practices#

  1. Use adaptive thinking on Opus 4.6. Don’t set manual budgets – let Claude decide how much to think. This is the default in Claude Code and works well for most workflows.

  2. Use effort levels as your primary cost control. Instead of toggling thinking on/off, adjust effort. medium provides a good balance for daily work. high or max for complex architecture and debugging.

  3. Use lower effort for subagents. Subagents doing research, file searching, or simple analysis don’t need deep thinking. Set model: sonnet or effort to low on delegated tasks.

  4. Don’t optimize thinking for simple tasks. On Opus 4.6 with adaptive thinking, Claude already minimizes thinking for simple requests.

  5. Enable verbose mode for debugging. Ctrl+O shows thinking text, which helps understand why Claude made specific decisions. Useful when Claude’s response doesn’t match expectations.

  6. Budget thinking tokens for cost-sensitive workflows. In CI/CD or headless mode, set MAX_THINKING_TOKENS to a reasonable cap to prevent runaway costs on unexpectedly complex inputs (see the sketch after this list).

  7. Use the opusplan alias. Opus for planning (where thinking helps most) and Sonnet for execution (where thinking is less critical) is a reasonable cost/quality split.
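
A minimal headless invocation with a thinking cap, using Claude Code's print mode (the prompt and cap value are illustrative):

```bash
# Bound per-request thinking cost in CI/CD or other headless runs
MAX_THINKING_TOKENS=8192 claude -p "Triage the failing tests and summarize root causes" \
  --output-format json
```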

Anti-Patterns#

  1. Setting MAX_THINKING_TOKENS on Opus 4.6. The variable is ignored on Opus 4.6 (except 0). Use effort levels instead.

  2. Disabling thinking globally. On complex tasks, thinking tokens are where quality improvements come from. Disable it per-task if needed, not globally.

  3. Using “ultrathink” or “think hard” in prompts. These phrases are interpreted as regular text, not as thinking budget controls. The old “ultrathink” keyword hack has been deprecated.

  4. Max effort on routine tasks. max effort on simple file edits wastes tokens and adds latency. Reserve max for tasks that require many dependent reasoning steps.

  5. Expecting visible token counts to match billing. Claude 4 models show summarized thinking. The billed count (full thinking) is higher than what you see. This is expected, not a bug.

  6. Ignoring thinking costs in headless mode. Automated pipelines can run many requests. Without MAX_THINKING_TOKENS or effort limits, thinking costs accumulate quickly.
