Prompt Caching: Why Your System Prompt Doesn’t Cost What You Think#

Executive Summary#

Prompt caching allows the API to reuse previously processed prompt prefixes, reducing both cost and latency. Since Claude Code re-sends the system prompt on every API call, caching is what makes large system prompts economically viable. Without caching, a 200-message session with a 15,000-token system prompt would cost ~$15 on Opus 4.6. With caching, it costs ~$1.60 – an 89% reduction.

Metric	Without Caching	With Caching	Savings
System prompt cost (200 msgs, Opus 4.6)	~$15.00	~$1.60	89%
System prompt cost (200 msgs, Sonnet 4.5)	~$9.00	~$0.96	89%
Latency (long prompts)	Full processing	Up to 85% faster	Significant

Pricing multipliers (all models):

Operation	Multiplier vs Base Input	Opus 4.6 ($5 base)	Sonnet 4.5 ($3 base)
Cache write (5-min TTL)	1.25x	$6.25/MTok	$3.75/MTok
Cache write (1-hour TTL)	2x	$10/MTok	$6/MTok
Cache read	0.1x	$0.50/MTok	$0.30/MTok
Uncached input	1x	$5/MTok	$3/MTok

Table of Contents#

Prompt Caching: Why Your System Prompt Doesn’t Cost What You Think

How Prompt Caching Works#

The Basic Mechanism#

Every API call sends the full prompt to the model. Without caching, the model processes every token from scratch each time. With caching, the model can reuse previously processed prefixes:

Message 1 (no cache exists yet):
┌─────────────────────────────────────────┐
│ System prompt (15,000 tokens)           │ ← Processed and written to cache
│ User message (50 tokens)                │ ← Processed normally
└─────────────────────────────────────────┘
  Cost: 15,000 × $6.25/MTok (cache write) + 50 × $5/MTok (uncached)

Message 2 (cache hit):
┌─────────────────────────────────────────┐
│ System prompt (15,000 tokens)           │ ← Read from cache (90% cheaper)
│ Previous conversation (500 tokens)      │ ← Processed normally
│ User message (50 tokens)                │ ← Processed normally
└─────────────────────────────────────────┘
  Cost: 15,000 × $0.50/MTok (cache read) + 550 × $5/MTok (uncached)

The cache operates on prefix matching – it caches everything from the beginning of the prompt up to a designated breakpoint. The prefix must be identical between requests for a cache hit.

Cache Lifetime (TTL)#

Default: 5 minutes – Refreshed every time the cached content is used (so active sessions keep the cache alive indefinitely)
Optional: 1 hour – Costs more to write (2x base instead of 1.25x) but survives longer idle periods
No manual cache clearing – entries expire automatically after the TTL without use

For interactive Claude Code sessions the 5-minute default is fine: as long as you send messages within 5 minutes of each other, the cache stays warm. For a longer-lived cache, the TTL is configurable through environment variables (see Configuring Cache Behavior). Subscription (Pro/Max) users get the 1-hour TTL automatically while usage stays within plan limits; once you exceed the limit and draw on usage credits, the TTL drops back to 5 minutes because that usage is billed per token.

What Gets Cached#

The cache covers the full prompt prefix in order: tools → system → messages. This order forms a hierarchy where each level builds on the previous.

Cacheable content includes:

Tool definitions (including MCP tools)
System messages (instructions, CLAUDE.md content, skill catalogs)
Conversation messages (user turns, assistant turns, tool results)
Images and documents in messages
Tool use and tool result blocks

Not cacheable: Thinking blocks (from extended thinking) cannot be explicitly cached, though they get cached implicitly when passed back in tool use flows.

Prompt Caching in Claude Code#

Why It Matters for Claude Code#

Claude Code is uniquely positioned to benefit from prompt caching because its system prompt is:

Large – Typically 12,000-20,000 tokens (see the system prompt article)
Stable – The same content is sent on every message within a session
Repeated frequently – Sessions routinely involve 50-200+ API calls

Without caching, a Claude Code session would be prohibitively expensive. The system prompt alone would cost $15-20 per 200-message session on Opus 4.6. Caching brings that down to ~$1.50-2.00.

The Cache Hierarchy#

Claude Code’s prompt follows the standard cache hierarchy:

1. Tool definitions (Read, Edit, Bash, Glob, Grep, Task, etc.)
   + MCP tool definitions (context7, episodic-memory, etc.)
       │
2. System messages
   ├── Core instructions (safety rules, behavior guidelines)
   ├── CLAUDE.md files (all scopes)
   ├── Skill catalog (names + descriptions)
   ├── Subagent catalog (in Task tool definition)
   ├── MCP server instructions
   └── Environment context (git status, working directory)
       │
3. Conversation messages
   ├── User messages
   ├── Assistant responses
   ├── Tool calls and results
   └── System reminders (hook outputs, plugin status)

Levels 1 and 2 are nearly identical across messages – this is what gets cached. Level 3 grows with each message; the new portions are uncached.

What Stays Cached Between Messages#

In a typical Claude Code session:

Content	Cached?	Why
Tool definitions	Yes (after first message)	Identical every message
Core instructions	Yes	Identical every message
CLAUDE.md files	Yes	Identical every message
Skill catalog	Yes	Identical every message
Subagent catalog	Yes	Identical every message
Previous conversation turns	Mostly	Prefix is identical; only new turns are uncached
Latest user message	No	New content each time
System reminders	Partially	Some are identical, some vary

The result: after the first message, roughly 80-95% of input tokens on subsequent messages are cache reads at 10% of the base price.

What Breaks the Cache#

Changes to the prompt prefix invalidate the cache from that point forward:

Change	Impact
Tool definition changes	Invalidates everything (tools are first in hierarchy)
System prompt changes	Invalidates system + messages cache
Enabling/disabling a plugin mid-session	Full cache invalidation (changes tool definitions + system prompt)
Adding/removing an MCP server	Full cache invalidation
Context compaction (long sessions)	Conversation history changes, partial invalidation

In practice, cache invalidation rarely happens within a Claude Code session. The system prompt doesn’t change mid-session. The main cause of partial invalidation is context compaction in very long sessions, where older messages get summarized.

Configuring Cache Behavior#

Breakpoint placement is automatic. A few environment variables set the cache TTL or turn caching off, and one CLI flag controls the cached prefix:

Setting	Effect
`ENABLE_PROMPT_CACHING_1H=1`	Request a 1-hour cache TTL instead of the 5-minute default (API key, Bedrock, Vertex, Foundry, Claude Platform on AWS). 1-hour writes cost 2x base. (Subscription users already get the 1-hour TTL automatically, as noted above.)
`FORCE_PROMPT_CACHING_5M=1`	Force the 5-minute TTL even where 1-hour would otherwise apply, overriding `ENABLE_PROMPT_CACHING_1H` (including one set in managed settings). Useful for debugging or comparing the two TTLs.
`ENABLE_PROMPT_CACHING_1H_BEDROCK`	Deprecated – use `ENABLE_PROMPT_CACHING_1H`.
`DISABLE_PROMPT_CACHING=1`	Disable caching for all models (takes precedence over the per-model variants `DISABLE_PROMPT_CACHING_OPUS`, `DISABLE_PROMPT_CACHING_SONNET`, `DISABLE_PROMPT_CACHING_HAIKU`, and `DISABLE_PROMPT_CACHING_FABLE`). Claude Code prints a startup warning whenever caching is disabled this way.

For scripted, multi-user headless runs, claude -p --exclude-dynamic-system-prompt-sections "<query>" moves the per-machine system-prompt sections (working directory, environment info, memory paths, git-repo flag) into the first user message. The cached prefix then stays identical across users and machines running the same task. The flag applies only to the default system prompt; it is ignored when --system-prompt or --system-prompt-file is set.

The Economics#

Model Pricing#

Current pricing for models commonly used with Claude Code:

Model	Base Input	Cache Write (5min)	Cache Read	Output
Claude Opus 4.6	$5/MTok	$6.25/MTok	$0.50/MTok	$25/MTok
Claude Opus 4.5	$5/MTok	$6.25/MTok	$0.50/MTok	$25/MTok
Claude Sonnet 4.5	$3/MTok	$3.75/MTok	$0.30/MTok	$15/MTok
Claude Sonnet 4	$3/MTok	$3.75/MTok	$0.30/MTok	$15/MTok
Claude Haiku 4.5	$1/MTok	$1.25/MTok	$0.10/MTok	$5/MTok

The key ratio: cache reads are 10x cheaper than base input. This makes caching extremely effective for repeated prefixes.

Worked Example: A Typical Claude Code Session#

Assumptions:

15,000-token system prompt
200-message session
Opus 4.6 pricing

Without caching:

200 messages × 15,000 tokens × $5/MTok = $15.00
(just for the system prompt -- conversation tokens add more)

With caching:

Message 1 (cache write):
  15,000 tokens × $6.25/MTok = $0.09

Messages 2-200 (cache reads):
  199 × 15,000 tokens × $0.50/MTok = $1.49

Total system prompt cost: $0.09 + $1.49 = $1.58
Savings: $15.00 - $1.58 = $13.42 (89% reduction)

The conversation history also benefits from caching. As the conversation grows, previously sent turns become part of the cached prefix. Only the newest message is uncached.

The Real Cost of System Prompt Bloat#

Even with caching, system prompt size matters:

System Prompt Size	Without Caching (200 msgs)	With Caching (200 msgs)	Cache Reads Cost
10,000 tokens	$10.00	$1.06	$1.00
15,000 tokens	$15.00	$1.58	$1.49
20,000 tokens	$20.00	$2.11	$1.99
30,000 tokens	$30.00	$3.16	$2.99

(Opus 4.6 pricing)

The cost difference between a 10K and 30K system prompt is ~$2 per session with caching – noticeable over many sessions but not catastrophic. The more significant impact of a bloated system prompt is context window space: those 30,000 tokens are unavailable for conversation content regardless of caching.

Cache Mechanics#

Prefix Matching#

Cache hits require 100% identical content from the beginning of the prompt up to the cache breakpoint. Even a single character difference causes a cache miss for everything after the point of difference.

This is why the system prompt is ideal for caching – it’s assembled from the same sources every message and doesn’t change mid-session.

Minimum Token Requirements#

Not all prompts are eligible for caching. Each model has a minimum cacheable prefix length:

Model	Minimum Tokens
Claude Opus 4.6, Opus 4.5	4,096
Claude Sonnet 4.5, Sonnet 4, Opus 4.1, Opus 4	1,024
Claude Haiku 4.5	4,096

Claude Code system prompts are well above these minimums (typically 12,000-20,000 tokens), so caching always applies.

Cache Breakpoints#

The API supports up to 4 explicit cache breakpoints using cache_control parameters. Claude Code manages these internally – you don’t set them yourself.

The system automatically checks for cache hits by looking backwards from each breakpoint (up to 20 blocks), finding the longest matching prefix. In Claude Code, this means:

Tool definitions get cached as a block
System prompt gets cached as a block
Conversation history gets incrementally cached as it grows

Cache Invalidation Rules#

The cache follows the hierarchy: tools → system → messages. Changes at each level invalidate that level and everything after it.

tools change    → tools, system, messages all invalidated
system changes  → system and messages invalidated (tools still cached)
messages change → only messages invalidated (tools and system still cached)

Specific invalidation triggers:

Change	Tools Cache	System Cache	Messages Cache
Tool definitions modified	Invalidated	Invalidated	Invalidated
Web search toggled	Valid	Invalidated	Invalidated
Speed mode toggled	Valid	Invalidated	Invalidated
Tool choice changed	Valid	Valid	Invalidated
Images added/removed	Valid	Valid	Invalidated
Thinking params changed	Valid	Valid	Invalidated

What This Means for Optimization#

Cost Optimization vs Context Optimization#

Prompt caching creates an important distinction between two types of optimization:

Cost optimization – Reducing the dollar amount spent on input tokens. Caching handles this automatically and effectively. The 90% discount on cached reads means that even a large system prompt is cheap per-message.

Context optimization – Reducing the context window space consumed by the system prompt. Caching does not help here. A 20,000-token system prompt still occupies 20,000 tokens of your context window on every message, whether those tokens are cached or not.

Context window budget (e.g., 200K tokens):
┌──────────────────────────────────────────────┐
│ System prompt: 20,000 tokens                 │ ← Cheap (cached) but takes space
│ Conversation history: grows over time        │ ← New content, full price
│ Available for new content: what's left       │ ← This is what you're optimizing
└──────────────────────────────────────────────┘

This is why the token optimization article still matters even with caching. Disabling unused plugins saves context window space in addition to money.

When to Worry About Caching#

You generally don’t need to think about prompt caching in Claude Code – it’s handled automatically. But be aware of these scenarios:

Long idle periods – If you step away for more than 5 minutes between messages, the cache expires. The next message pays full price for a cache write. This is a one-time cost and the cache rebuilds immediately.
Very long sessions – Context compaction (summarizing old messages) changes the conversation history, causing partial cache invalidation. This is normal and expected.
Plugin/MCP changes mid-session – If you enable/disable plugins or MCP servers during a session, it invalidates the entire cache. Restart Claude Code to get clean caching.

When Not to Worry#

System prompt size (for cost) – With caching, the cost difference between a 10K and 30K system prompt is ~$2 per 200-message session. Not worth agonizing over.
Conversation length – Previous turns get cached incrementally. Long conversations are efficiently handled.
Cache management – Claude Code places breakpoints and manages cache strategy automatically. What you can change is the cache lifetime and whether caching runs at all, set via environment variables (see Configuring Cache Behavior).

Cache Break Detection#

Claude Code includes an internal diagnostic system (promptCacheBreakDetection.ts) that monitors cache effectiveness across API calls. It works in two phases:

Pre-call (recordPromptState): Before each API call, the system hashes the system prompt, tool schemas, model, beta headers, cache control settings, effort value, and other parameters that affect the server-side cache key. It compares these against the previous call’s state and records any differences as “pending changes.”
Post-call (checkResponseForCacheBreak): After receiving the API response, it compares cache_read_input_tokens against the previous call. A cache break is flagged when cache reads drop more than 5% and the absolute drop exceeds 2,000 tokens. The pending changes from phase 1 are used to explain the cause – system prompt mutation, tool schema change, model switch, beta header flip, TTL expiry (5-min or 1-hour), or server-side eviction.

Detected breaks are logged as tengu_prompt_cache_break analytics events with detailed attribution (which tools changed, character deltas, time gap). Compaction and cached microcompact deletions reset the baseline to avoid false positives. The system only tracks long-lived sources (main REPL thread, SDK, custom/default/builtin agents) – short-lived forked agents are excluded since they lack a meaningful comparison baseline.

References#

Prompt Caching (Anthropic Docs) – Full API documentation
Pricing – Current model pricing with caching multipliers
System Prompt Article – What gets cached in Claude Code
Token Optimization Article – Context window optimization (complementary to caching)