Prompt Caching: Why Your System Prompt Doesn’t Cost What You Think#
Executive Summary#
Prompt caching allows the API to reuse previously processed prompt prefixes, reducing both cost and latency. Since Claude Code re-sends the system prompt on every API call, caching is what makes large system prompts economically viable. Without caching, a 200-message session with a 15,000-token system prompt would cost ~$15 on Opus 4.6. With caching, it costs ~$1.60 – an 89% reduction.
| Metric | Without Caching | With Caching | Savings |
|---|---|---|---|
| System prompt cost (200 msgs, Opus 4.6) | ~$15.00 | ~$1.60 | 89% |
| System prompt cost (200 msgs, Sonnet 4.5) | ~$9.00 | ~$0.96 | 89% |
| Latency (long prompts) | Full processing | Up to 85% faster | Significant |
Pricing multipliers (all models):
| Operation | Multiplier vs Base Input | Opus 4.6 ($5 base) | Sonnet 4.5 ($3 base) |
|---|---|---|---|
| Cache write (5-min TTL) | 1.25x | $6.25/MTok | $3.75/MTok |
| Cache write (1-hour TTL) | 2x | $10/MTok | $6/MTok |
| Cache read | 0.1x | $0.50/MTok | $0.30/MTok |
| Uncached input | 1x | $5/MTok | $3/MTok |
Table of Contents#
- Prompt Caching: Why Your System Prompt Doesn’t Cost What You Think
How Prompt Caching Works#
The Basic Mechanism#
Every API call sends the full prompt to the model. Without caching, the model processes every token from scratch each time. With caching, the model can reuse previously processed prefixes:
Message 1 (no cache exists yet):
┌─────────────────────────────────────────┐
│ System prompt (15,000 tokens) │ ← Processed and written to cache
│ User message (50 tokens) │ ← Processed normally
└─────────────────────────────────────────┘
Cost: 15,000 × $6.25/MTok (cache write) + 50 × $5/MTok (uncached)
Message 2 (cache hit):
┌─────────────────────────────────────────┐
│ System prompt (15,000 tokens) │ ← Read from cache (90% cheaper)
│ Previous conversation (500 tokens) │ ← Processed normally
│ User message (50 tokens) │ ← Processed normally
└─────────────────────────────────────────┘
Cost: 15,000 × $0.50/MTok (cache read) + 550 × $5/MTok (uncached)The cache operates on prefix matching – it caches everything from the beginning of the prompt up to a designated breakpoint. The prefix must be identical between requests for a cache hit.
Cache Lifetime (TTL)#
- Default: 5 minutes – Refreshed every time the cached content is used (so active sessions keep the cache alive indefinitely)
- Optional: 1 hour – Costs more to write (2x base instead of 1.25x) but survives longer idle periods
- No manual cache clearing – entries expire automatically after the TTL without use
For interactive Claude Code sessions the 5-minute default is fine: as long as you send messages within 5 minutes of each other, the cache stays warm. For a longer-lived cache, the TTL is configurable through environment variables (see Configuring Cache Behavior). Subscription (Pro/Max) users get the 1-hour TTL automatically while usage stays within plan limits; once you exceed the limit and draw on usage credits, the TTL drops back to 5 minutes because that usage is billed per token.
What Gets Cached#
The cache covers the full prompt prefix in order: tools → system → messages. This order forms a hierarchy where each level builds on the previous.
Cacheable content includes:
- Tool definitions (including MCP tools)
- System messages (instructions, CLAUDE.md content, skill catalogs)
- Conversation messages (user turns, assistant turns, tool results)
- Images and documents in messages
- Tool use and tool result blocks
Not cacheable: Thinking blocks (from extended thinking) cannot be explicitly cached, though they get cached implicitly when passed back in tool use flows.
Prompt Caching in Claude Code#
Why It Matters for Claude Code#
Claude Code is uniquely positioned to benefit from prompt caching because its system prompt is:
- Large – Typically 12,000-20,000 tokens (see the system prompt article)
- Stable – The same content is sent on every message within a session
- Repeated frequently – Sessions routinely involve 50-200+ API calls
Without caching, a Claude Code session would be prohibitively expensive. The system prompt alone would cost $15-20 per 200-message session on Opus 4.6. Caching brings that down to ~$1.50-2.00.
The Cache Hierarchy#
Claude Code’s prompt follows the standard cache hierarchy:
1. Tool definitions (Read, Edit, Bash, Glob, Grep, Task, etc.)
+ MCP tool definitions (context7, episodic-memory, etc.)
│
2. System messages
├── Core instructions (safety rules, behavior guidelines)
├── CLAUDE.md files (all scopes)
├── Skill catalog (names + descriptions)
├── Subagent catalog (in Task tool definition)
├── MCP server instructions
└── Environment context (git status, working directory)
│
3. Conversation messages
├── User messages
├── Assistant responses
├── Tool calls and results
└── System reminders (hook outputs, plugin status)Levels 1 and 2 are nearly identical across messages – this is what gets cached. Level 3 grows with each message; the new portions are uncached.
What Stays Cached Between Messages#
In a typical Claude Code session:
| Content | Cached? | Why |
|---|---|---|
| Tool definitions | Yes (after first message) | Identical every message |
| Core instructions | Yes | Identical every message |
| CLAUDE.md files | Yes | Identical every message |
| Skill catalog | Yes | Identical every message |
| Subagent catalog | Yes | Identical every message |
| Previous conversation turns | Mostly | Prefix is identical; only new turns are uncached |
| Latest user message | No | New content each time |
| System reminders | Partially | Some are identical, some vary |
The result: after the first message, roughly 80-95% of input tokens on subsequent messages are cache reads at 10% of the base price.
What Breaks the Cache#
Changes to the prompt prefix invalidate the cache from that point forward:
| Change | Impact |
|---|---|
| Tool definition changes | Invalidates everything (tools are first in hierarchy) |
| System prompt changes | Invalidates system + messages cache |
| Enabling/disabling a plugin mid-session | Full cache invalidation (changes tool definitions + system prompt) |
| Adding/removing an MCP server | Full cache invalidation |
| Context compaction (long sessions) | Conversation history changes, partial invalidation |
In practice, cache invalidation rarely happens within a Claude Code session. The system prompt doesn’t change mid-session. The main cause of partial invalidation is context compaction in very long sessions, where older messages get summarized.
Configuring Cache Behavior#
Breakpoint placement is automatic. A few environment variables set the cache TTL or turn caching off, and one CLI flag controls the cached prefix:
| Setting | Effect |
|---|---|
ENABLE_PROMPT_CACHING_1H=1 | Request a 1-hour cache TTL instead of the 5-minute default (API key, Bedrock, Vertex, Foundry, Claude Platform on AWS). 1-hour writes cost 2x base. (Subscription users already get the 1-hour TTL automatically, as noted above.) |
FORCE_PROMPT_CACHING_5M=1 | Force the 5-minute TTL even where 1-hour would otherwise apply, overriding ENABLE_PROMPT_CACHING_1H (including one set in managed settings). Useful for debugging or comparing the two TTLs. |
ENABLE_PROMPT_CACHING_1H_BEDROCK | Deprecated – use ENABLE_PROMPT_CACHING_1H. |
DISABLE_PROMPT_CACHING=1 | Disable caching for all models (takes precedence over the per-model variants DISABLE_PROMPT_CACHING_OPUS, DISABLE_PROMPT_CACHING_SONNET, DISABLE_PROMPT_CACHING_HAIKU, and DISABLE_PROMPT_CACHING_FABLE). Claude Code prints a startup warning whenever caching is disabled this way. |
For scripted, multi-user headless runs, claude -p --exclude-dynamic-system-prompt-sections "<query>" moves the per-machine system-prompt sections (working directory, environment info, memory paths, git-repo flag) into the first user message. The cached prefix then stays identical across users and machines running the same task. The flag applies only to the default system prompt; it is ignored when --system-prompt or --system-prompt-file is set.
The Economics#
Model Pricing#
Current pricing for models commonly used with Claude Code:
| Model | Base Input | Cache Write (5min) | Cache Read | Output |
|---|---|---|---|---|
| Claude Opus 4.6 | $5/MTok | $6.25/MTok | $0.50/MTok | $25/MTok |
| Claude Opus 4.5 | $5/MTok | $6.25/MTok | $0.50/MTok | $25/MTok |
| Claude Sonnet 4.5 | $3/MTok | $3.75/MTok | $0.30/MTok | $15/MTok |
| Claude Sonnet 4 | $3/MTok | $3.75/MTok | $0.30/MTok | $15/MTok |
| Claude Haiku 4.5 | $1/MTok | $1.25/MTok | $0.10/MTok | $5/MTok |
The key ratio: cache reads are 10x cheaper than base input. This makes caching extremely effective for repeated prefixes.
Worked Example: A Typical Claude Code Session#
Assumptions:
- 15,000-token system prompt
- 200-message session
- Opus 4.6 pricing
Without caching:
200 messages × 15,000 tokens × $5/MTok = $15.00
(just for the system prompt -- conversation tokens add more)With caching:
Message 1 (cache write):
15,000 tokens × $6.25/MTok = $0.09
Messages 2-200 (cache reads):
199 × 15,000 tokens × $0.50/MTok = $1.49
Total system prompt cost: $0.09 + $1.49 = $1.58
Savings: $15.00 - $1.58 = $13.42 (89% reduction)The conversation history also benefits from caching. As the conversation grows, previously sent turns become part of the cached prefix. Only the newest message is uncached.
The Real Cost of System Prompt Bloat#
Even with caching, system prompt size matters:
| System Prompt Size | Without Caching (200 msgs) | With Caching (200 msgs) | Cache Reads Cost |
|---|---|---|---|
| 10,000 tokens | $10.00 | $1.06 | $1.00 |
| 15,000 tokens | $15.00 | $1.58 | $1.49 |
| 20,000 tokens | $20.00 | $2.11 | $1.99 |
| 30,000 tokens | $30.00 | $3.16 | $2.99 |
(Opus 4.6 pricing)
The cost difference between a 10K and 30K system prompt is ~$2 per session with caching – noticeable over many sessions but not catastrophic. The more significant impact of a bloated system prompt is context window space: those 30,000 tokens are unavailable for conversation content regardless of caching.
Cache Mechanics#
Prefix Matching#
Cache hits require 100% identical content from the beginning of the prompt up to the cache breakpoint. Even a single character difference causes a cache miss for everything after the point of difference.
This is why the system prompt is ideal for caching – it’s assembled from the same sources every message and doesn’t change mid-session.
Minimum Token Requirements#
Not all prompts are eligible for caching. Each model has a minimum cacheable prefix length:
| Model | Minimum Tokens |
|---|---|
| Claude Opus 4.6, Opus 4.5 | 4,096 |
| Claude Sonnet 4.5, Sonnet 4, Opus 4.1, Opus 4 | 1,024 |
| Claude Haiku 4.5 | 4,096 |
Claude Code system prompts are well above these minimums (typically 12,000-20,000 tokens), so caching always applies.
Cache Breakpoints#
The API supports up to 4 explicit cache breakpoints using cache_control parameters. Claude Code manages these internally – you don’t set them yourself.
The system automatically checks for cache hits by looking backwards from each breakpoint (up to 20 blocks), finding the longest matching prefix. In Claude Code, this means:
- Tool definitions get cached as a block
- System prompt gets cached as a block
- Conversation history gets incrementally cached as it grows
Cache Invalidation Rules#
The cache follows the hierarchy: tools → system → messages. Changes at each level invalidate that level and everything after it.
tools change → tools, system, messages all invalidated
system changes → system and messages invalidated (tools still cached)
messages change → only messages invalidated (tools and system still cached)Specific invalidation triggers:
| Change | Tools Cache | System Cache | Messages Cache |
|---|---|---|---|
| Tool definitions modified | Invalidated | Invalidated | Invalidated |
| Web search toggled | Valid | Invalidated | Invalidated |
| Speed mode toggled | Valid | Invalidated | Invalidated |
| Tool choice changed | Valid | Valid | Invalidated |
| Images added/removed | Valid | Valid | Invalidated |
| Thinking params changed | Valid | Valid | Invalidated |
What This Means for Optimization#
Cost Optimization vs Context Optimization#
Prompt caching creates an important distinction between two types of optimization:
Cost optimization – Reducing the dollar amount spent on input tokens. Caching handles this automatically and effectively. The 90% discount on cached reads means that even a large system prompt is cheap per-message.
Context optimization – Reducing the context window space consumed by the system prompt. Caching does not help here. A 20,000-token system prompt still occupies 20,000 tokens of your context window on every message, whether those tokens are cached or not.
Context window budget (e.g., 200K tokens):
┌──────────────────────────────────────────────┐
│ System prompt: 20,000 tokens │ ← Cheap (cached) but takes space
│ Conversation history: grows over time │ ← New content, full price
│ Available for new content: what's left │ ← This is what you're optimizing
└──────────────────────────────────────────────┘This is why the token optimization article still matters even with caching. Disabling unused plugins saves context window space in addition to money.
When to Worry About Caching#
You generally don’t need to think about prompt caching in Claude Code – it’s handled automatically. But be aware of these scenarios:
- Long idle periods – If you step away for more than 5 minutes between messages, the cache expires. The next message pays full price for a cache write. This is a one-time cost and the cache rebuilds immediately.
- Very long sessions – Context compaction (summarizing old messages) changes the conversation history, causing partial cache invalidation. This is normal and expected.
- Plugin/MCP changes mid-session – If you enable/disable plugins or MCP servers during a session, it invalidates the entire cache. Restart Claude Code to get clean caching.
When Not to Worry#
- System prompt size (for cost) – With caching, the cost difference between a 10K and 30K system prompt is ~$2 per 200-message session. Not worth agonizing over.
- Conversation length – Previous turns get cached incrementally. Long conversations are efficiently handled.
- Cache management – Claude Code places breakpoints and manages cache strategy automatically. What you can change is the cache lifetime and whether caching runs at all, set via environment variables (see Configuring Cache Behavior).
Cache Break Detection#
Claude Code includes an internal diagnostic system (promptCacheBreakDetection.ts) that monitors cache effectiveness across API calls. It works in two phases:
- Pre-call (
recordPromptState): Before each API call, the system hashes the system prompt, tool schemas, model, beta headers, cache control settings, effort value, and other parameters that affect the server-side cache key. It compares these against the previous call’s state and records any differences as “pending changes.” - Post-call (
checkResponseForCacheBreak): After receiving the API response, it comparescache_read_input_tokensagainst the previous call. A cache break is flagged when cache reads drop more than 5% and the absolute drop exceeds 2,000 tokens. The pending changes from phase 1 are used to explain the cause – system prompt mutation, tool schema change, model switch, beta header flip, TTL expiry (5-min or 1-hour), or server-side eviction.
Detected breaks are logged as tengu_prompt_cache_break analytics events with detailed attribution (which tools changed, character deltas, time gap). Compaction and cached microcompact deletions reset the baseline to avoid false positives. The system only tracks long-lived sources (main REPL thread, SDK, custom/default/builtin agents) – short-lived forked agents are excluded since they lack a meaningful comparison baseline.
References#
- Prompt Caching (Anthropic Docs) – Full API documentation
- Pricing – Current model pricing with caching multipliers
- System Prompt Article – What gets cached in Claude Code
- Token Optimization Article – Context window optimization (complementary to caching)