Cost Tracking and Budgets#
The Cost Challenge#
Bedrock bills per token. 500 developers using Opus for everything can cost $50K–$200K+/month depending on usage intensity. Without controls, costs are unpredictable and can spike when developers discover long-running agentic workflows.
Model Tiering Strategy#
The LLM gateway is the control point for cost management.
| Use Case | Model | Approx. Cost | Access |
|---|---|---|---|
| Routine coding, quick edits | Sonnet | Lower per-token | Default for all developers |
| Architecture, complex reasoning | Opus | Higher per-token | Gated to senior engineers or by request |
| Summarization, classification | Haiku | Lowest per-token | Claude Code uses automatically as fast model |
Implementation#
Configure the LLM gateway to:
- Default all requests to Sonnet
- Route to Opus only for users/teams with explicit Opus access
- Or allow developers to select a model via `ANTHROPIC_MODEL`, but with higher budget scrutiny for Opus users
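The routing rules above can be sketched as gateway logic. This is a minimal illustration, not a real gateway plugin: the function name, team list, and model IDs are all assumptions to be adapted to your gateway's API and Bedrock region.

```python
# Sketch of model-tier routing at the LLM gateway. The model IDs and
# team names here are placeholders, not real identifiers.
DEFAULT_MODEL = "sonnet"                  # assumed alias for the Sonnet model ID
OPUS_MODEL = "opus"                       # assumed alias for the Opus model ID
OPUS_ALLOWED_TEAMS = {"platform-architecture", "principal-engineers"}

def route_model(requested_model, team):
    """Default to Sonnet; honor an explicit Opus request only for gated teams."""
    if requested_model and "opus" in requested_model.lower():
        if team in OPUS_ALLOWED_TEAMS:
            return OPUS_MODEL
        return DEFAULT_MODEL  # silently downgrade; or raise to signal denial
    return requested_model or DEFAULT_MODEL

print(route_model("opus", "platform-architecture"))  # gated team gets Opus
print(route_model("opus", "web-frontend"))           # everyone else gets Sonnet
```

Whether an ungated Opus request is silently downgraded or rejected with an explanatory error is a policy choice; an explicit error is more transparent but noisier.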
Budget Controls#
Per-Team Monthly Budgets#
Set via the LLM gateway. When a team reaches 80% of budget, warn the team lead. At 100%, throttle (don’t hard-block) to prevent disruption.
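The warn/throttle policy is simple enough to express directly. A minimal sketch, assuming the gateway exposes a per-request hook that can see month-to-date team spend (the function and return values are illustrative):

```python
# The 80%-warn / 100%-throttle policy from this section, as a gateway hook.
def budget_action(spend, budget):
    """Return the gateway action for a team's month-to-date spend."""
    if spend >= budget:
        return "throttle"   # slow requests rather than hard-block
    if spend >= 0.8 * budget:
        return "warn"       # notify the team lead
    return "allow"

print(budget_action(8_500, 10_000))   # past 80% of budget -> warn
```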
Per-User Daily Limits#
Optional guardrail to prevent individual developers from consuming disproportionate resources. Set based on Cohort 1/2 usage data.
Typical Usage Expectations#
Based on industry data and Cohort 1 calibration:
- Light users (occasional queries): 20K–50K tokens/day
- Moderate users (regular code generation): 50K–150K tokens/day
- Heavy users (agentic workflows, long sessions): 150K–500K tokens/day
Expect 60% light, 30% moderate, 10% heavy across 500 developers.
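Those bands give a back-of-envelope fleet-wide volume. Using the midpoint of each range (the midpoints are an assumption for illustration, not a forecast):

```python
# Fleet-wide daily token estimate from the usage bands above,
# taking the midpoint of each range as a rough per-user figure.
DEVS = 500
bands = {
    # band: (share of devs, midpoint tokens/day)
    "light":    (0.60,  35_000),   # 20K-50K
    "moderate": (0.30, 100_000),   # 50K-150K
    "heavy":    (0.10, 325_000),   # 150K-500K
}
daily_tokens = sum(DEVS * share * mid for share, mid in bands.values())
print(f"~{daily_tokens / 1e6:.1f}M tokens/day")  # roughly 41.8M tokens/day
```

Numbers like this are only for sizing budgets and provisioned-throughput discussions; calibrate against real Cohort 1 data as soon as it exists.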
AWS Cost Visibility#
Cost Explorer#
- Scope to the dedicated Bedrock AWS account
- Tag costs by team using gateway metadata → CloudWatch → Cost Allocation Tags
- Monthly cost reports to finance and engineering leadership
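For the gateway-metadata-to-CloudWatch step, one option is to have the gateway emit a per-request metric tagged by team and model. A sketch of the metric payload (the payload shape follows CloudWatch `PutMetricData`; the namespace and metric names are assumptions):

```python
# Per-team usage metric a gateway could emit to CloudWatch.
def token_metric(team, model, tokens):
    """Build one PutMetricData datum, dimensioned by team and model."""
    return {
        "MetricName": "TokensConsumed",
        "Dimensions": [
            {"Name": "Team", "Value": team},
            {"Name": "Model", "Value": model},
        ],
        "Value": tokens,
        "Unit": "Count",
    }

datum = token_metric("payments", "sonnet", 12_345)
# Sent via e.g. boto3.client("cloudwatch").put_metric_data(
#     Namespace="LLMGateway", MetricData=[datum])
```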
CloudWatch Dashboards#
Deploy dashboards showing:
- Total token consumption (daily, weekly, monthly)
- Per-team token consumption
- Per-model token consumption (Sonnet vs. Opus vs. Haiku)
- Cost trends and projections
- Top 10 users by consumption (for outlier detection, not surveillance)
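The top-10 view is straightforward to derive from gateway usage logs. A sketch, assuming logs can be reduced to (user, tokens) pairs (the log shape and user names here are hypothetical):

```python
# Top-N consumers from gateway usage records, for outlier detection.
from collections import Counter

usage_log = [  # hypothetical (user, tokens) pairs from gateway logs
    ("alice", 400_000), ("bob", 60_000),
    ("alice", 250_000), ("carol", 90_000),
]
totals = Counter()
for user, tokens in usage_log:
    totals[user] += tokens
top = totals.most_common(10)   # [(user, total_tokens), ...] descending
print(top[0])
```

As noted above, the point is spotting unexpected consumption patterns (a runaway agentic loop, a misconfigured client), not ranking developers.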
Prompt Caching#
Bedrock supports prompt caching, which can reduce costs by up to 90% for repeated context patterns (like CLAUDE.md and rules that load every session). Monitor cache hit rates during Cohort 1 and optimize:
- Stable context (CLAUDE.md, rules) benefits most from caching
- Frequently-changing context (code files) benefits less
- Prompt caching behavior on Bedrock may differ from the direct Anthropic API – test explicitly
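A rough way to translate cache hit rates into dollars. This sketch assumes cache reads bill at ~10% of the base input rate and uses an assumed $3/MTok base input price for illustration; verify the actual Bedrock rates for your model before using figures like these in a budget.

```python
# Estimate effective input cost at a given cache hit rate.
# Assumes cache reads cost cache_read_factor (~10%) of the base input rate.
def effective_input_cost(tokens, hit_rate, base_per_mtok, cache_read_factor=0.10):
    cached = tokens * hit_rate
    uncached = tokens - cached
    return (uncached + cached * cache_read_factor) * base_per_mtok / 1e6

# 10M input tokens/day at an assumed $3/MTok base rate:
print(effective_input_cost(10_000_000, 0.0, 3.0))   # $30.00/day, no caching
print(effective_input_cost(10_000_000, 0.8, 3.0))   # $8.40/day at 80% hit rate
```

This ignores cache-write surcharges, which is why measuring real hit rates during Cohort 1 matters before projecting savings.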
Extended Thinking Costs#
Opus uses extended thinking – a reasoning phase where the model works through complex problems before responding. Thinking tokens are billed as output tokens at the full output rate.
Why This Matters for Cost Planning#
- Thinking tokens can be 2-10x the visible output tokens on complex tasks
- A developer using Opus for architecture work might generate 50K+ thinking tokens per session
- These tokens don’t appear in Claude’s response but show up in your bill
- The `MAX_THINKING_TOKENS` environment variable can cap the thinking budget (default: 31,999; max: 63,999)
Cost Mitigation#
- Default developers to Sonnet (no extended thinking cost)
- Gate Opus access to senior engineers or specific use cases via the LLM gateway
- Monitor thinking token consumption separately in gateway metrics – if your gateway logs input/output tokens, thinking tokens appear in the output count
- Set `MAX_THINKING_TOKENS` in the `env` block of `managed-settings.json` to cap per-request thinking cost for Opus users
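A minimal `managed-settings.json` fragment for that last point; the 16000 cap is an illustrative value, not a recommendation – pick a cap that fits your Opus budgets:

```json
{
  "env": {
    "MAX_THINKING_TOKENS": "16000"
  }
}
```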
Rough Cost Impact#
At Bedrock Opus pricing, thinking tokens cost the same as output tokens. A heavy Opus user generating 100K thinking tokens/day adds measurable cost. Factor this into per-user budgets for Opus-authorized developers.
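Making "measurable cost" concrete: a quick calculation for the heavy-user scenario above, assuming ~$75 per million output tokens for Opus (check current Bedrock pricing; this figure is an assumption).

```python
# Illustrative thinking-token cost for one heavy Opus user.
OUTPUT_PER_MTOK = 75.0            # assumed Opus output rate, USD per MTok
thinking_per_day = 100_000        # the heavy-user scenario above
daily = thinking_per_day * OUTPUT_PER_MTOK / 1e6
monthly = daily * 22              # ~22 working days
print(f"${daily:.2f}/day, ${monthly:.0f}/month")  # $7.50/day, $165/month
```

That is per user, on top of visible output tokens – small alone, but it compounds across every Opus-authorized developer.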
Provisioned Throughput#
At 500 developers, on-demand Bedrock may hit rate limits during peak hours. Consider provisioned throughput for Sonnet to guarantee capacity. Provisioned throughput costs more but provides:
- Guaranteed request rate
- Predictable latency
- No throttling during peak usage
Evaluate after Cohort 2 based on observed peak concurrent usage.
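A rough way to frame that evaluation is estimated peak tokens-per-minute versus your on-demand quota. All three inputs below (active fraction, request size, request rate) are placeholder assumptions to be replaced with observed Cohort 1/2 numbers:

```python
# Back-of-envelope peak tokens/min, to compare against on-demand quotas.
devs = 500
peak_active_fraction = 0.20       # assume 20% of devs active at peak
requests_per_min = 0.5            # assume one request every 2 min per active dev
tokens_per_request = 4_000        # assume avg input+output tokens per request
peak_tpm = devs * peak_active_fraction * requests_per_min * tokens_per_request
print(f"~{peak_tpm:,.0f} tokens/min at peak")  # ~200,000 tokens/min
```

If the observed figure sits comfortably under the on-demand quota for Sonnet in your region, provisioned throughput may not be worth the premium; if it brushes the limit during peak hours, it likely is.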