Cost Tracking and Budgets#

The Cost Challenge#

Bedrock bills per token. With 500 developers using Opus for everything, spend can reach $50K–$200K+/month depending on usage intensity. Without controls, costs are unpredictable and can spike once developers discover long-running agentic workflows.

Model Tiering Strategy#

The LLM gateway is the control point for cost management.

| Use Case | Model | Approx. Cost | Access |
| --- | --- | --- | --- |
| Routine coding, quick edits | Sonnet | Lower per-token | Default for all developers |
| Architecture, complex reasoning | Opus | Higher per-token | Gated to senior engineers or by request |
| Summarization, classification | Haiku | Lowest per-token | Claude Code uses automatically as fast model |

Implementation#

Configure the LLM gateway to:

  • Default all requests to Sonnet
  • Route to Opus only for users/teams with explicit Opus access
  • Alternatively, let developers select a model via ANTHROPIC_MODEL, and apply closer budget scrutiny to Opus users
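
The first two routing rules can be sketched as a small decision function. This is a minimal illustration, not any particular gateway's API: the `Caller` type, the `select_model` name, and the short model aliases are all hypothetical, and a real gateway would map them to its own Bedrock model IDs and access lists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical aliases; substitute the Bedrock model IDs your gateway uses.
SONNET = "sonnet"
OPUS = "opus"


@dataclass(frozen=True)
class Caller:
    user: str
    team: str


def select_model(caller: Caller, requested: Optional[str],
                 opus_users: set, opus_teams: set) -> str:
    """Pick the model to route to: Sonnet by default, Opus only with explicit access."""
    if requested is None:
        return SONNET  # no preference expressed: use the default tier
    if requested == OPUS:
        allowed = caller.user in opus_users or caller.team in opus_teams
        return OPUS if allowed else SONNET  # downgrade rather than reject
    return requested  # other models (e.g. the fast model) pass through unchanged
```

Downgrading an unauthorized Opus request to Sonnet instead of rejecting it is a design choice in this sketch; a gateway could equally return an error that points the developer at the access request process.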

Budget Controls#

Per-Team Monthly Budgets#

Set via the LLM gateway. When a team reaches 80% of budget, warn the team lead. At 100%, throttle (don’t hard-block) to prevent disruption.
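
The 80% and 100% thresholds amount to a three-way decision on each request. The sketch below assumes the gateway can report month-to-date spend per team; the function name and the returned action strings are hypothetical.

```python
def budget_action(spent_usd: float, budget_usd: float) -> str:
    """Map a team's month-to-date spend to a gateway action."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spent_usd / budget_usd
    if ratio >= 1.0:
        return "throttle"  # slow requests down; do not hard-block
    if ratio >= 0.8:
        return "warn"      # notify the team lead
    return "allow"
```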

Per-User Daily Limits#

An optional guardrail that prevents individual developers from consuming disproportionate resources. Set limits based on usage data from Cohorts 1 and 2.

Typical Usage Expectations#

Based on industry data and Cohort 1 calibration:

  • Light users (occasional queries): 20K–50K tokens/day
  • Moderate users (regular code generation): 50K–150K tokens/day
  • Heavy users (agentic workflows, long sessions): 150K–500K tokens/day

Expect 60% light, 30% moderate, 10% heavy across 500 developers.
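
Combining the usage mix with the per-user ranges gives a fleet-wide daily token range. The sketch below only restates the figures above as arithmetic; the function and constant names are illustrative.

```python
HEADCOUNT = 500

# cohort: (share of developers in percent, low tokens/day, high tokens/day)
COHORTS = {
    "light": (60, 20_000, 50_000),
    "moderate": (30, 50_000, 150_000),
    "heavy": (10, 150_000, 500_000),
}


def fleet_tokens_per_day(headcount=HEADCOUNT, cohorts=COHORTS):
    """Return (low, high) total tokens/day across all developers."""
    low = high = 0
    for share_pct, per_user_low, per_user_high in cohorts.values():
        users = headcount * share_pct // 100
        low += users * per_user_low
        high += users * per_user_high
    return low, high


print(fleet_tokens_per_day())  # (21000000, 62500000)
```

With these inputs the fleet lands at roughly 21M–62.5M tokens/day. Heavy users are 10% of headcount but account for about a third or more of the total at either end of the range.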

AWS Cost Visibility#

Cost Explorer#

  • Scope to the dedicated Bedrock AWS account
  • Tag costs by team using gateway metadata → CloudWatch → Cost Allocation Tags
  • Monthly cost reports to finance and engineering leadership

CloudWatch Dashboards#

Deploy dashboards showing:

  • Total token consumption (daily, weekly, monthly)
  • Per-team token consumption
  • Per-model token consumption (Sonnet vs. Opus vs. Haiku)
  • Cost trends and projections
  • Top 10 users by consumption (for outlier detection, not surveillance)
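
The top-consumers panel is a simple aggregation over per-request token counts. A minimal sketch, assuming the gateway logs can be exported as (user, tokens) pairs; that record shape is an assumption, not a documented gateway format.

```python
from collections import Counter


def top_consumers(records, n=10):
    """Sum tokens per user and return the n largest as (user, total) pairs."""
    totals = Counter()
    for user, tokens in records:
        totals[user] += tokens
    return totals.most_common(n)
```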

Slicing Metrics by Team and Entrypoint#

Claude Code emits its own OpenTelemetry metrics (token counts, cost, lines of code) alongside the gateway logs. Two attributes let you break those metrics down without parsing prompt content.

Set custom dimensions through OTEL_RESOURCE_ATTRIBUTES. As of Claude Code v2.1.161, its keys are attached as labels on every metric datapoint, so you can group usage by team, department, or repository:

export OTEL_RESOURCE_ATTRIBUTES="department=engineering,team.id=platform,cost_center=eng-123"

Attachment is on by default; set OTEL_METRICS_INCLUDE_RESOURCE_ATTRIBUTES=false to suppress it. Values must be comma-separated key=value pairs with no spaces (percent-encode any spaces in values).
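
If the attribute values come from a directory or HR system, they may contain spaces or other reserved characters. The helper below is a sketch of one way to build a conforming value; the function name is hypothetical, and it percent-encodes every reserved character in values, which is stricter than encoding spaces alone.

```python
from urllib.parse import quote


def otel_resource_attributes(attrs: dict) -> str:
    """Join attributes as comma-separated key=value pairs, percent-encoding values."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in attrs.items()
    )


print(otel_resource_attributes({"department": "engineering", "team.id": "Platform Team"}))
# department=engineering,team.id=Platform%20Team
```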

To see how a session was launched, opt in to the app.entrypoint metric attribute (default: excluded):

export OTEL_METRICS_INCLUDE_ENTRYPOINT=true

It reports values such as cli, sdk-cli, sdk-ts, sdk-py, or claude-vscode, which separates interactive developer usage from headless SDK and CI consumption when you attribute cost.
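
For cost attribution, the entrypoint values can be folded into two buckets. The sketch below assumes that values prefixed with `sdk-` represent headless usage and everything else is interactive; that mapping is an assumption made for illustration, so verify it against the values your deployment actually reports.

```python
from collections import defaultdict


def usage_class(entrypoint: str) -> str:
    """Classify an app.entrypoint value (assumption: 'sdk-' prefix means headless)."""
    return "headless" if entrypoint.startswith("sdk-") else "interactive"


def cost_by_usage_class(datapoints):
    """Sum cost per usage class from (entrypoint, cost_usd) pairs."""
    totals = defaultdict(float)
    for entrypoint, cost_usd in datapoints:
        totals[usage_class(entrypoint)] += cost_usd
    return dict(totals)
```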

Prompt Caching#

Bedrock supports prompt caching, which can reduce costs by up to 90% for repeated context patterns (like CLAUDE.md and rules that load every session). Monitor cache hit rates during Cohort 1 and optimize:

  • Stable context (CLAUDE.md, rules) benefits most from caching
  • Frequently-changing context (code files) benefits less
  • Prompt caching behavior on Bedrock may differ from direct API – test explicitly
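
One way to define the cache hit rate is the share of input tokens served from cache. The sketch below assumes the gateway logs cache-read, cache-write, and uncached input token counts separately; the parameter names are hypothetical, and other definitions (for example, per-request hit or miss) are equally valid.

```python
def cache_hit_rate(cache_read_tokens: int, cache_write_tokens: int,
                   uncached_input_tokens: int) -> float:
    """Fraction of all input tokens that were read from the prompt cache."""
    total = cache_read_tokens + cache_write_tokens + uncached_input_tokens
    return cache_read_tokens / total if total else 0.0
```

A rate that stays low for sessions dominated by CLAUDE.md and rules content suggests the stable context is not being cached as expected.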

Extended Thinking Costs#

Opus uses extended thinking – a reasoning phase where the model works through complex problems before responding. Thinking tokens are billed as output tokens at the full output rate.

Why This Matters for Cost Planning#

  • Thinking tokens can be 2–10x the visible output tokens on complex tasks
  • A developer using Opus for architecture work might generate 50K+ thinking tokens per session
  • These tokens don’t appear in Claude’s response but show up in your bill
  • The MAX_THINKING_TOKENS env var can cap thinking budget (default: 31,999, max: 63,999)

Cost Mitigation#

  • Default developers to Sonnet (any thinking tokens are billed at Sonnet's lower output rate)
  • Gate Opus access to senior engineers or specific use cases via the LLM gateway
  • Monitor thinking token consumption in gateway metrics. If your gateway logs only input/output tokens, thinking tokens are included in the output count, so break them out separately where the gateway supports it
  • Set MAX_THINKING_TOKENS in managed-settings.json env to cap per-request thinking cost for Opus users

Rough Cost Impact#

At Bedrock Opus pricing, thinking tokens cost the same as output tokens. A heavy Opus user generating 100K thinking tokens/day is billed for them at the full output rate every working day, on top of visible output. Factor this into per-user budgets for Opus-authorized developers.
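
The monthly impact is straightforward arithmetic once you plug in the current output-token price. In the sketch below, the price in the example is a placeholder chosen for easy arithmetic, not a real Bedrock rate, and the number of working days is an assumption; take both from your own pricing page and calendar.

```python
def thinking_cost_usd(thinking_tokens_per_day: int,
                      output_price_per_mtok_usd: float,
                      working_days: int = 20) -> float:
    """Monthly cost of thinking tokens, billed at the output-token rate."""
    return thinking_tokens_per_day * output_price_per_mtok_usd * working_days / 1_000_000


# Placeholder price of 10.0 USD per million output tokens (NOT a real rate).
print(thinking_cost_usd(100_000, output_price_per_mtok_usd=10.0))  # 20.0
```

Multiply the per-user figure by the number of Opus-authorized developers to size the thinking-token share of the Opus budget.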

Provisioned Throughput#

At 500 developers, on-demand Bedrock may hit rate limits during peak hours. Consider provisioned throughput for Sonnet to guarantee capacity. Provisioned throughput costs more but provides:

  • Guaranteed request rate
  • Predictable latency
  • No throttling during peak usage

Evaluate after Cohort 2 based on observed peak concurrent usage.