Slash LLM Bills: The Developer's Guide to Prompt Caching

Introduction to Prompt Caching

For LLM applications that send large system prompts, codebase contexts, or long chat histories, API costs are dominated by reprocessing the same input prefix on every request.

Prompt caching solves this by storing the computed state for the input prefix in fast server-side memory. When a subsequent request begins with the same prefix, the model skips reprocessing it and reuses the precomputed state, which lowers latency and cuts the price of the cached tokens by 50% to 90% on the providers compared below.
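The prefix rule can be illustrated with a toy model. The sketch below is not any provider's implementation: the token IDs are invented and `shared_prefix_len` is a hypothetical helper that only counts how many leading tokens two requests have in common, which is the portion a cache could reuse.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs: a 6-token static prefix followed by a user turn.
static_prefix = [101, 7, 7, 42, 9, 13]
first_request = static_prefix + [55, 56]
second_request = static_prefix + [88, 89, 90]

# Same content, but the dynamic token was placed before the static prefix.
reordered_request = [88] + static_prefix

hit = shared_prefix_len(first_request, second_request)      # whole static prefix reusable
miss = shared_prefix_len(first_request, reordered_request)  # nothing reusable
```

In this toy model the second request can reuse all six static tokens, while the reordered request reuses none, because the very first token already differs.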

Provider Cache Pricing Comparison

Different providers offer varying cache discounts and lifetimes. The table below outlines cache rates per million tokens as of June 2026:

Provider	Model	Standard Input Price / 1M	Cache Write Price / 1M	Cache Read Price / 1M	Cache Read Discount
Anthropic	Claude 3.5 Sonnet	$3.00	$3.75	$0.30	90%
OpenAI	GPT-4o	$2.50	$2.50 (Auto-cached)	$1.25	50%
Google	Gemini 1.5 Pro	$1.25	$1.25	$0.31	75%
DeepInfra	Llama 3.1 70B	$0.52	$0.52	$0.13	75%

*Note: Anthropic charges a 25% premium over the standard input price to write the cache, but offers a 90% discount on reads. OpenAI caches automatically, charges no write premium, and applies a 50% discount on cache hits.*
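To see when the write premium pays for itself, the arithmetic can be sketched as follows. The prices are copied from the table above; the `CacheRates` class, the `prefix_cost` function, and the assumption that the first request writes the cache and every later request hits it before it expires are illustrative simplifications. Only the cached prefix is costed, not the dynamic suffix or output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRates:
    input_per_m: float  # standard input price, USD per 1M tokens
    write_per_m: float  # cache write price, USD per 1M tokens
    read_per_m: float   # cache read price, USD per 1M tokens


# Prices taken from the comparison table above.
RATES = {
    "Claude 3.5 Sonnet": CacheRates(3.00, 3.75, 0.30),
    "GPT-4o": CacheRates(2.50, 2.50, 1.25),
}


def prefix_cost(rates: CacheRates, prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost in USD of sending the same prefix `requests` times.

    Assumes one cache write followed by (requests - 1) cache reads,
    i.e. no request arrives after the cache entry has expired.
    """
    millions = prefix_tokens / 1_000_000
    if not cached or requests == 0:
        return requests * millions * rates.input_per_m
    return millions * (rates.write_per_m + (requests - 1) * rates.read_per_m)


claude = RATES["Claude 3.5 Sonnet"]
uncached_total = prefix_cost(claude, 100_000, 10, cached=False)
cached_total = prefix_cost(claude, 100_000, 10, cached=True)
```

Under these assumptions, ten requests sharing a 100,000-token prefix cost about $3.00 uncached and about $0.645 cached on Claude 3.5 Sonnet. With a single request the cached path is more expensive ($0.375 versus $0.30), so the write premium is recovered from the second request onward.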

Developer Implementation Rules

To make prompt caching work, you must structure your API requests strategically:

1. Put Static Data First: Caches match on the prompt prefix, token by token from the very first token, so a single changed token invalidates everything after it. Place your system instructions, tool definitions, and long documents at the *very beginning* of the prompt. Dynamic components like user messages or per-request values such as random seeds must sit at the *very end*.
2. Meet Minimum Sizes and Block Boundaries: Providers only cache prefixes above a minimum length, and some then extend the cached prefix in fixed increments (e.g., Anthropic requires a prefix of at least 1,024 tokens; Gemini requires at least 32,768 tokens). Ensure your static prefix exceeds these minimums to trigger caching.
3. Manage Cache Lifetime: Cache entries expire after periods of inactivity (typically 5 to 10 minutes). For low-frequency applications, send a periodic keep-alive query shortly before the entry expires to preserve the cache state for active sessions.
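The three rules above can be sketched as small helpers. This is a provider-agnostic illustration, not a client for any real API: `PromptParts`, `build_prompt`, `is_cacheable`, and `needs_keep_alive` are hypothetical names, the minimum lengths are the figures quoted above, and the 300-second lifetime and 30-second margin are assumed defaults you would replace with your provider's documented values.

```python
from dataclasses import dataclass

# Assumed minimum prefix lengths, taken from the figures quoted above.
MIN_CACHEABLE_TOKENS = {"anthropic": 1_024, "gemini": 32_768}


@dataclass
class PromptParts:
    system: str
    documents: list[str]
    user_message: str


def build_prompt(parts: PromptParts) -> tuple[str, str]:
    """Rule 1: return (static_prefix, dynamic_suffix), static content first."""
    static_prefix = "\n\n".join([parts.system, *parts.documents])
    return static_prefix, parts.user_message


def is_cacheable(provider: str, prefix_tokens: int) -> bool:
    """Rule 2: the prefix must reach the provider's minimum length."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[provider]


def needs_keep_alive(last_used_s: float, now_s: float,
                     ttl_s: float = 300.0, margin_s: float = 30.0) -> bool:
    """Rule 3: True when the entry is still alive but within `margin_s` of expiring."""
    idle = now_s - last_used_s
    return ttl_s - margin_s <= idle < ttl_s


# Two requests that differ only in the user message share an identical prefix.
first = build_prompt(PromptParts("You are a code reviewer.", ["<repo dump>"], "Review file A"))
second = build_prompt(PromptParts("You are a code reviewer.", ["<repo dump>"], "Review file B"))
same_prefix = first[0] == second[0]
```

A scheduler could call `needs_keep_alive` on a timer and send a minimal request that reuses the static prefix only when it returns `True`, which avoids paying for refreshes on sessions that are already expired or still fresh.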