Disclaimer: Most of the information in this article is based on my experience with Codex (GPT models) and Claude Code (Anthropic models). While the core concepts are generally the same across AI coding agents, implementation details may vary between providers and models.

Why Prompt-Caching Matters

Cost reduction

AI coding agents are expensive, but they could have been much more expensive.

LLM inference is fundamentally stateless. Every new request must include the entire conversation history along with the latest user message. As a conversation grows, each subsequent prompt contains more tokens than the last.

Imagine token pricing were completely flat, so that a token in the tenth prompt cost exactly the same as a token in the first. The cost of a session would then grow roughly quadratically with the number of turns, because every new interaction pays again to resend the entire accumulated context.

To address this problem, providers introduced prompt caching, which stores the computation results (the KV cache) for previously processed context. Note that what is cached is the computation output, not the plain-text messages or the encoded tokens. Since each new prompt consists of the old prompts plus the new messages, without prompt caching the LLM would have to recompute the old context on every request, which increases cost.
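To make the cost argument concrete, here is a minimal sketch that compares a flat-priced session with one where previously seen tokens are billed at a discount. The per-turn token count, the price, and the 0.1 multiplier are illustrative assumptions, not provider figures, and the model assumes each turn adds the same number of tokens to the context.

```python
def session_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every request resends the full history."""
    # Request k carries k * tokens_per_turn tokens (history plus new message).
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def session_cost(turns: int, tokens_per_turn: int, price_per_token: float,
                 cached_discount: float = 1.0) -> float:
    """Input cost of a session.

    cached_discount is the price multiplier applied to tokens that were
    already sent in an earlier request (1.0 means no caching at all).
    """
    total = 0.0
    for k in range(1, turns + 1):
        old_tokens = (k - 1) * tokens_per_turn  # resent history
        new_tokens = tokens_per_turn            # new content in this turn
        total += (old_tokens * cached_discount + new_tokens) * price_per_token
    return total


if __name__ == "__main__":
    # Hypothetical numbers: 2,000 tokens per turn, $1 per million tokens.
    for turns in (10, 50, 100):
        flat = session_cost(turns, 2_000, 1e-6)
        cached = session_cost(turns, 2_000, 1e-6, cached_discount=0.1)
        print(f"{turns:>3} turns: flat=${flat:.4f}  cached=${cached:.4f}")
```

The flat cost grows with the square of the number of turns, while the cached cost grows much more slowly because only the new tokens of each turn are billed at full price.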

Faster responses

With prompt caching, a large portion of input tokens can be skipped during processing because their KV states have already been computed. As a result, TTFT (Time to First Token) is significantly reduced.

This also helps maintain consistent agent performance during long-running sessions, since the model no longer needs to repeatedly process the same historical context on every request.

Prompt-caching overview

How prompts are constructed

Let's revisit the earlier statement that LLM inference is stateless and look at how prompts are constructed and sent to the server.

Simplified mental model:

Prompt 1 (P1):
[System Prompt]                    |
[Tools]                            |
[Project Instructions]             | P1's input
[Codebase Context]                 |
[User messages 1]                  |
    -------
[Model Output 1]                   | P1's output

Prompt 2 (P2):
[System Prompt]                    |
[Tools]                            |
[Project Instructions]             |
[Codebase Context]                 | P2's input = P1 in + P1 out + Message 2
[User messages 1]                  |
[Model Output 1]                   |
[User messages 2]                  |

When users send the first prompt, the request contains much more than just the user's message. It also includes the system prompt, project instructions (e.g. AGENTS.md, CLAUDE.md), tool definitions (MCP servers, skills, etc.), and other contextual information.
When P2 is executed, the LLM has no memory of the previous interaction. To preserve context, P2 must include the entire previous exchange—P1 Input and P1 Output—followed by the new user message.
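The construction described here can be sketched in a few lines. This is a simplified illustration, not how any particular agent is implemented: the `Session` class, its fields, and the block strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Client-side state; the model itself remembers nothing between calls."""
    static_prefix: list  # system prompt, tools, project instructions, ...
    history: list = field(default_factory=list)

    def build_request(self, user_message: str) -> list:
        # Every request is the static prefix plus the whole history so far.
        self.history.append(f"user: {user_message}")
        return self.static_prefix + self.history

    def record_output(self, output: str) -> None:
        # The model's reply becomes part of the next request's input.
        self.history.append(f"assistant: {output}")


if __name__ == "__main__":
    session = Session(["[System Prompt]", "[Tools]", "[Project Instructions]"])
    p1 = session.build_request("message 1")
    session.record_output("output 1")
    p2 = session.build_request("message 2")
    # P2's input starts with all of P1's input.
    print(p2[:len(p1)] == p1)
    print(len(p1), len(p2))
```

The key property is that P2 begins with exactly the same blocks as P1, which is what makes prefix caching possible.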

How prompts are cached

When P2 is executed, it contains the entire context from P1, which has already been computed. Recomputing this prefix would be redundant.

To avoid that, during P1 the server stores the computation results (the KV cache) in a cache, typically in GPU memory.

When P2 arrives, the server reuses the cached KV states for the shared prefix and computes only the new messages. The resulting KV states are then appended to the cache, making them available for subsequent prompts.

In short, prompt caching lets the model reuse the computation for a shared prefix instead of reprocessing the same context on every request.

Note that the cache only avoids recomputing part of the input tokens; it does not affect output generation.
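The reuse logic can be modeled as a longest-shared-prefix lookup. The sketch below is a toy model under stated assumptions: it works on whole prompt blocks rather than tokens, keeps a single cached prompt, and the `PrefixCache` class is a hypothetical stand-in for the server-side KV cache.

```python
def shared_prefix_len(cached: list, prompt: list) -> int:
    """Number of leading blocks that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Toy stand-in for a KV cache: remembers the last processed prompt."""

    def __init__(self) -> None:
        self.cached = []

    def process(self, prompt: list) -> tuple:
        """Return (blocks reused from cache, blocks computed from scratch)."""
        reused = shared_prefix_len(self.cached, prompt)
        computed = len(prompt) - reused
        # The new states are appended, so the next prompt can reuse them.
        self.cached = list(prompt)
        return reused, computed


if __name__ == "__main__":
    cache = PrefixCache()
    p1 = ["system", "tools", "instructions", "codebase", "user 1"]
    p2 = p1 + ["output 1", "user 2"]
    print(cache.process(p1))  # nothing cached yet
    print(cache.process(p2))  # P1's blocks are reused
    # Changing an early block leaves no shared prefix at all.
    print(cache.process(["system v2"] + p2[1:]))
```

A change near the start of the prompt is the worst case, because everything after the first differing block must be recomputed.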

Key concepts

Prompt caching stores the entire prompt prefix. Any modification inside the prefix invalidates the cache from that point onward.
Cache prefixes are created in the following order: tools, system, then messages.
Cache retention:
- In-memory: for both Claude and GPT models, the cache lives in volatile GPU memory for 5 minutes to 1 hour.
  If you use Claude Code, your cache remains available for 1 hour in most cases.
- Extended prompt caching: available for GPT models only. It extends the cache lifetime to 24 hours by storing cache data in GPU-local storage. Claude models do NOT support this.
- Not all models follow the same cache retention policy:
  - In-memory caching isn't available for gpt-5.5 and gpt-5.5-pro. Their caches are always retained for 24 hours.
  - Extended caching is available only for the GPT-5.5 family of models.
Cache scopes:
- Global: the shared agent system prompt, which benefits all users of the agent.
- Project-scoped: local caches associated with a specific project and shared across sessions working within that project.

Caching approaches

There are two primary prompt-caching approaches: automatic caching and explicit cache breakpoints.

Automatic caching

Both Codex and Claude support automatic caching.

With automatic caching, the system caches all content up to and including the last cacheable block. On subsequent requests with the same prefix, cached content is reused automatically. As conversations grow, the cache breakpoint automatically moves forward.

Explicit cache breakpoints

Claude supports explicit cache breakpoints, while Codex does NOT.

With cache breakpoints, developers can explicitly mark sections of the prompt as cacheable. This provides finer control over cache behavior and helps maximize cache hits when working with large prompts that contain both static and dynamic content.

A practical way to understand prompt caching is to inspect the actual requests sent to the models.

I used mitmproxy to capture the requests from Claude Code. The following example shows a simplified request payload with explicit cache breakpoints:

// Claude Code request payload
{
    "model": "claude-sonnet-4-6",
    "tools": [
        // Full tool definitions
    ],
    "system": [
        {   "text": "You are Claude Code, Anthropic's official CLI for Claude."   },
        {
            "text": "<claude-system-prompt>",
            // IMPORTANT: Global cache, shared across all Claude users.
            "cache_control": {
                "type": "ephemeral",
                "ttl": "1h",
                "scope": "global"
            }
        },
        {
            // Some local paths, project path, git status are also included in here
            "text": "<memory-and-environment-prompt>",
            // IMPORTANT: Project-scoped cache, can be shared
            // across sessions if git status doesn't change
            "cache_control": {
                "type": "ephemeral",
                "ttl": "1h"
            }
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {   "text": "<built-in-tools-declaration>"  },
                {   "text": "<agents-declaration>"  },
                {   "text": "<mcp-servers-declaration>"  },
                {   "text": "<skills-declaration>"  },
                {   "text": "Auto-Mode Active"  },
                {
                    // Claude injects session-specific data such as local file
                    // paths, user email, and the current date into the prompt. 
                    // 
                    // As a result, these prompt blocks become mostly unique to each
                    // session and cannot be effectively shared through prompt caching.
                    // 
                    // Consider upvoting 
                    // https://github.com/anthropics/claude-code/issues/53231 
                    // so that Claude addresses this issue
                    "text": "<~/.claude/CLAUDE.md + project/CLAUDE.md>"
                },
                {
                    "text": "Hi, how are you today?",
                    "cache_control": { 
                        "type": "ephemeral",
                        "ttl": "1h"
                    }
                }
            ]
        }
    ]
}

Claude uses the cache_control field to define cache breakpoints. In the payload above, there are 3 cache_control markers, corresponding to 3 cached prefixes:
```
[Tools]
[System 1]                 - cache_control n°1
[System/Environment 2]     - cache_control n°2
[Messages 1]
[Messages 2]               - cache_control n°3
```
- If Message 1 or Message 2 changes -> cache 3 is invalidated ❌, caches 1 and 2 stay ✅.
- If Environment 2 changes -> caches 2 and 3 are invalidated ❌, cache 1 stays ✅.
- If Tools or System 1 changes -> all caches are invalidated, so the next prompt gets no cache hit ❌.
- Claude allows a maximum of 4 cache_control breakpoints per request.

All caches are retained for 1 hour.
Many pieces of environment data can affect cache validity: git branch, git status, working directory, current date, etc.
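The invalidation rules for the three breakpoints follow from one idea: a breakpoint caches everything from the start of the prompt up to and including its own block, so it survives only if the changed block comes after it. A minimal sketch, where the function name and the block labels are hypothetical and mirror the diagram:

```python
def surviving_breakpoints(blocks: list, breakpoints: list,
                          changed_block: str) -> list:
    """Breakpoints whose cached prefix is still valid after a block changes.

    A breakpoint on block i covers blocks[0..i], so it stays valid only
    when the changed block sits strictly after position i.
    """
    changed_at = blocks.index(changed_block)
    return [bp for bp in breakpoints if blocks.index(bp) < changed_at]


if __name__ == "__main__":
    blocks = ["Tools", "System 1", "Environment 2", "Message 1", "Message 2"]
    breakpoints = ["System 1", "Environment 2", "Message 2"]
    for changed in blocks:
        kept = surviving_breakpoints(blocks, breakpoints, changed)
        print(f"{changed} changed -> still cached: {kept}")
```

Placing stable content first and volatile content last keeps as many breakpoints valid as possible.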

Codex does NOT follow this cache_control system.

Codex request payload sample

// Codex request payload
{
  "type": "response.create",
  "model": "gpt-5.5",
  "instructions": "<Codex system prompt>",
  "input": [
    {
      "type": "message",
      "role": "developer",
      "content": [
        { "text": "<permissions instructions>" },
        { "text": "<collaboration_mode>" }, // Default Mode, Plan Mode
        { "text": "<apps_instructions>" },
        { "text": "<skills_instructions>" },
        { "text": "<plugins_instructions>" }
      ]
    },
    {
      "type": "message",
      "role": "user",
      "content": [
        { "text": "<AGENTS.md : Global + Project-scoped>" },
        { "text": "<environment_context>" } // Local paths, current date,
      ]
    },
    {
      "type": "message",
      "role": "user",
      "content": [
        { "text": "Hello there!" }
      ]
    }
  ],
  "tools": [
    // tools definition
  ],
  "prompt_cache_key": "{thread/session cache key}",
  "client_metadata": {
    "{installation/session id/window metadata}"
  }
}

Cache routing

What happens if a request is routed to a different server than the one holding the cached prefix?

Short answer: a cache miss.

Since prompt caches are typically stored either in-memory, or in server-local storage, a request arriving at a different server generally cannot reuse the cached prefix, even if the prompt itself is identical.

GPT models address this through the prompt_cache_key parameter. In Codex, this parameter is set to the session ID. Requests sharing the same prompt_cache_key are more likely to be routed to the same backend servers, which increases the cache hit rate.

At the time of writing, Anthropic does not publicly document an equivalent mechanism.
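To see why a stable key helps, consider a purely illustrative router that hashes the key to pick a backend. This is an assumption for the sake of the example and not a description of how any provider actually routes requests; the `pick_server` function and the server names are hypothetical.

```python
import hashlib


def pick_server(prompt_cache_key: str, servers: list) -> str:
    """Map a cache key to a server deterministically (illustrative only)."""
    digest = hashlib.sha256(prompt_cache_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    servers = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # hypothetical backends
    session_key = "session-1234"                    # hypothetical session id
    # Every request in the session lands on the same backend,
    # which is the one holding the cached prefix.
    picks = {pick_server(session_key, servers) for _ in range(5)}
    print(len(picks) == 1)
```

With a scheme like this, requests that share a key always reach the server that already holds their prefix, while requests without a stable key could be spread across servers and miss the cache.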

Prompt cache cost

In general, cache hits significantly reduce costs compared to processing fresh input tokens. For both Claude and Codex, cached tokens are typically 10× cheaper than regular input tokens.
In some cases, especially with Anthropic models, writing to the cache costs more than regular input tokens. For Codex, there is no separate cache-write cost: cache creation is billed as normal input tokens.
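These pricing rules can be folded into a single "effective input tokens" figure per request. The sketch below takes its multipliers from this article (cache reads at roughly 0.1x, cache writes at 2x for Anthropic models and 1x for Codex); treat them as assumptions and check current provider pricing before relying on them. The function name and the token counts are hypothetical.

```python
def effective_input_tokens(fresh: int, cache_write: int, cache_read: int,
                           write_multiplier: float = 2.0,
                           read_multiplier: float = 0.1) -> float:
    """Input tokens weighted by their price relative to a regular input token."""
    return (fresh
            + cache_write * write_multiplier
            + cache_read * read_multiplier)


if __name__ == "__main__":
    # Hypothetical request: 40,000 input tokens in total.
    no_cache = effective_input_tokens(fresh=40_000, cache_write=0, cache_read=0)
    # Same request when 39,000 tokens are read from cache and 500 are written.
    warm = effective_input_tokens(fresh=500, cache_write=500, cache_read=39_000)
    # Cold start: everything is written to the cache at the higher write price.
    cold = effective_input_tokens(fresh=0, cache_write=40_000, cache_read=0)
    print(f"no cache: {no_cache:.0f}, warm: {warm:.0f}, cold write: {cold:.0f}")
```

A cold write with a 2x multiplier costs more than not caching at all, so the cache only pays off when the written prefix is read back at least once before it expires.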

Inspecting Prompt Cache Usage

Both Codex and Claude Code expose token usage information for each request, making it possible to observe how prompt caching behaves in practice.

Codex

Session data is stored under:

$HOME/.codex/sessions/{year}/{month}/{date}/{session_id}.json

{
  "total_token_usage": {
    "input_tokens": 51951,
    "cached_input_tokens": 23168,
    "output_tokens": 418,
    "reasoning_output_tokens": 233,
    "total_tokens": 52369
  }
}

Claude Code

Session logs are found at:

$HOME/.claude/projects/{project}/{sessionId}.json

{
  "usage": {
    "input_tokens": 1,
    "cache_creation_input_tokens": 642,
    "cache_read_input_tokens": 39187,
    "output_tokens": 689
  }
}
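From these usage objects, a cache hit rate can be derived. The sketch below assumes that Codex's input_tokens already includes cached_input_tokens, while Claude's input_tokens counts only uncached tokens and excludes cache reads and writes. The function names are hypothetical, and the sample values are the ones shown above.

```python
def codex_cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache (cached is a subset of input)."""
    total = usage["input_tokens"]
    return usage["cached_input_tokens"] / total if total else 0.0


def claude_cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache (three disjoint counters)."""
    total = (usage["input_tokens"]
             + usage["cache_creation_input_tokens"]
             + usage["cache_read_input_tokens"])
    return usage["cache_read_input_tokens"] / total if total else 0.0


if __name__ == "__main__":
    codex_usage = {"input_tokens": 51951, "cached_input_tokens": 23168}
    claude_usage = {
        "input_tokens": 1,
        "cache_creation_input_tokens": 642,
        "cache_read_input_tokens": 39187,
    }
    print(f"Codex:  {codex_cache_hit_rate(codex_usage):.1%}")   # 44.6%
    print(f"Claude: {claude_cache_hit_rate(claude_usage):.1%}")  # 98.4%
```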

Based on these logs, I built a tool to monitor AI coding agent usage across providers, including cache utilization, token consumption, and cache hit percentages.

https://github.com/lamphamTL/ai-plugins

Claude vs Codex caching

| Category | Item | Anthropic models | GPT models |
| --- | --- | --- | --- |
| Cache retention | 5m | ✅ | ✅ |
| | 1h | ✅ | ✅ |
| | 24h | ❌ | ✅ |
| Caching approaches | Automatic caching | ✅ | ✅ |
| | Cache breakpoints | ✅ | ❌ |
| | Mix: automatic + cache breakpoints | ✅ | ❌ |
| Caching server routing | | ❌ | ✅ |
| Cache write cost | | 2x input tokens | 1x input tokens |
| Environment data excluded from prompt inputs | Project path | ❌ | ❌ |
| | Current date (only date, not time) | ❌ | ❌ |
| | User email | ❌ | ✅ |
| | git branch / git status | ❌ | ✅ |
| Cache shareability | System prompt shared across global users | ✅❓ (not documented, based on payload analysis and cache monitoring) | ✅❓ (not documented, based on payload analysis and cache monitoring) |
| | Cache shared across 2 remote workplaces with the same setup (same project path), a typical CI scenario | ✅❓ (if git status returns the same output and prompts are sent to the same servers) | ✅❓ (if prompts are sent to the same servers) |

What's next

In this article, we've focused on the prompt-caching mechanics: cache scopes, cache retention, cache breakpoints, routing, and pricing; and how they differ between Codex and Claude.

In Part 2, we'll move from theory to practice and explore a set of habits and workflows that help maximize cache reuse when working with AI coding agents such as Claude Code and Codex.

Building AI Agent Coding Habits Around Prompt-caching - P1
