Why AI Coding Tools Are Getting Cheaper: Prompt Caching Explained

If you’ve used AI coding assistants lately — Cursor, Claude Code, GitHub Copilot, Windsurf, Kiro, Cline, Aider — you’ve probably noticed they feel faster and cheaper than they did a year ago. The models haven’t shrunk. The context windows have grown. So what changed?

The answer is mostly infrastructure, and specifically one technique: prompt caching.

This isn’t a minor optimization. It changes the economics of AI-assisted work in a way that’s worth understanding — not just for coding, but for any workflow where you’re repeatedly sending overlapping context to a model.


How It Works

Every time you send a request to an LLM, the model processes your entire input from scratch. If your prompt includes system instructions, file contents, or conversation history that doesn’t change between requests, you’re paying to reprocess the same tokens on every call.

Prompt caching breaks that pattern. After the first request, the static portions of your prompt are stored. Subsequent requests reuse the cached content at a fraction of the cost.

Without caching:

Request 1: [System Prompt] + [File Contents] + [Question 1] → full processing
Request 2: [System Prompt] + [File Contents] + [Question 2] → full processing again
Request 3: [System Prompt] + [File Contents] + [Question 3] → full processing again

With caching:

Request 1: [System Prompt] + [File Contents] + [Question 1] → full processing, static prefix cached
Request 2: [cached content retrieved] + [Question 2] → 90% cheaper
Request 3: [cached content retrieved] + [Question 3] → 90% cheaper
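
What this looks like at the API level, for tools or scripts that call a model directly: you mark where the stable prefix ends, and the provider caches everything up to that point. Here's a minimal sketch using the Anthropic Python SDK's cache_control marker; the model name, file path, and system prompt are placeholders, and providers typically require the cached prefix to exceed a minimum token count before anything is cached.

```python
# Minimal sketch: mark the stable prefix (system prompt + file contents) as
# cacheable and keep the changing part (the question) outside the cached block.
# Assumes the Anthropic Python SDK; model name and file path are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

file_contents = Path("src/parser.py").read_text()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a coding assistant. Answer questions about the provided file.",
            },
            {
                "type": "text",
                "text": file_contents,
                # Everything up to and including this block is cached after the
                # first request; later requests read it at the discounted rate.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# Request 1 writes the cache; requests 2 and 3 read from it.
print(ask("How does this function work?"))
print(ask("Can you optimize this loop?"))
```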

The pricing structure makes this concrete. Using Claude Sonnet as a reference point (Anthropic publishes these rates publicly):

Tier             Rate             When it applies
Cache write      1.25x standard   First time content is cached
Cache read       0.1x standard    Reusing cached content
Standard input   1x               Uncached tokens

At $3.00 per million input tokens as the baseline, cache reads cost $0.30 per million, a 90% discount. A cache write costs $3.75 per million, 25% more than standard. The first reuse more than covers that premium: you pay an extra $0.75 per million to write and save $2.70 per million on the very next read. Everything after that is savings.


Why 5-Minute TTL Is the Standard

Most providers offer two time-to-live (TTL) options: 5 minutes and 1 hour (with 1-hour TTL becoming more available in 2025-2026 on newer models). For coding tools, the 5-minute window has become the default — and it’s not arbitrary.

Coding happens in tight iteration loops:

00:00 — open file, ask "how does this function work?"
00:30 — ask "can you optimize this loop?"
01:00 — ask "what's the time complexity?"
01:30 — ask "add error handling here"

Every query after the first hits the cache. A focused 20-minute debugging session might involve 15 questions, all against the same file context. With 5-minute caching and cache-read pricing, costs drop roughly 80% versus uncached.

The 5-minute window also matches how code changes. During active development, files evolve constantly. A longer-lived cache could hold a stale version of the file you just edited, and the model would suggest changes based on outdated context. The 5-minute TTL is short enough to stay current, long enough to cover an active session.

There’s also a cost argument for preferring 5-minute over 1-hour when you’re in an active session. The 1-hour cache write costs roughly 2x standard input (versus 1.25x for 5-minute). For a 20K token file with 10 queries over 15 minutes, 5-minute caching is actually cheaper than 1-hour because the write cost is lower and cache hits keep refreshing the window.

The 5-minute approach has one real weakness: if you step away for more than 5 minutes, the cache expires and you pay full price on the next request. That’s the only compelling use case for 1-hour TTL in a coding context — covering interruptions.


What Gets Cached

Most coding tools handle caching internally — you can’t configure TTL in Cursor or GitHub Copilot. But understanding what gets cached helps you work with the grain of the system rather than against it.

Tools generally cache the stable portions of each request: the system prompt and instructions, tool definitions, the file contents pulled into context, and earlier conversation history.

They don’t cache entire codebases. Instead, they use techniques like semantic search to retrieve relevant files, summarization to compress large files, and sliding windows to keep recent conversation while compressing older history. What gets cached is the result of that selection — a targeted, manageable context rather than a full repo dump.

How the major tools approach it

Cursor and Windsurf handle caching entirely internally, with no user-facing configuration. They use context compression and semantic retrieval to keep cached contexts small and relevant.

Claude Code currently uses 5-minute TTL across the board. There’s an open feature request for 1-hour support — the infrastructure exists at the API level, but it hasn’t been surfaced in the tool yet.

GitHub Copilot manages caching at the backend infrastructure level. Microsoft has been expanding context management capabilities under the Copilot memory umbrella.

Kiro is built on top of LLM infrastructure that supports both 5-minute and 1-hour TTL at the platform level. Specification-driven workflows benefit from longer cache windows since specs don’t change between tasks.

Cline and Aider are model-agnostic — their caching behavior depends on whichever underlying provider you’re using. Both are transparent about token usage, which makes it easier to verify whether caching is working.
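
If you're calling a provider's API directly rather than going through one of these tools, the response's usage block is the easiest way to check that caching is actually working. Here's a minimal sketch against an Anthropic messages response; the field names are the ones its prompt-caching feature reports, but treat the exact attribute names as an assumption to verify against your SDK version.

```python
# Minimal sketch: inspect the usage block of an Anthropic messages response
# to see whether the request wrote to or read from the prompt cache.
# `response` is the return value of a client.messages.create(...) call like
# the earlier example; attribute names follow the prompt-caching docs.
def report_cache_usage(response) -> None:
    usage = response.usage
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0  # billed at 1.25x
    read = getattr(usage, "cache_read_input_tokens", 0) or 0         # billed at 0.1x

    print(f"uncached input tokens: {usage.input_tokens}")
    print(f"cache write tokens:    {written}")
    print(f"cache read tokens:     {read}")

    if read:
        print("cache hit: the static prefix was reused at the discounted rate")
    elif written:
        print("cache write: the prefix is now cached for follow-up requests")
    else:
        print("no caching on this request (prefix too short, or not marked)")
```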


The Cost Math in Practice

Three realistic scenarios, using Claude Sonnet reference pricing. The figures count only the repeated input context and assume every follow-up query lands inside the 5-minute window:

Active debugging session — 20K token file, 15 queries over 20 minutes: roughly $0.90 uncached versus about $0.16 with caching (one cache write plus 14 reads), a saving of around 82%.

Code review — 50K tokens across multiple files, 8 queries over 30 minutes: roughly $1.20 uncached versus about $0.29 with caching, a saving of around 76%.

Refactoring with large context — 100K token codebase context, 20 queries over 45 minutes: roughly $6.00 uncached versus about $0.95 with caching, a saving of around 84%.
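
The arithmetic behind those figures is simple enough to check. Here's a minimal sketch using the illustrative per-million-token rates from the table above, counting only the repeated context and ignoring question and output tokens:

```python
# Minimal sketch: compare uncached vs cached input cost for a session of
# repeated queries against the same context. Rates are the illustrative
# Claude Sonnet figures quoted above (per million input tokens).
STANDARD = 3.00     # $/M tokens, uncached input
CACHE_WRITE = 3.75  # $/M tokens, 1.25x standard
CACHE_READ = 0.30   # $/M tokens, 0.1x standard

def session_cost(context_tokens: int, queries: int) -> tuple[float, float]:
    millions = context_tokens / 1_000_000
    uncached = queries * millions * STANDARD
    # First query writes the cache; the rest read it (assumes each query
    # arrives within the TTL, so hits keep refreshing the window).
    cached = millions * CACHE_WRITE + (queries - 1) * millions * CACHE_READ
    return uncached, cached

for label, tokens, queries in [
    ("debugging", 20_000, 15),
    ("code review", 50_000, 8),
    ("refactoring", 100_000, 20),
]:
    uncached, cached = session_cost(tokens, queries)
    saving = 1 - cached / uncached
    print(f"{label:12s} uncached ${uncached:.2f}  cached ${cached:.2f}  saving {saving:.0%}")
```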

The savings compound at scale. A team of developers all hitting cached contexts through a shared tool backend reduces aggregate inference costs substantially. This is part of why the economics of AI coding tools have improved faster than model pricing alone would explain.


Beyond Coding: Where This Pattern Extends

The more interesting question is where else caching changes the economics.

The underlying pattern — repeated queries against stable context — isn’t unique to code. It describes a lot of knowledge work:

Document review: A legal team reviewing contracts sends the same NDAs, policies, and clause libraries as context on every query. That context is stable across a full review session. Caching makes a 2-hour contract review substantially cheaper than a series of isolated queries.

Research synthesis: Analysts working through a corpus of reports, earnings calls, or academic papers repeatedly inject the same background material as grounding context. If that material lives in the first part of the prompt and doesn’t change, it caches.

Customer-facing workflows: Support tools that prime every interaction with the same product documentation, FAQ base, or policy set are paying full price to inject that context on every call — unless caching is enabled and the content is structured to be cacheable.

Long-running agent tasks: Agentic workflows with extended tool loops are essentially long conversations. The system prompt, tool definitions, and accumulated conversation history are all candidates for caching. This is where the 1-hour TTL (and longer — some infrastructure already supports 8-hour sessions) starts to matter more than it does in a quick coding session.
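
The mechanism is the same one coding tools use: mark the stable prefix (tool definitions, then the conversation so far) and let each loop iteration read it back from cache. Here's a minimal sketch of that pattern with the Anthropic Python SDK; the tool, file name, and model id are made up for illustration, and the exact placement of cache breakpoints is an assumption to adapt to your provider's rules.

```python
# Minimal sketch: reuse the prompt cache inside an agent loop. The tool
# definitions are marked as cacheable once; after each tool round trip the
# cache breakpoint moves to the end of the now-stable conversation so the
# next request reads the whole prefix at the discounted rate.
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "run_tests",  # hypothetical tool
        "description": "Run the project's test suite and return the output.",
        "input_schema": {"type": "object", "properties": {}},
        # Marking the last tool caches the entire tool-definition block.
        "cache_control": {"type": "ephemeral"},
    }
]

messages = [{"role": "user", "content": "Fix the failing test in test_parser.py"}]

def mark_latest_breakpoint(msgs) -> None:
    """Keep a single moving cache breakpoint on the newest stable block
    (providers cap how many breakpoints one request may contain)."""
    for msg in msgs:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                if isinstance(block, dict):
                    block.pop("cache_control", None)
    last = msgs[-1]
    if isinstance(last["content"], list) and isinstance(last["content"][-1], dict):
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}

for _ in range(5):  # bounded agent loop
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break
    tool_use = next(b for b in response.content if b.type == "tool_use")
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": "All tests passed.",  # stubbed tool output
        }],
    })
    mark_latest_breakpoint(messages)
```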

The common thread: if your workflow involves sending the same tokens repeatedly, caching is essentially free money. The only question is whether your tooling is already doing it for you.


Where This Is Heading

A few trends worth watching:

Automatic caching is becoming more common. OpenAI already applies it automatically on GPT-4o and newer models with no configuration required. The trend is toward caching that “just works” without explicit API integration.

Longer TTL options are expanding. 1-hour caching is now available on recent Claude models. Some infrastructure already supports multi-hour session windows for extended agentic workflows. As use cases push beyond single-session coding into multi-hour research or analysis tasks, longer TTL windows will matter more.

Caching as architecture — rather than an optimization applied after the fact — is going to shape how AI tools are built. The tools that get the cache hit rate right will have a structural cost advantage over tools that don’t, even with identical model access.

The future of AI tooling isn’t only about smarter models. It’s about infrastructure that makes repeated, context-heavy inference economically viable at the session and workflow level — not just at the single-query level. Prompt caching is the clearest example of that shift already in production.


Pricing examples use Claude Sonnet reference rates from Anthropic’s public documentation and are illustrative. Verify current rates at the provider level before building cost models.

