
The MCP Token Tax

MCP won the read path. 97 million downloads, 13,000+ servers, an open standard the whole industry adopted in 16 months. And almost nobody noticed the bill it left behind — or the gap it didn’t close.

I was in the middle of a working session with Cowork today — connected to Slack, Outlook, Salesforce, a browser, a code environment, and a handful of other tools — when I asked the agent a question that made me stop.

The agent knew it was running with 200+ tools loaded. I asked whether that was a problem.

It said: yes, actually. Every tool definition in every connected MCP server gets serialized into the model’s context window at session start. Before I type a single word, a significant fraction of the available context is already spent — on tool schemas, parameter descriptions, and function signatures I may never use.

I’d been thinking about token cost at the filesystem layer — path length, folder names, how much of a CLAUDE.md loads into working memory. This is the same problem two layers up. But as I followed the thread, I found something more interesting than a token problem: a design problem hiding underneath it.


The number that stopped me

MyClaw published a benchmark I keep coming back to. They ran a simple task — checking which programming language a repository was written in — two ways: via CLI and via MCP.

CLI: 1,400 tokens. MCP: 44,000 tokens.

Same task. Same result. 31x more expensive through the agent interface.

That’s not a pathological case. That’s a routine task on a modest setup. Maxim AI documented a 508-tool setup burning 75 million input tokens before the session did any real work — 92.8% of total input tokens consumed before the user spoke. $377 in token costs reduced to $29 when they intervened.

The math is linear and unforgiving: more servers → more tools → more tokens consumed before anything happens. There’s no filtering, no relevance check, no “load only what this conversation needs.” Everything loads. Every time.
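The arithmetic is easy to sanity-check. A minimal sketch follows; the per-tool figure is an assumption for illustration, not a measurement, so substitute your own average:

# Rough overhead model: every connected tool's schema is serialized
# into context at session start. The per-tool token count is an
# illustrative assumption, not a benchmark.
TOKENS_PER_TOOL = 350  # assumed average: name + description + JSON Schema

def session_overhead(num_tools: int, context_window: int = 200_000) -> None:
    overhead = num_tools * TOKENS_PER_TOOL
    print(f"{num_tools} tools -> ~{overhead:,} tokens "
          f"({overhead / context_window:.0%} of a {context_window:,}-token window)")

for n in (40, 200, 508):
    session_overhead(n)
# 40 tools  -> ~14,000 tokens  (7% of a 200,000-token window)
# 200 tools -> ~70,000 tokens  (35%)
# 508 tools -> ~177,800 tokens (89%)

The slope depends on how verbose your schemas are, but the shape doesn't change: overhead is tools times average schema size, paid on every session.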


Why this happens: MCP’s default behavior

MCP’s default is simple — when a session starts, every tool definition from every connected server gets loaded into the model’s context. All of them. All at once.

This made sense at the protocol’s origin. A handful of tools, a narrow context window, a small set of use cases. Load everything, let the model decide.

The world changed faster than the default. By early 2026, production setups routinely connect 5–10 MCP servers, each exposing 10–60 tools. A knowledge worker’s full stack lands at 200–400 tool definitions before the first message. The protocol was designed for tens of tools. It’s running hundreds.
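For a concrete sense of what each definition costs: every entry a server returns from tools/list carries a name, a description, and a full JSON Schema for its parameters, all serialized into context. The example below is a hypothetical Slack-style tool, shaped like the protocol's tool definitions rather than copied from any real server:

# A hypothetical tools/list entry, shaped like an MCP tool definition
# (name, description, inputSchema as JSON Schema). Every connected
# tool contributes one of these to the context at session start.
slack_post_message = {
    "name": "post_message",
    "description": "Post a message to a Slack channel on behalf of the user.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string", "description": "Channel ID or name"},
            "text": {"type": "string", "description": "Message body"},
            "thread_ts": {"type": "string", "description": "Optional thread timestamp"},
        },
        "required": ["channel", "text"],
    },
}
# One modest definition like this serializes to a few hundred tokens.
# Multiply by 200-400 tools and the overhead in the benchmarks follows.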


What MCP is genuinely good at

The token tax is real, but it shouldn’t obscure what MCP actually solved.

MCP is excellent at read operations on structured systems. Reading Slack messages, querying Salesforce records, fetching calendar events, pulling file contents. The pattern is consistent: the agent needs to retrieve something, the system has a stable schema, and the response is structured data the agent can reason about.

For this use case, MCP is the right architecture. Centralized auth, consistent schema, agent-discoverable capabilities, a universal transport. The 97 million SDK downloads and 13,000+ servers in 16 months represent an ecosystem that mostly built read operations — and built them well.

The read path is solved. That’s real progress.


What MCP is not good at

The write path is not solved.

MCP tools are atomic by design. Each tool is a single, well-defined operation with a name, parameters, and a return value. email_read returns emails. crm_search returns records. post_message sends a message. Clean, typed, predictable.

The problem surfaces the moment you need to compose those operations into a workflow. Consider something straightforward: “Draft a reply to the highest-priority email from a Tier 1 account, check if we have an open opportunity in Salesforce, and if yes, post a summary to the account Slack channel.”

That’s three tools in sequence, with conditional logic, state carried between steps, and a decision point in the middle. MCP has no native mechanism for this. There’s no piping, no state management, no conditional branching at the protocol layer. The agent’s reasoning loop has to carry all of it — which is the wrong place for durable orchestration logic.

CLI doesn’t have this problem. Unix pipes are composability by design:

az vm list --show-details --query "[?powerState=='VM running']" -o json \
  | jq -r '.[].name' \
  | xargs -I {} check_compliance "{}"

Each command outputs to the next. The pipeline is the workflow. Retry logic, conditional branching, state — all of this is the shell’s job, not the reasoning model’s job.

The deeper reason CLI wins at composition: models are trained on fifty years of accumulated Unix usage. They know grep | awk | jq cold. They don’t need to read a schema. MCP’s patterns, by contrast, are 18 months old. The model has to reason its way through every tool chain from first principles.


The emerging picture: three primitives, not one

The honest answer to “how should agents interact with systems” is not MCP for everything. It’s three primitives for three use cases:

MCP for reads. Structured retrieval from external systems where schema consistency and auth matter. Slack, Salesforce, calendar, files. This is MCP’s home territory and it excels here. Keep it.

CLI for orchestration. Multi-step workflows, conditional logic, state across operations. The agent writes and executes a shell pipeline. Composable, token-efficient, and built on patterns the model already knows. Benchmarks show 33% better token efficiency vs. MCP equivalents, and CLI completed tasks in browser automation that MCP simply couldn’t.

Code execution for complex configuration. When CLI isn’t enough — programmatic logic, loops, data transformation, decisions based on intermediate results — the agent writes Python and runs it. Anthropic explicitly recommends this path for complex workflows. Code is the most flexible write primitive available, and a sandboxed execution environment makes it safe.
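Here is a minimal sketch of the email-to-CRM-to-Slack workflow from earlier, written as agent code. The three helpers are hypothetical stand-ins for whatever read and write tools the execution environment actually exposes:

# Sketch of the workflow as agent-written code. The helper functions
# are hypothetical stand-ins, not any real environment's API.

def fetch_highest_priority_email(tier: int) -> dict:
    # Stand-in: would call the mail tool and filter by account tier.
    return {"account": "Acme Corp", "subject": "Renewal question", "sender": "cto@acme.example"}

def find_open_opportunity(account: str) -> dict | None:
    # Stand-in: would query the CRM for open opportunities on the account.
    return {"id": "006XX", "stage": "Negotiation"}

def post_summary(channel: str, text: str) -> None:
    # Stand-in: would post to the account's Slack channel.
    print(f"[{channel}] {text}")

email = fetch_highest_priority_email(tier=1)
draft = f"Hi {email['sender']}, thanks for reaching out about {email['subject']}..."

opportunity = find_open_opportunity(email["account"])
if opportunity:  # the conditional branch MCP has no protocol-level home for
    post_summary(
        channel=f"#acct-{email['account'].lower().replace(' ', '-')}",
        text=f"Drafted reply re: {email['subject']}; open opp {opportunity['id']} "
             f"at stage {opportunity['stage']}.",
    )

State lives in ordinary variables, the branch is an if statement, and the model writes the script once instead of carrying the workflow across reasoning turns.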

The token tax problem looks different through this lens. Part of why production setups have 400 MCP tools is that teams are using MCP for things CLI handles better. If configuration and orchestration tasks moved to where they belong architecturally, the MCP tool count drops, the token overhead drops, and the remaining MCP sessions are genuinely read-heavy and benefit from the protocol.


Progressive loading is still needed — just at a smaller scale

The infrastructure fixes the ecosystem is building — progressive discovery, semantic tool routing — are still the right direction for the read path. A vector index over tool descriptions, queried at session start, returning only the 3–5 most relevant tools, benchmarks at 97.1% hit rate and 98–100x token reduction. Local-first implementations run under 2MB with sub-10ms query latency.

(A note on the protocol: tools/list_changed is a change-notification mechanism — it tells the client the tool list has been updated, not a lazy-loading primitive. Claude Code shipped native deferred loading via ToolSearch in May 2026, validating the pattern. Most other MCP clients, including Claude Desktop’s Cowork surface and third-party agent harnesses, still load everything upfront. The proxy pattern in the rest of this series closes that gap for any client.)
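A minimal sketch of that routing layer, with a deliberately crude bag-of-words similarity standing in for a real embedding model; the names here are mine, not any client's API:

# Semantic tool routing sketch: index one-line tool descriptions,
# score them against the user's first message, load only the top-k.
# The bag-of-words "embedding" is a crude stand-in for a real model;
# the routing structure is the point.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

TOOL_DESCRIPTIONS = {  # one line per tool; full schemas stay unloaded
    "email_read": "read and search emails in the user's inbox",
    "crm_search": "search salesforce accounts and opportunities",
    "post_message": "post a message to a slack channel",
    "calendar_list": "list upcoming calendar events",
}

def route(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOL_DESCRIPTIONS,
                    key=lambda t: cosine(q, embed(TOOL_DESCRIPTIONS[t])),
                    reverse=True)
    return ranked[:k]  # only these tools' full schemas get injected

print(route("summarize the latest salesforce opportunities"))
# ['crm_search', 'email_read']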

Another desktop agent platform I use takes this further at the platform level. It exposes ~500 tools across built-in capabilities, managed connectors (Slack, Outlook, Salesforce, OneDrive), and custom MCP servers — but only ~35 base tools load at session start. The remaining ~465 are organized into lazy-loaded “skills” — one-line descriptions visible to the agent, full schemas injected on demand only when the task requires them. No proxy, no discovery tool, no extra turn cost. The platform itself is the progressive disclosure layer. It’s the cleanest implementation of this pattern I’ve seen: the agent has a map of 500 tools but pays the token cost of 35 until it actually needs something specific. That’s ~93% deferral built into the architecture, invisible to the user.
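A sketch of what that deferral looks like structurally. The names and the two-tool schemas are assumptions for illustration, not the platform's internals:

# Progressive disclosure sketch: the agent always sees one-line skill
# summaries; a skill's full schemas are injected only on first use.
# Structure and names here are assumptions, not the platform's design.

SKILL_SUMMARIES = {
    "salesforce": "query and update Salesforce accounts, contacts, opportunities",
    "outlook": "read, search, and draft Outlook email",
    # ... hundreds more one-liners: cheap to keep in context
}

FULL_SCHEMAS = {}  # populated lazily, keyed by skill name

def load_skill(name: str) -> dict:
    if name not in FULL_SCHEMAS:
        # Stand-in for fetching the skill's full tool schemas; only
        # now does the token cost get paid.
        FULL_SCHEMAS[name] = {"tools": [f"{name}_search", f"{name}_update"]}
    return FULL_SCHEMAS[name]

# Session start: the agent's map costs one line per skill.
# First Salesforce task: pay for that one skill alone.
print(load_skill("salesforce"))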

But progressive loading on a 400-tool MCP setup is solving the symptom. The cause is using MCP as the universal agent integration primitive when it was designed for structured reads.

Fix the architecture first. The token tax largely fixes itself.


The open question

The read path has a protocol: MCP. The write path doesn’t have a consensus answer yet.

CLI and code execution are the best current options — and the benchmarks back them up. But neither is as ergonomic as MCP for agent integration. There’s no universal discovery layer for CLI tools. There’s no standard schema format. The agent has to know the tools exist and roughly how they work.

Workflow engines — Temporal, n8n, Prefect — add durable execution and state management, but they’re infrastructure that needs to be deployed and maintained. Agent frameworks like LangChain and DSPy add orchestration on top of MCP, papering over the composition gap without solving it.

MCP won the read path in 16 months. The write path problem is well-defined now. That’s usually the precondition for a protocol to emerge.

I’d watch for it.


If you’re running a large MCP setup and have measured your token overhead — or if you’ve found a durable answer to the write path problem — I’d be interested in the numbers.

