The MCP token tax has a fix. It’s not fewer tools. It’s a smarter way to load them.
In The MCP Token Tax, I wrote about a structural problem in how agents load tool schemas: everything loads at session start, before the user types a word. A production knowledge-work setup with 5–10 MCP servers and a few hundred tools can burn 100–200k tokens before the conversation begins. That’s real money, and in edge cases, it’s more tokens than the model’s context window can hold.
I’ve been running a fix. This post documents the pattern — not the implementation specifics of my setup, but the underlying architecture. It’s reusable, and I think it’s the right answer for any sufficiently large MCP deployment.
The Core Problem, Restated
MCP’s default behavior is: when a session starts, every tool schema from every connected server gets serialized into the model’s context. All tools. All at once.
This is fine at small scale. It becomes a problem when you have:
- 5+ MCP servers
- 100+ total tools
- Tool definitions with large schemas (enum fields, nested parameters, long descriptions)
The cumulative token cost grows linearly with tool count. Large schemas — common in enterprise CRM tools that enumerate dozens of status codes, or communication tools with many optional parameters — can run 1,000–5,000 tokens per tool. A 200-tool setup at even 1,000 tokens per tool is 200k tokens of overhead before the model reads your first message.
The naive fix, connecting fewer servers, is a real capability cost, not a solution.
The Pattern: Progressive Tool Disclosure
The idea is to invert the loading model. Instead of loading all tool schemas at session start and letting the model pick from them, load almost nothing at session start and give the model a mechanism to surface what it needs on demand.
Two pieces make this work:
1. A discovery tool. A single tool — let’s call it discover_tools(query) — that accepts a natural language query and returns the most relevant tool schemas. The model calls it when it needs a capability it doesn’t already have in context.
2. A forwarding tool. A single tool — call_tool(name, arguments) — that forwards any tool call to the right underlying server. The model doesn’t need to know which server owns which tool. It just calls call_tool with the name and parameters.
These two meta-tools replace hundreds of individual tool definitions at session start. At a rough estimate, two compact meta-tools run around 100–150 tokens each versus 100k–200k tokens for the full schema set. The reduction is ~99%.
The Architecture
You put a proxy between the agent and all your real MCP servers.
Agent
↕ (stdio / JSON-RPC)
Proxy
├── Server A (subprocess)
├── Server B (subprocess)
├── Server C (subprocess)
└── ... N servers
The proxy does the following:
- On startup: spawn all real MCP servers as subprocesses, perform the initialize and tools/list handshake with each, and collect all schemas.
- Build a search index: index every tool name, description, and parameter description so queries can retrieve relevant schemas.
- Expose to the agent: two meta-tools (discover_tools, call_tool) plus a small always-on set of high-frequency tools.
- At runtime: when the agent calls discover_tools("send a message"), search the index and return the top-N matching schemas. When the agent calls call_tool("post_message", {...}), route the call to the right server transparently.
The agent’s experience is: a small starting context, and the ability to expand it on demand.
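For concreteness, here is a minimal sketch of the proxy's startup and dispatch core. It assumes the official MCP Python SDK's client API (ClientSession, StdioServerParameters, stdio_client); treat the exact signatures as assumptions and the class as a sketch, not a full implementation.

```python
# Minimal sketch of the proxy core. Assumes the MCP Python SDK client API;
# exact signatures may differ across SDK versions.
from contextlib import AsyncExitStack

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


class ToolProxy:
    def __init__(self, servers: dict[str, StdioServerParameters]):
        self.servers = servers
        self.sessions: dict[str, ClientSession] = {}    # server name -> live session
        self.catalog: dict[str, tuple[str, dict]] = {}  # tool name -> (server, schema)
        self._stack = AsyncExitStack()

    async def start(self) -> None:
        # Spawn every real server as a subprocess, handshake, and collect schemas.
        for name, params in self.servers.items():
            read, write = await self._stack.enter_async_context(stdio_client(params))
            session = await self._stack.enter_async_context(ClientSession(read, write))
            await session.initialize()
            self.sessions[name] = session
            for tool in (await session.list_tools()).tools:
                self.catalog[tool.name] = (name, tool.inputSchema)
        # The discovery index would be built here from tool names and
        # descriptions (see the search sketch later in this post).

    async def call_tool(self, tool_name: str, arguments: dict):
        # The forwarding meta-tool: route by name; the agent never sees servers.
        server, _ = self.catalog[tool_name]
        return await self.sessions[server].call_tool(tool_name, arguments)
```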
The Always-On Set
Not all tools need to go through discovery. Some tools are used in nearly every session — common communication primitives, quick lookups, core identity tools. Routing those through discover_tools every time is unnecessary overhead.
The right approach: define a small always-on set that loads at session start the conventional way, alongside the meta-tools. These might be 8–12 tools total. For a knowledge-work setup, candidates include: the Slack post and get tools, email inbox and read tools, calendar view, web search, and a few core data lookups.
These load unconditionally. Everything else surfaces through discovery.
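In terms of the proxy sketch above, the always-on set is just a filter on what gets advertised at session start. The tool names below are hypothetical placeholders:

```python
# Hypothetical always-on set; the proxy advertises only these plus the two
# meta-tools at session start, while the full catalog stays behind discovery.
ALWAYS_ON = {
    "slack_post_message", "slack_get_messages",
    "email_inbox", "email_read",
    "calendar_view", "web_search",
}

def advertised_tools(catalog: dict) -> list[str]:
    # catalog keys are the tool names collected from the underlying servers
    return ["discover_tools", "call_tool"] + [n for n in catalog if n in ALWAYS_ON]
```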
The Search Problem
Discovery only works if the search actually finds the right tool. A natural language query like "check my email" needs to return email_inbox even when the tool's description says "retrieve inbox messages" and never uses the word "email."
This is a real problem if you’re implementing the search layer from scratch without external dependencies. Dense embedding models (sentence-transformers, OpenAI embeddings) handle semantic similarity well but may not be available in every environment. TF-IDF is available anywhere Python runs, but it has a synonym blindness problem — “check my email” won’t match a tool described as “retrieve inbox messages.”
The practical fix for TF-IDF environments: build a domain-specific synonym map and apply it at both index time and query time. The idea is to expand the tool documents when you index them (add “email” to anything that mentions “inbox”) and expand queries when they arrive (translate “check” → “view, retrieve, list”). This gets you most of the way to semantic matching without requiring embeddings.
Example synonyms that matter for common knowledge-work setups:
- email ↔ inbox, mail, message, correspondence
- slack ↔ message, channel, post
- find / search / look up ↔ query, retrieve, list, get
- deal / account ↔ opportunity, customer, salesforce record
- calendar ↔ meeting, event, schedule
With synonym expansion, TF-IDF handles typical natural-language queries reliably. The edge cases where it breaks down are highly domain-specific queries with unusual phrasing — these can be addressed by adding terms to the synonym map as you discover them.
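Here's a minimal sketch of that search layer using scikit-learn's TfidfVectorizer, with synonym expansion applied at both index time and query time. The synonym map, tool names, and descriptions are illustrative:

```python
# TF-IDF discovery index with a domain synonym map applied to documents and
# queries alike. Synonyms and tool names here are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SYNONYMS = {
    "inbox": ["email", "mail"],
    "email": ["inbox", "mail", "message"],
    "check": ["view", "retrieve", "list"],
    "find": ["search", "query", "retrieve", "get"],
}

def expand(text: str) -> str:
    # Append synonyms so "check my email" shares terms with "retrieve inbox messages".
    words = text.lower().split()
    extra = [syn for w in words for syn in SYNONYMS.get(w, [])]
    return " ".join(words + extra)

class ToolIndex:
    def __init__(self, tools: dict[str, str]):
        # tools: tool name -> description (concatenated with parameter docs)
        self.names = list(tools)
        self.vectorizer = TfidfVectorizer()
        self.matrix = self.vectorizer.fit_transform([expand(d) for d in tools.values()])

    def search(self, query: str, top_n: int = 5) -> list[str]:
        scores = cosine_similarity(
            self.vectorizer.transform([expand(query)]), self.matrix
        ).ravel()
        return [self.names[i] for i in scores.argsort()[::-1][:top_n]]

index = ToolIndex({"email_inbox": "retrieve inbox messages", "web_search": "search the web"})
print(index.search("check my email"))  # -> ['email_inbox', 'web_search']
```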
What the Model Sees
At session start, the model’s tool context looks roughly like:
discover_tools(query: string) → list of matching tool schemas
Search for tools relevant to a task. Returns schemas for the top matching tools.
Call this when you need a capability that isn't already in your context.
call_tool(tool_name: string, arguments: object) → tool result
Execute any available tool by name. Routes to the correct server automatically.
Use with schemas returned by discover_tools or always-on tools.
[+ 8-12 always-on tools]
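Written out as MCP-style tool definitions (the name / description / inputSchema shape that tools/list returns), the two meta-tools might look roughly like this; the wording is illustrative:

```python
# Sketch of the two meta-tool definitions in MCP tool-schema form.
META_TOOLS = [
    {
        "name": "discover_tools",
        "description": (
            "Search for tools relevant to a task. Returns schemas for the top "
            "matching tools. Call this when you need a capability that isn't "
            "already in your context."
        ),
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "call_tool",
        "description": (
            "Execute any available tool by name. Routes to the correct server "
            "automatically. Use with schemas returned by discover_tools."
        ),
        "inputSchema": {
            "type": "object",
            "properties": {
                "tool_name": {"type": "string"},
                "arguments": {"type": "object"},
            },
            "required": ["tool_name", "arguments"],
        },
    },
]
```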
When the model encounters a task it doesn’t have a schema for — say, “create a CRM opportunity” — it calls discover_tools("create opportunity"), gets the right schema back, understands the parameters, and then calls call_tool("create_opportunity", {...}).
This adds one round-trip for each new tool category the model needs. In practice, a session touches 5–15 distinct tool categories. The overhead is small relative to the context savings.
Measured Impact
I’ll use round, illustrative numbers rather than my specific setup’s measurements.
A knowledge-work agent with 6 connected MCP servers and ~200 tools might load 150–220k tokens at session start under the default behavior — before the user types anything. With the proxy pattern, session start drops to ~2,000 tokens (meta-tools + always-on set). That’s a 98–99% reduction.
At $3 per million input tokens (Claude Sonnet pricing as of early 2026), 20 sessions per day, and roughly 22 working days a month, the math:
| Loading strategy | Tokens at session start | Cost/session | Monthly (20/day) |
|---|---|---|---|
| Default MCP loading | ~200,000 | ~$0.60 | ~$270 |
| Progressive proxy | ~2,000 | ~$0.006 | ~$3 |
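A quick back-of-envelope check of those rows, assuming roughly 22 working days a month, which matches the table's rounding:

```python
# Rough reproduction of the table's arithmetic.
PRICE_PER_TOKEN = 3 / 1_000_000        # $3 per million input tokens
SESSIONS_PER_MONTH = 20 * 22           # 20 sessions/day, ~22 working days

for label, tokens in [("default", 200_000), ("proxy", 2_000)]:
    per_session = tokens * PRICE_PER_TOKEN
    print(label, round(per_session, 3), round(per_session * SESSIONS_PER_MONTH))
# default 0.6 264    -> ~$270/month
# proxy 0.006 3      -> ~$3/month
```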
The savings compound if you’re running many concurrent sessions or a team deployment.
The more interesting constraint isn’t cost — it’s context window utilization. A model with a 200k-token context window that burns 200k tokens on tool schemas before the user speaks has zero room for anything else: conversation history, documents, retrieved content, intermediate reasoning. Progressive loading makes the context window available for actual work.
Implementation Notes
A few things I learned building this that aren’t obvious from the description:
stdio proxy vs. HTTP proxy. MCP servers come in two flavors: stdio (spawned as subprocesses, communicate via stdin/stdout) and HTTP (running as network services). Many off-the-shelf proxy implementations only handle one or the other. If your setup uses stdio servers — common for locally-installed tooling — you need a proxy that can spawn subprocesses and manage their lifecycle, not just forward HTTP requests.
Path and environment gotchas. The proxy inherits its environment from whatever launched it. If it’s launched from a desktop app (like Claude Desktop), the PATH and HOME may differ from what your terminal sees. Tools that depend on credentials, shell config, or binaries not in the standard path will fail silently. Make the env explicit in the proxy config rather than relying on inheritance.
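One way to make that explicit, assuming StdioServerParameters (as in the earlier proxy sketch) accepts an env mapping; the command paths and variable names are illustrative:

```python
# Spell out the environment for each spawned server instead of inheriting
# whatever PATH/HOME the desktop app launched the proxy with.
import os
from mcp import StdioServerParameters

crm_server = StdioServerParameters(
    command="/usr/local/bin/node",          # absolute path, not a bare "node"
    args=["/opt/mcp/crm-server/index.js"],
    env={
        "HOME": os.path.expanduser("~"),
        "PATH": "/usr/local/bin:/usr/bin:/bin",
        "CRM_API_TOKEN": os.environ.get("CRM_API_TOKEN", ""),
    },
)
```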
Startup latency. The proxy has to initialize every server and collect all schemas before it can serve its first request. Depending on how many servers you have and whether any require auth handshakes (OAuth, SSO), this can take 5–15 seconds. Plan for it. A good proxy logs startup progress so you can see what’s slow.
Index quality matters. The discovery results are only as good as the tool descriptions in the underlying server schemas. Well-described tools with clear parameter names return correctly. Poorly described tools (terse one-liners, generic names, missing parameter descriptions) return noisily. If discovery feels unreliable for a specific tool, the fix is usually in the server’s tool description, not the search algorithm.
Connection to the Larger Argument
In the token tax post, I argued that progressive loading was “solving the symptom” if the underlying problem was using MCP for things that belong in the CLI or in code-execution primitives. That’s still true.
But the meta-tool pattern doesn’t conflict with that argument — it addresses a different scope. Even a disciplined MCP deployment that reserves the protocol for structured reads will accumulate enough tools to make session-start loading expensive. The pattern I’ve described here isn’t a workaround for an architectural mistake. It’s a reasonable infrastructure layer for any deployment above a modest tool count.
The cutoff where it’s worth building: somewhere around 50 tools or any individual tool with a schema over 2,000 tokens. Below that, load everything. Above it, load nothing and discover on demand.
So What
The MCP ecosystem solved the read path. It didn’t solve the loading cost. Progressive tool disclosure is a straightforward pattern — a thin proxy, a search index, two meta-tools — that keeps the context window available for work rather than spending it on schemas the model may never use.
The two components that make it real: a proxy that speaks the MCP stdio protocol bidirectionally, and a search layer that handles natural language paraphrasing, not just exact term matching. Everything else follows from those two pieces.
If you’re running a large MCP setup and have measured your session-start token cost — or if you’ve shipped a variant of this pattern — I’d be curious what the discovery hit rate looks like at scale.
This is part of the Agent Primitives series. Related: What Is a Tool covers the tool primitive itself. The MCP Token Tax covers the problem this post addresses.