
Building the MCP Proxy: What Broke and What I Changed

Part three of the MCP series. Part one covers the token tax. Part two covers the pattern. This one covers what happened when I built it.

I finished writing the pattern post and then built the thing. Two bugs surfaced immediately, one of them subtle enough that I want to document it here. There were also two design decisions I changed after seeing the system run — one about the search algorithm, one about what discover_tools actually returns.

This is the implementation post.


The Crash That Wasn’t a Crash

The proxy loads all downstream MCP servers at startup, collects their tool schemas, and builds a search index over them. For the search index I used TF-IDF — reasonable choice, well-understood, and I had numpy available in my development environment.

The proxy is installed by patching a config file to point at a Python interpreter. I hardcoded /usr/bin/python3 — the system Python on macOS. The proxy launched, the MCP servers spawned, 206 tool schemas loaded cleanly. Then it crashed with ModuleNotFoundError: No module named 'numpy'.

The system Python doesn’t have numpy. My development environment does. Classic path mismatch.
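
For reference, the config entry in question looks roughly like this, assuming the common mcpServers JSON format; the server name and script path are placeholders, not my actual setup:

{
  "mcpServers": {
    "mcp-proxy": {
      "command": "/usr/bin/python3",
      "args": ["/path/to/proxy.py"]
    }
  }
}

The "command" line is the one that mattered: it points at an interpreter that never had numpy installed.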

This would have been easy to debug if the crash had been visible. It wasn’t — the proxy runs as a subprocess spawned by the desktop app, stderr goes nowhere, and the app showed the proxy as “connected” because the initialization handshake completed before the crash. The symptom was that discover_tools calls returned nothing. No error. Just silence.

The fix was two lines: change the Python path in the config to an interpreter that has numpy, or remove the numpy dependency. I did both: changed the path and replaced TF-IDF with BM25, which is pure stdlib.

The lesson isn’t about numpy. It’s that subprocess-based tools fail silently in ways that interactive tools don’t. The startup sequence needs explicit health checks: load, build, verify, then report ready. “Process started” is not the same as “process is working.”
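
As a rough sketch of what that sequence can look like (the stage functions, log path, and probe query here are placeholders, not the proxy's actual code):

import logging
import sys

logging.basicConfig(filename="/tmp/mcp-proxy.log", level=logging.INFO)

def start_proxy(load_servers, build_index, probe_query="email"):
    """Load, build, verify, then report ready. Log every stage to a file,
    because stderr from a subprocess spawned by the desktop app goes nowhere."""
    try:
        schemas = load_servers()
        logging.info("loaded %d tool schemas", len(schemas))

        index = build_index(schemas)
        logging.info("search index built")

        # Verify: a cheap probe query that should always match something.
        if not index.search(probe_query):
            raise RuntimeError("probe query returned no results")
    except Exception:
        logging.exception("startup failed")   # visible even when stderr is not
        sys.exit(1)

    logging.info("proxy ready")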


BM25 Over TF-IDF

Dropping the numpy dependency meant replacing the TF-IDF implementation. I went with BM25: same family of algorithms, better behavior for this specific use case, no external dependencies.

The difference that matters here: BM25 handles document length variation better than vanilla TF-IDF. Tool schemas are not uniform length. An email_inbox tool might have a 50-word description and three parameters. A create_opportunity tool in a CRM server might have 800 words of description, 30 parameters with individual descriptions, and nested object schemas.

TF-IDF scores the dense document higher just because it has more matching tokens — not because it’s more relevant. BM25 applies length normalization (the b parameter) and term saturation (the k1 parameter), so a query for “check email” doesn’t get outscored by a sprawling CRM schema that happens to mention email once in a parameter description.

In practice: the hard queries got better. “Log a customer call” reliably surfaces the right activity-logging tool at the top. “Pipeline review” surfaces the opportunity detail tool. These were noisy with TF-IDF. BM25 with k1=1.5, b=0.75 (standard defaults) got there without any tuning.

One behavior worth knowing: the search has a minimum score threshold. Results below 0.01 are dropped entirely. A completely wrong query doesn’t return noise — it returns nothing, with a “try broader terms” message. This is the right behavior, but it surprised me the first time I hit it: a query that’s too vague or too far from the tool’s vocabulary produces zero results rather than low-confidence results. When discovery returns nothing, the fix is almost always to rephrase the query, not to assume the tool doesn’t exist.
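
The scoring is small enough to sketch with the stdlib, using the k1, b, and threshold values above; the tokenization and data layout here are simplified, not the proxy's actual code:

import math
from collections import Counter

K1, B, MIN_SCORE = 1.5, 0.75, 0.01

def bm25_search(query_tokens, docs_tokens, top_k=5):
    """Rank documents (each a list of tokens) against the query with BM25."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    df = Counter()                          # document frequency per term
    for d in docs_tokens:
        df.update(set(d))

    results = []
    for i, d in enumerate(docs_tokens):
        tf = Counter(d)
        score = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # k1 caps the reward for repeated terms; b normalizes for document length
            norm = tf[t] * (K1 + 1) / (tf[t] + K1 * (1 - B + B * len(d) / avg_len))
            score += idf * norm
        if score >= MIN_SCORE:              # below the threshold: return nothing, not noise
            results.append((score, i))
    return sorted(results, reverse=True)[:top_k]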

The synonym map still matters. BM25 is still lexical — it matches tokens, not meaning. Domain-specific abbreviations mean nothing to a search algorithm unless you map them explicitly. The synonym expansion layer that expands query and document tokens before indexing is load-bearing. I extended it significantly for domain-specific vocabulary — internal abbreviations, product names, process terms that appear in enterprise tooling but don’t have obvious natural language equivalents. The specific terms depend entirely on your stack. The pattern is universal: any term your team uses that doesn’t appear in a dictionary is a candidate for the synonym map.
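
The mechanism itself is nothing fancy: a dictionary applied to both query and document tokens before indexing. The entries below are made-up examples standing in for the real map:

SYNONYMS = {
    "oppty": ["opportunity"],
    "crm": ["account", "opportunity", "pipeline"],
    "email": ["mail", "inbox", "message"],
}

def expand(tokens):
    """Append mapped synonyms so 'oppty' and 'opportunity' land on the same terms."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded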


The Schema Blob Problem

The bigger design fix was in what discover_tools returns.

The original implementation returned the raw inputSchema JSON for each matching tool. Something like:

{
  "name": "log_customer_interaction",
  "description": "Log a customer interaction or activity...",
  "inputSchema": {
    "type": "object",
    "properties": {
      "accountId": {"type": "string", "description": "..."},
      "subject": {"type": "string"},
      "activityDate": {"type": "string", "format": "date"},
      "activityType": {"type": "string", "enum": ["Call", "Meeting", "Demo"]},
      ...
    },
    "required": ["accountId", "subject"]
  }
}

The model reads this, figures out which fields are required, infers the right types and formats, and constructs a call_tool invocation. That’s four reasoning steps for every tool call that goes through discovery. It works, but it’s unnecessary work — and unnecessary work in a long-running session accumulates into unnecessary errors.

The fix: return a pre-filled template instead of a schema.

log_customer_interaction
Log a customer interaction or activity.

Ready to call with call_tool:
{
  "tool_name": "log_customer_interaction",
  "arguments": {
    "accountId": "<account_id>",
    "subject": "<subject>",
    "activityDate": "<YYYY-MM-DD>"
  }
}
# Optional: activityType, duration, relatedOpportunityId

The model copies the template, fills the placeholders, executes. No schema parsing, no type inference, no required-vs-optional reasoning. The placeholders are generated from the schema automatically — date fields get <YYYY-MM-DD>, integers get 0, booleans get false, strings get <field_name> in snake_case.
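
A sketch of that generation, following the rules above; the camelCase-to-snake_case conversion and the choice to template only the required fields are assumptions about details the prose doesn't pin down:

import re

def snake(name):
    # accountId -> account_id
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def placeholder(name, prop):
    if prop.get("format") == "date":
        return "<YYYY-MM-DD>"
    t = prop.get("type")
    if t == "integer":
        return 0
    if t == "boolean":
        return False
    return f"<{snake(name)}>"

def build_template(tool):
    """Turn a tool schema into a ready-to-fill call_tool invocation."""
    props = tool["inputSchema"]["properties"]
    required = tool["inputSchema"].get("required", [])
    template = {
        "tool_name": tool["name"],
        "arguments": {n: placeholder(n, props[n]) for n in required},
    }
    optional = [n for n in props if n not in required]
    return template, optional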

This changes discover_tools from a schema lookup to a usage recipe. The difference is small in tokens but meaningful in reliability — the model is completing a template instead of constructing a call from a specification.


One More Thing: Error Messages That Help

A related problem: when call_tool receives an unknown tool name, the original error was a bare string. “Unknown tool: log_activity.”

That’s accurate but not useful. The model doesn’t know what to do with it. The fix: make the error actionable.

Unknown tool: 'log_activity'.
Use discover_tools("log activity") to find the right tool name.

The hint reformats the tool name (underscores become spaces, lowercased) into a suggested query. It turns a dead end into a recovery path: the model reads the error, calls discover_tools("log activity"), gets the right tool and its template, and continues. No human intervention.
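
The construction is a few lines; the exact wording here is illustrative:

def unknown_tool_error(tool_name):
    """Turn a dead-end error into a recovery path the model can follow."""
    query = tool_name.replace("_", " ").replace("-", " ").lower()
    return (
        f"Unknown tool: '{tool_name}'.\n"
        f'Use discover_tools("{query}") to find the right tool name.'
    )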

Small change. High leverage.


What This Adds Up To

The proxy works. Here’s the measured before and after for my setup — 6 MCP servers, ~200 tools:

Configuration           Session start tokens    Cost per session*    Monthly (20 sessions/day)*
Default MCP loading     ~217,000                ~$0.65               ~$261
With proxy              ~2,100                  ~$0.006              ~$2.50

*At $3 per million input tokens.
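
The per-session figures follow directly from that rate: 217,000 tokens at $3 per million is about $0.65, and 2,100 tokens is about $0.006.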

The cost reduction is real but it’s not the primary reason to care. The more important number is the context window. A model with a 200k-token context window that loads 217k tokens of tool schemas before the first message has essentially no room left for conversation history, retrieved documents, or intermediate reasoning. It’s spending the entire context budget on schemas it may never use.

After the proxy, ~198k tokens are available for actual work at session start. That’s the difference between a model that can hold a full conversation with supporting documents in context and one that starts every session half-lobotomized.

There’s a second effect that’s harder to measure but matters more in practice: session flow. When you start a session with 217k tokens already consumed, any moderately complex conversation — a few tool calls, some retrieved documents, a chain of reasoning — pushes the context window toward its limit within a handful of exchanges. Most AI interfaces respond to this by auto-summarizing the conversation: collapsing the history into a compressed summary so the session can continue. The summary loses intermediate reasoning, collapses the detail of earlier exchanges, and forces the model to work from a digest instead of the actual conversation. The session continues, but something has been lost.

The proxy doesn’t prevent auto-summarization. But starting at 2,100 tokens instead of 217,000 means a session can run substantially longer before that threshold is reached — which in practice means it often never is. A focused work session, with tool calls and retrieved context, typically completes without the context window becoming a constraint.

The changes from the first working version to the current one:

- Pointed the config at an interpreter that has the dependencies, then removed the numpy dependency anyway.
- Replaced TF-IDF with BM25, with a minimum-score threshold so bad queries return nothing instead of noise.
- Extended the synonym map for domain-specific vocabulary.
- Changed discover_tools to return pre-filled call templates instead of raw schemas.
- Made unknown-tool errors suggest a discover_tools query instead of dead-ending.

None of these are fundamental rethinks. They’re the gap between “pattern described” and “pattern running reliably.”

The part I’m still watching: how often the model reaches for discover_tools versus calling an always-on tool directly. If the always-on set is well-calibrated, most sessions should complete without discovery calls at all — the most common tools are already in context. Discovery is for the long tail. Getting that balance right is more art than algorithm.


Related: The MCP Token Tax and The Meta-Tool Pattern. The pattern described here runs in my own setup — the specifics of your synonym map and always-on set will differ, but the proxy architecture and the two-tool interface are the transferable pieces.

