When I was building the intelligence layer for my team’s field KB, I nearly bought Mintlify.
Mintlify is well-designed. It ships an auto-generated MCP server at /mcp for every docs site — zero configuration, live the day you publish. Built-in Claude-powered search, /llms.txt auto-generation, PR preview deployments, an Autopilot agent that monitors connected repos and drafts documentation updates as PRs. Around $250–300/month for a Pro plan. No infrastructure to manage.
The feature list looked like exactly what I needed. I built my own stack instead — a custom Knowledge Base with a vector store, a Lambda-backed MCP server, a CI pipeline that exports markdown to S3 and re-syncs on every merge. With an AI coding tool, the implementation took roughly the same amount of time as setting up the SaaS would have. The decision had nothing to do with build time.
This post is about why. Not “why I’m special” — but the principle behind the decision, which generalizes to any AI infrastructure build vs. buy choice.
The problem docs platforms are built to solve
The underlying problem is real. 62% of customer-facing teams say their reference materials are consistently outdated. 54% of businesses use more than 5 disconnected tools for documentation and information sharing. And 42% of organizations cite “employees are overworked and don’t have time for knowledge management” — the constraint is not a lack of tools but the friction of the write path.
The traditional solution — Highspot, Seismic, SharePoint, Confluence — hasn’t fixed this. These platforms are philosophically opposite to docs-as-code: proprietary, GUI-based, platform-locked, no version control. And they haven’t solved the fundamental problem either: 65% of company content goes unused by the teams it was created for, and 38% of enablement leaders cite outdated content as their top challenge.
Docs platforms like Mintlify represent a genuine improvement: markdown as source of truth, Git-based versioning, MCP for AI consumption. For developer documentation — API references, SDK guides, integration tutorials — this architecture is excellent.
The question is what happens when the use case is not developer documentation.
The feature comparison misses the question
The standard build vs. buy analysis is a feature table. You list what you need, check what the product covers, calculate the gap. If the gap is small, buy. If the gap is large, build.
This framing fails for AI infrastructure because it treats capability gaps as degree differences — the SaaS product has 80% of what you need, you just need to close the last 20%. That’s usually the right frame for business software. It’s the wrong frame when the 20% gap is a fundamental architectural difference.
Mintlify’s auto-generated MCP server supports metadata filtering — but only on built-in document attributes: version (v1.2), language (en), and relevance score threshold. There are no custom metadata fields. No buying_signals. No personas. No competitive_alternatives. No category. The filter surface is what makes sense for developer documentation: “show me v2 API docs in English.” It has no concept of a field intelligence dimension.
My use case was a field intelligence layer. The retrieval question is not “find pages about X” — it is “find pages matching a buying signal for a specific persona evaluating a specific product category.” That requires structured custom metadata pre-filters on vector search. No amount of adding to Mintlify’s filter surface gets you there, because those filter dimensions don’t exist in the platform’s data model.
These are not alternatives to the same capability. They are different capabilities.
The question is not “does this SaaS cover enough features?” The question is “is the architectural pattern of this SaaS the pattern my use case requires?”
What the decision actually rested on
Three things made building the right call:
1. Governance. An internal field intelligence site with account data, competitive analysis, and buying signals lives inside company infrastructure or not at all. Mintlify is SaaS-only — no self-hosted option. That eliminated it before the feature comparison started. This wasn’t a close call. If your content has data classification constraints, hosting location isn’t a variable. External research confirms this is the primary legitimate reason to build: “Unique data, regulatory needs, or competitive differentiation requirements unmet by vendors.”
2. Retrieval architecture. The gap wasn’t in a feature — it was in the model. What I needed was hierarchical chunking by heading structure, per-page metadata sidecars with custom structured fields, and pre-filter queries that narrow the vector search before it runs. That requires an architecture where I control the metadata schema, the chunking strategy, and the filter predicates. That is not a Mintlify feature that’s missing. It’s a different retrieval architecture built for a different use case.
3. The conversation value. If you are in customer conversations explaining why AI knowledge systems fail and how to build durable ones — having built one yourself is the credibility artifact. “We built it, here are the decisions we made, here’s what it cost, here’s what it unlocked” is a peer conversation. “We use the SaaS version” is not. This only applies when the build is genuinely required for your actual use case — it is not an argument for building everything.
The technical implementation
What building it actually involved:
Markdown structure
Every page follows a convention designed for machine consumption, not just human reading. Three structural layers:
---
title: [Page title]
description: [One-line summary for search and llms.txt]
category: [top-level category]
last_updated: YYYY-MM-DD
page_owner: [owner alias]
---
<!-- agent-context
category: [category]
tags: [tag-1, tag-2, tag-3]
personas: [persona-1, persona-2]
buying_signals: [signal-1, signal-2, signal-3]
competitive_alternatives: [alt-1, alt-2]
confidence: high|medium|low
-->
## Overview
## Key Capabilities
## Architecture
## Pricing
## Competitive Landscape
The <!-- agent-context --> block is machine-readable metadata embedded in each page. A CI script reads this block on every merge, extracts the structured fields, and writes a .metadata.json sidecar to S3 alongside the exported markdown. That metadata attaches to every chunk the KB produces from the page — buying signals, personas, and categories flow into retrieval filters automatically.
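A minimal sketch of that extraction step, assuming Node and the conventions above (the parsing is simplified and the paths are hypothetical; the sidecar wraps the fields in metadataAttributes, the shape Bedrock expects for .metadata.json files):

```js
// Sketch: pull the agent-context block out of a page and write the
// .metadata.json sidecar in Bedrock's metadataAttributes shape.
import fs from "node:fs";

function extractAgentContext(markdown) {
  const block = markdown.match(/<!--\s*agent-context([\s\S]*?)-->/);
  if (!block) return {};
  const fields = {};
  for (const line of block[1].split("\n")) {
    const m = line.match(/^\s*([\w-]+):\s*(.+)$/);
    if (!m) continue;
    const [, key, raw] = m;
    // Bracketed values (tags, personas, buying_signals, ...) become string lists.
    const value = raw.trim().startsWith("[")
      ? raw.replace(/[\[\]]/g, "").split(",").map((s) => s.trim()).filter(Boolean)
      : raw.trim();
    // Bedrock KB rejects empty-string attribute values, so drop empties here.
    if (value && value.length) fields[key] = value;
  }
  return fields;
}

// Hypothetical paths, for illustration only.
const page = fs.readFileSync("content/example-page.md", "utf8");
fs.writeFileSync(
  "example-page.metadata.json",
  JSON.stringify({ metadataAttributes: extractAgentContext(page) }, null, 2)
);
```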
The heading structure (##) is not just organization — it determines chunk boundaries. Pages are kept short enough to fit within a single parent chunk, which is a retrieval constraint, not a content guideline.
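That constraint is easy to enforce in CI. A minimal sketch of such a check, using a rough four-characters-per-token heuristic (the real tokenizer differs by model, so the threshold is approximate; the paths are illustrative):

```js
// Sketch: warn when a page is likely to exceed a single ~1,500-token parent chunk.
import fs from "node:fs";
import path from "node:path";

const PARENT_CHUNK_TOKENS = 1500; // parent chunk size used by the KB (see chunking section)
const CHARS_PER_TOKEN = 4;        // rough heuristic, not the model's tokenizer

for (const file of fs.readdirSync("content")) {
  if (!file.endsWith(".md")) continue;
  const body = fs.readFileSync(path.join("content", file), "utf8");
  const approxTokens = Math.ceil(body.length / CHARS_PER_TOKEN);
  if (approxTokens > PARENT_CHUNK_TOKENS) {
    console.warn(`${file}: ~${approxTokens} tokens, likely spans multiple parent chunks`);
  }
}
```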
Embedding model selection
Model selection for Bedrock Knowledge Base is a constrained optimization problem, not a free choice.
The first pick was cohere.embed-english-v4 — higher MTEB score (65.2 vs 62.96 for v3), newer model. Bedrock Knowledge Base rejected it. Root cause: Bedrock Knowledge Base doesn’t accept inference profile ARNs for embedding models as of early 2026. The newer model is only accessible via inference profiles.
Next candidate: Amazon Nova Embed. Available — but us-east-1 only. The KB runs in us-west-2.
Landing point: cohere.embed-english-v3 — MTEB score 62.96, available ON_DEMAND in us-west-2, accepted by Bedrock Knowledge Base without inference profile constraints. Not the highest-scoring model in isolation. The highest-scoring model available in us-west-2 that Bedrock Knowledge Base would accept.
This is a specific constraint most embedding model comparisons don’t surface: Bedrock KB has its own acceptance rules that don’t match Bedrock’s general model availability. Check the CLI first:
aws bedrock list-foundation-models \
--by-output-modality EMBEDDING \
--region us-west-2 \
--query 'modelSummaries[*].modelId'
Then cross-reference with MTEB. “Best available” is a constrained optimization. The constraint is KB’s inference profile restriction — understand it before evaluating models you can’t use.
Chunking strategy
AWS Bedrock hierarchical chunking works as follows:
- Ingestion: Child chunks (small, precise) are created and embedded into the vector index. Parent chunks (larger context) are stored alongside.
- Retrieval: A query matches child chunks based on semantic similarity. The system automatically replaces matched child chunks with their parent chunks before returning results — section-level precision from the child match, page-level context from the parent return.
Configuration for field intelligence content:
| Level | Size | Purpose |
|---|---|---|
| Child chunk | 300 tokens | Paragraph-level. Drives vector match precision. Cohere v3’s 512-token limit applies — child chunks at 300 tokens are well within it. |
| Parent chunk | ~1,500 tokens | Section-level. Returned as context when a child matches. |
| Overlap | 10% | Continuity across consecutive chunks. |
300 tokens was calibrated to section-level content: a capabilities list, a pricing table, a competitive comparison. Short enough for precise retrieval, long enough to carry useful context.
One important constraint: “page = parent chunk” is an architectural choice in this implementation, not a Bedrock default. Bedrock splits parent chunks by token count, not by document boundary. Pages are deliberately kept short enough to fit within a single parent chunk. If a page is longer than that, Bedrock creates multiple parent chunks per page — only the matching parent segment returns, not the full document.
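For reference, a minimal sketch of the corresponding data source configuration, assuming the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent); the IDs and ARNs are placeholders, and the API takes overlap as a token count rather than a percentage:

```js
// Sketch: hierarchical chunking configured to match the table above
// (parent ~1,500 tokens, child 300 tokens, ~10% overlap on the child size).
import {
  BedrockAgentClient,
  CreateDataSourceCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

await client.send(
  new CreateDataSourceCommand({
    knowledgeBaseId: "KB_ID", // placeholder
    name: "field-kb-pages",
    dataSourceConfiguration: {
      type: "S3",
      s3Configuration: { bucketArn: "arn:aws:s3:::your-kb-bucket" }, // placeholder
    },
    vectorIngestionConfiguration: {
      chunkingConfiguration: {
        chunkingStrategy: "HIERARCHICAL",
        hierarchicalChunkingConfiguration: {
          levelConfigurations: [
            { maxTokens: 1500 }, // parent chunks
            { maxTokens: 300 },  // child chunks
          ],
          overlapTokens: 30, // ~10% of the child chunk size
        },
      },
    },
  })
);
```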
Why hierarchical over flat: Research on hierarchical chunking consistently shows 15–25% improvement in RAG faithfulness scores versus flat fixed-size chunking. Flat chunking loses structural context — a pricing figure retrieved without its surrounding product context is less useful than the same figure retrieved with the full pricing section.
The export pipeline
A CI script (export-to-kb.js) runs on every merge:
- Reads all markdown files under `content/`
- Extracts the `<!-- agent-context -->` block from each page
- Strips the agent-context comment from the exported markdown (clean content to S3)
- Writes a `.metadata.json` sidecar for each page with the extracted fields
- Uploads both to S3: `pages/partner-name.md` + `pages/partner-name.metadata.json`
- Calls `StartIngestionJob` on the Bedrock Knowledge Base to re-sync
Two specific bugs worth knowing:
Empty metadata attributes: Pages without an <!-- agent-context --> block produce .metadata.json files with empty string values. Bedrock Knowledge Base rejects empty strings — filter out empty-value keys before writing sidecars.
Concurrent pipeline conflict: Parallel CI runs both calling StartIngestionJob get a ConflictException (409) on the second call. Catch ConflictException and exit 0 — content is already in S3 and the running job handles it.
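A minimal sketch of that re-sync step with the conflict handling, again assuming @aws-sdk/client-bedrock-agent (IDs are placeholders):

```js
// Sketch: trigger a KB re-sync after upload. A second pipeline racing this one
// gets ConflictException; swallowing it is safe because the content is already
// in S3 and the running job will pick it up.
import {
  BedrockAgentClient,
  StartIngestionJobCommand,
  ConflictException,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

try {
  const { ingestionJob } = await client.send(
    new StartIngestionJobCommand({
      knowledgeBaseId: "KB_ID",       // placeholder
      dataSourceId: "DATA_SOURCE_ID", // placeholder
    })
  );
  console.log(`Ingestion job started: ${ingestionJob.ingestionJobId}`);
} catch (err) {
  if (err instanceof ConflictException) {
    console.log("Ingestion already running; content is in S3, exiting cleanly.");
    process.exit(0);
  }
  throw err;
}
```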
Metadata-filtered retrieval
The MCP server exposes three tools, each mapping to a different Bedrock Retrieve call:
// Pre-filter on category, then semantic search
{
"equals": { "key": "category", "value": "your-category" }
}
// Array membership — buying signals, personas stored as arrays
{
"listContains": { "key": "buying_signals", "value": "your-signal" }
}
// AND combination — narrow before vector search runs
{
"andAll": [
{ "equals": { "key": "category", "value": "your-category" } },
{ "listContains": { "key": "personas", "value": "your-persona" } }
]
}
A query like “what tools matter for [persona] evaluating [use case]?” filters on category, personas, and buying_signals before the vector search runs. The result set is the right subset. The vector search then ranks within that subset.
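A minimal sketch of how one of those tools might issue the call, assuming @aws-sdk/client-bedrock-agent-runtime; the wrapper function and values are illustrative, and the filter shape mirrors the examples above:

```js
// Sketch: pre-filter on category and persona, then let vector search
// rank within the filtered subset.
import {
  BedrockAgentRuntimeClient,
  RetrieveCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const client = new BedrockAgentRuntimeClient({ region: "us-west-2" });

export async function queryContext(query, { category, persona }) {
  const { retrievalResults } = await client.send(
    new RetrieveCommand({
      knowledgeBaseId: "KB_ID", // placeholder
      retrievalQuery: { text: query },
      retrievalConfiguration: {
        vectorSearchConfiguration: {
          numberOfResults: 5,
          filter: {
            andAll: [
              { equals: { key: "category", value: category } },
              { listContains: { key: "personas", value: persona } },
            ],
          },
        },
      },
    })
  );
  return retrievalResults.map((r) => ({
    text: r.content.text,
    score: r.score,
    uri: r.location?.s3Location?.uri,
  }));
}
```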
This is the difference between a retrieval tool and a search engine. A search engine returns pages where the words appear. A retrieval tool returns pages that match the field situation — who’s asking, what they’re evaluating, what signal brought them to the conversation.
What building actually cost — honest accounting
The core implementation — Lambda, S3 pipeline, vector store, CI sync scripts — took an afternoon with an AI coding tool. Research, integration testing, and edge cases added a day or two on top of that. The CI pipeline scripts are each under 200 lines. The infrastructure is standard. That part was fast.
The harder problem: getting the MCP server distributed in a standardized way with the right SSO auth and IT approval. This is where the build cost doesn’t compress the way the coding work does.
Distributing a new MCP server to team members means navigating internal software tooling registration — the equivalent of getting a package into a managed software catalog. That involves: security review, compliance sign-off, service registry enrollment, IT approval processes. The MCP server was built and working in days. Getting it provisioned through the standard managed tooling path took weeks — not because of technical problems, but because organizational process doesn’t compress the way code does.
This is the cost that most build vs. buy comparisons miss. External analysis puts DIY RAG infrastructure total cost of ownership at $3,000+/month versus ~$450/month for managed platforms at mid-scale — but that figure prices in traditional engineering labor. The cost profile for a small team building with AI assistance is different: the coding overhead is low; the distribution and governance overhead is not.
The honest accounting:
| Cost category | With AI coding tool | Traditional build |
|---|---|---|
| Implementation | Days | Weeks–months |
| Debugging / testing | Hours | Days–weeks |
| Distribution / IT approval | Weeks–months (same either way) | Weeks–months (same either way) |
| Ongoing maintenance | Hours/month | Days/month |
The build cost curve changed for coding. It didn’t change for organizational process. If your use case requires a custom architecture, expect the implementation to be faster than intuition suggests — and expect the distribution to take as long as it always has.
Running the actual cost numbers
This matters for the decision, and most build vs. buy writeups skip it. Here’s the real comparison:
Mintlify Pro runs $250–$300/month base — but with AI usage overages ($0.13–0.15 per message beyond the 250/month included) and additional seats ($20/seat/month), a 5–10 person team using the AI search features pays $350–500/month in practice. That’s $4,200–$6,000/year.
A self-hosted RAG stack — S3, vector store, Lambda, API Gateway — runs $80–250/month for comparable usage. Third-party analysis puts self-hosted 2–5x cheaper than managed platforms at team scale, with breakeven at 2–4 months of SaaS subscription costs.
So even for cases where the architecture would fit a SaaS platform, the cost argument often doesn’t favor buying. The correct framing is not “building is waste if SaaS covers your needs” — it’s: run the actual numbers, account for AI usage overages, and compare against what your self-hosted infrastructure actually costs to run. The SaaS subscription may look cheap at the base price and expensive at actual usage.
The evaluation criteria: Does the architectural pattern fit? Do the costs justify the build overhead and distribution friction? Both questions need answers. Neither one alone makes the decision.
The feedback loop the SaaS cannot auto-generate
The architecture produces a signal that no SaaS editorial dashboard generates.
Every tool call to the MCP server logs a structured JSON record: tool name, query text, result count, latency, top relevance score, and which pages were returned. A query with top_score < 0.45 means the KB’s best match is still a weak one — the field builder needed something the KB can’t answer well. Aggregate those low-confidence queries and you have a content backlog built automatically by usage. No survey, no editorial meeting, no guessing.
{
"tool": "query_context",
"input": { "query": "rate limiting options for llm api calls", "n": 5 },
"result_count": 4,
"latency_ms": 382,
"top_score": 0.71,
"pages": ["/category-a/page-1", "/category-b/page-2"]
}
The feedback loop: field builder queries the KB in a customer conversation → query logs → gap list (low top_score queries) → KB page added or metadata updated → KB re-synced → next query gets a result.
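A minimal sketch of the gap-list step, assuming the tool-call records land as JSON lines somewhere readable (the threshold and path are illustrative):

```js
// Sketch: aggregate low-confidence queries from MCP tool-call logs into a
// content backlog, most frequent gaps first.
import fs from "node:fs";

const THRESHOLD = 0.45; // a top_score below this means the best match was weak

const gaps = new Map();
for (const line of fs.readFileSync("logs/tool-calls.jsonl", "utf8").split("\n")) {
  if (!line.trim()) continue;
  const record = JSON.parse(line);
  if (record.top_score >= THRESHOLD) continue;
  const query = record.input.query.toLowerCase();
  gaps.set(query, (gaps.get(query) ?? 0) + 1);
}

const backlog = [...gaps.entries()].sort((a, b) => b[1] - a[1]);
for (const [query, count] of backlog) {
  console.log(`${count}x  ${query}`);
}
```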
This is content intelligence that emerges from usage, not from editorial judgment. It requires owning the log pipeline — which means it requires building, not buying.
The pattern generalizes
Build vs. buy for AI infrastructure resolves differently than for business software because:
Auto-generated AI capabilities have implicit architectural assumptions. Auto-generated MCP assumes text search over public content. Auto-generated embeddings assume general-purpose semantic similarity. Auto-generated chunking assumes flat prose documents. If your use case fits those assumptions, auto-generated is excellent. If it doesn’t, you’re not missing features — you’re missing architecture.
The build cost curve changed. 74% of companies struggle to scale AI value despite 78% adoption — in most cases because they bought platforms that don’t fit the problem shape, then tried to adapt. The gap between “buy the right architecture” and “build the architecture you need” used to be months of engineering. It’s days now. This changes the cost side of the equation without changing the capability side.
The AI infrastructure layer is load-bearing. How your KB is chunked, how your retrieval is filtered, how your MCP tools are shaped — these decisions determine what questions your agents can answer. Getting them wrong doesn’t produce a suboptimal feature. It produces a system that retrieves poorly under real query conditions, even if every component looks correct in isolation. Retrieval quality is not visible from the content — you find out at query time.
The build vs. buy question I arrived at: not “is this SaaS cheaper?” but “is this SaaS the right architecture for my use case — and do the actual costs support the decision either way?”
For an internal metadata-filtered field intelligence layer with governance constraints, no SaaS is the right architecture — and even on cost, a self-hosted stack comes out ahead at team scale.
For a public developer documentation site with text search needs and a small team: run the numbers. Mintlify at $250/month base looks cheap. At real usage, with AI overages and multiple seats, $400–500/month is common. A self-hosted stack with the right architecture may be both cheaper and better-fit. The SaaS is a reasonable choice if the zero-ops overhead is worth the premium — but it’s worth calculating, not assuming.
Capability first: does the architectural pattern fit your use case? Cost second: run the actual numbers, including overages and distribution friction — not just the base subscription price. The answer to both questions often points the same direction.