Your site has two audiences now. Browsers render your HTML for humans. Agents need something else entirely.
When an AI agent visits a website, it doesn’t see your carefully designed layout, your navigation bar, or your footer links. It sees a wall of text extracted from DOM elements — ads, cookie banners, navigation chrome, JavaScript-rendered content that may not even load. The conversion from HTML to useful context is lossy, expensive, and unreliable.
llms.txt fixes this. It’s a Markdown file at your site’s root that gives AI systems a curated, structured map of what’s here and where to find it. Think of it as robots.txt for the inference era — except instead of telling crawlers what to avoid, it tells agents what to consume and how.
The Problem: HTML Is a Terrible Agent Interface
HTML was designed in 1993 for documents rendered in browsers. Thirty years of evolution added navigation menus, advertising slots, JavaScript bundles, cookie consent modals, embedded tracking, and layout frameworks. All of it optimized for human visual processing.
An AI agent processing that same page has to:
- Fetch the HTML (hoping it’s not a JavaScript SPA that renders nothing server-side)
- Strip navigation, ads, footers, and chrome
- Extract the actual content
- Hope the extraction preserved structure, links, and code blocks
- Fit the result into a context window that can’t hold the full page anyway
Every step is lossy. Navigation links get mixed with content links. Code blocks lose formatting. Tables collapse. Metadata vanishes. The agent gets a degraded version of what you wrote.
This is the interface mismatch: your content is structured and valuable, but the delivery format (HTML) was never designed for machine consumption at inference time.
The Convention: What llms.txt Actually Is
Jeremy Howard proposed llms.txt in September 2024. The idea is simple: put a Markdown file at /llms.txt that serves as a curated index of your site for AI systems.
The format:
```markdown
# Site Name

> One-line description of what this site is.

Optional context paragraphs — key information an agent needs
to understand everything else.

## Section Name

- [Page Title](https://url): Brief description of what's there

## Optional

- [Less Important Page](https://url): Can be skipped for shorter context
```
That’s the entire spec. An H1 with the site name. A blockquote summary. Optional context. Then sections of links with descriptions. The `## Optional` section has special meaning — agents can skip it when context is tight.
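The format is simple enough that a consumer needs only a few lines to handle it. Here is a rough agent-side parser sketch, assuming well-formed input; the `LlmsIndex` shape is my own invention, not part of the spec:

```typescript
interface LlmsLink { title: string; url: string; description: string }
interface LlmsIndex { title: string; summary: string; sections: Record<string, LlmsLink[]> }

// Parse an llms.txt file: H1 title, blockquote summary, then H2 sections
// containing `- [title](url): description` link lines.
function parseLlmsTxt(text: string): LlmsIndex {
  const index: LlmsIndex = { title: "", summary: "", sections: {} };
  let section = "";
  for (const line of text.split("\n")) {
    if (line.startsWith("## ")) {
      section = line.slice(3).trim();
      index.sections[section] = [];
    } else if (line.startsWith("# ")) {
      index.title = line.slice(2).trim();
    } else if (line.startsWith("> ")) {
      // Blockquote summaries may span several lines; join them.
      index.summary += (index.summary ? " " : "") + line.slice(2).trim();
    } else if (section) {
      const m = line.match(/^-\s*\[(.+?)\]\((\S+?)\):?\s*(.*)$/);
      if (m) index.sections[section].push({ title: m[1], url: m[2], description: m[3] });
    }
  }
  return index;
}
```

An agent with tight context could then read only `index.summary` and the non-Optional section names before deciding which links to fetch.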
The adoption has been rapid. Over 844,000 sites had implemented llms.txt by late 2025. Cloudflare, Anthropic, Stripe, Zapier, and Vercel all ship one. It’s not a niche experiment — it’s becoming baseline infrastructure for any site that expects AI systems to interact with its content.
llms-full.txt: The Complete Corpus in One Request
llms.txt is the map. llms-full.txt is the territory.
Where llms.txt provides navigation and structure, llms-full.txt contains the actual complete content — every page, concatenated into a single Markdown file. One HTTP request, one response, full corpus.
The relationship between the two is complementary:
| | llms.txt | llms-full.txt |
|---|---|---|
| Purpose | Navigation and structure | Complete content |
| Size | Small (< 10 KB) | Large (can be multiple MB) |
| Use case | Quick orientation, selective retrieval | Full-context assistance, RAG ingestion |
| Analogy | Table of contents | The entire book |
Different AI tools use them differently. A chat assistant might read llms.txt to understand what’s available, then fetch specific linked pages as needed. A development environment like Cursor or Claude Code might prefer llms-full.txt — load the entire corpus into context and work with complete knowledge. A RAG pipeline might ingest llms-full.txt wholesale and chunk it for semantic search.
The dual-file approach means you serve both patterns: selective retrieval for context-constrained systems, and full ingestion for systems with room.
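A context-constrained client could pick between the two files mechanically. A rough sketch, assuming the server answers HEAD requests with a Content-Length header; the byte-budget heuristic and function names are illustrative, not part of the convention:

```typescript
// Decide which file to fetch, given the advertised size of llms-full.txt
// (e.g. from a HEAD request's Content-Length) and a rough context budget.
function chooseIndex(fullTextBytes: number, budgetBytes: number): "llms-full.txt" | "llms.txt" {
  return fullTextBytes <= budgetBytes ? "llms-full.txt" : "llms.txt";
}

// Fetch whichever fits; fall back to the small navigation index.
async function loadSiteContext(origin: string, budgetBytes: number): Promise<string> {
  const head = await fetch(`${origin}/llms-full.txt`, { method: "HEAD" });
  const size = Number(head.headers.get("content-length") ?? Number.POSITIVE_INFINITY);
  const file = head.ok ? chooseIndex(size, budgetBytes) : "llms.txt";
  const res = await fetch(`${origin}/${file}`);
  return res.text();
}
```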
My Implementation: Build-Time Generation in Astro
This site runs on Astro — a static site generator that compiles everything to HTML at build time. The llms.txt and llms-full.txt files are generated as part of the same build process.
/llms.txt is hand-authored. It’s a curated index — I decide what sections to highlight, what descriptions to write, what the site’s one-line summary is. This is editorial work, not automation. It looks like this:
```markdown
# Artificial Curiosity Labs

> Writing about AI-native work, agent infrastructure, and what happens
> when curiosity meets technology.

## Content

- [Blog](https://artificialcuriositylabs.dev/posts): All posts
- [About](https://artificialcuriositylabs.dev/about): Who I am
- [Full text for LLMs](https://artificialcuriositylabs.dev/llms-full.txt): Complete content
- [RSS Feed](https://artificialcuriositylabs.dev/rss.xml): Subscribe

## Topics

- AI-native work as an operating model
- AWS Bedrock — AgentCore, Claude models, inference patterns
- Claude Code — setup, ops, MCP server configuration
- Multi-agent architectures and patterns

## Permissions

This site grants permission to AI systems to index, retrieve,
and cite all content, provided attribution is given.
```
/llms-full.txt is auto-generated. A build script reads every `.md` file from the blog content directory, preserves frontmatter (title, date, description, tags), and concatenates them with `---` separators. The script runs in under a second as part of the normal Astro build.
The generator is straightforward:
- Glob all `.md` files from `src/data/blog/`
- Read each file’s content (frontmatter included — agents benefit from structured metadata)
- Concatenate with a header block: site name, author, last-updated date
- Write to `public/llms-full.txt`
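A minimal version of that build step might look like this. The paths match the ones above; the header format and function names are my own sketch, not the site's actual script:

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Join post files into one corpus, preceded by a header block.
// Frontmatter is kept verbatim; posts are separated by `---` lines.
function buildLlmsFull(site: string, author: string, posts: string[]): string {
  const header = [
    `# ${site} (full content for LLMs)`,
    `Author: ${author}`,
    `Last updated: ${new Date().toISOString().slice(0, 10)}`,
  ].join("\n");
  // The header becomes the first `---`-separated block in the corpus.
  return [header, ...posts].join("\n\n---\n\n") + "\n";
}

// Build step: read every .md file from the content directory and
// write the combined corpus as a static asset.
function generate(contentDir: string, outFile: string, site: string, author: string): void {
  const posts = readdirSync(contentDir)
    .filter((f) => f.endsWith(".md"))
    .sort()
    .map((f) => readFileSync(join(contentDir, f), "utf8").trim());
  writeFileSync(outFile, buildLlmsFull(site, author, posts));
}
```

Wired into the build, the call would be something like `generate("src/data/blog", "public/llms-full.txt", …)`, run before or after the Astro compile step.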
No runtime. No API calls. No database. Just a build step that reads files and writes a file. The output is a static asset served from the CDN like any other page — cached globally, available instantly.
The Permissions Block: Explicit Consent for AI Systems
One detail worth calling out: the `## Permissions` section in my llms.txt explicitly grants AI systems the right to index, retrieve, and cite the content with attribution.
This matters because the legal landscape around AI training and inference-time retrieval is unsettled. robots.txt was designed for crawling, not for inference-time consumption. Some sites use robots.txt to block AI crawlers entirely. Others want their content consumed but not used for training.
The permissions block in llms.txt is the clearest signal a site owner can give: yes, AI systems may use this content at inference time, under these conditions. It’s not legally binding in the way a license is — but it’s an explicit, machine-readable statement of intent that removes ambiguity.
What This Actually Enables
The payoff isn’t theoretical. Here’s what happens when your site has a well-structured llms-full.txt:
Any AI agent can consume your entire site in one request. No crawling, no pagination, no JavaScript rendering. A single fetch returns clean Markdown with preserved structure, links, and metadata.
Citation becomes trivial. When an agent pulls from your llms-full.txt, the source URL is known, the content is clean, and attribution is straightforward. Compare this to crawling HTML where the agent has to guess which page a paragraph came from.
RAG ingestion is zero-friction. Want your site’s content in a knowledge base? Point the ingestion pipeline at llms-full.txt. The content is already chunked by post (separated by `---`), already in Markdown (the universal intermediate format for RAG), already has metadata (frontmatter).
MCP servers can serve your content. An MCP server that makes your site queryable by agents? Fetch llms-full.txt on startup, chunk it, embed it. The plumbing that would normally require a custom scraper, HTML parser, and content extraction pipeline collapses to one HTTP GET.
Future AI search engines index you better. Perplexity, SearchGPT, Gemini search — these systems increasingly look for llms.txt as a signal of AI-readiness. Sites with llms.txt surface in more AI-generated answers because the content is pre-structured for consumption.
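The "already chunked by post" property can be exploited directly by a RAG or MCP ingestion step. A sketch of a chunker, assuming the separator convention described above, a `---` line surrounded by blank lines, which keeps it distinguishable from YAML frontmatter's own `---` delimiters (the `Chunk` shape is illustrative):

```typescript
interface Chunk { title: string; body: string }

// Split an llms-full.txt corpus into per-post chunks, ready for
// embedding. Posts are assumed to be joined with `\n\n---\n\n`.
function chunkCorpus(corpus: string): Chunk[] {
  return corpus
    .split(/\n\n---\n\n/)
    .map((raw) => raw.trim())
    .filter((raw) => raw.length > 0)
    .map((body) => {
      // Prefer the frontmatter title; fall back to the first heading.
      const fm = body.match(/^---\n([\s\S]*?)\n---/);
      const title = fm?.[1].match(/^title:\s*["']?(.+?)["']?\s*$/m);
      const h1 = body.match(/^#\s+(.+)$/m);
      return { title: title?.[1] ?? h1?.[1] ?? "(untitled)", body };
    });
}
```

From here, each chunk goes to the embedding model with its title as metadata; no scraper, parser, or extraction pipeline involved.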
The Dual-Consumption Architecture
This is the architectural insight: the same content, authored once in Markdown, serves two completely different consumption patterns through two completely different interfaces.
```
              ┌─────────────┐
              │  Markdown   │
              │   Source    │
              │  (author)   │
              └──────┬──────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│   Astro Build   │     │  llms-full.txt  │
│   → HTML/CSS    │     │    Generator    │
│   → JS bundle   │     │   → Markdown    │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│    Browsers     │     │    AI Agents    │
│    (humans)     │     │   (machines)    │
└─────────────────┘     └─────────────────┘
```
No content duplication. No sync problem. One source of truth generates both interfaces. When you write a new post, the next build produces both the HTML page and the updated llms-full.txt automatically.
This is the same pattern that made APIs successful alongside web UIs — same data, different interface for different consumers. The web learned this lesson with REST APIs in the 2000s. We’re learning it again now for AI consumption.
The Asymmetric Bet
Implementing llms.txt took an afternoon. The llms.txt file itself is 30 lines of hand-written Markdown. The llms-full.txt generator is a short build script. The marginal cost of maintaining it is zero — it regenerates automatically every deploy.
The upside is unknown but structurally asymmetric. As AI agents become more prevalent — as more people interact with content through Claude, ChatGPT, Perplexity, Cursor, and whatever comes next — having your content pre-structured for that consumption pattern is either table stakes or a differentiator. Either way, the cost was near-zero and the decision is irreversible in the good direction.
The sites that implemented RSS early didn’t know exactly how it would be used either. Some of those feeds are still being consumed twenty years later by tools the authors never imagined.
What’s Next
The llms.txt convention is still early. Jeremy Howard’s original spec is intentionally minimal — an H1, a blockquote, sections of links. That’s it. No schema validation, no required fields beyond the title, no versioning.
Open questions worth watching:
- Versioning: Should `llms-full.txt` include a content hash or version identifier so agents can check if it’s changed since last fetch?
- Partial retrieval: For sites with hundreds of pages, should there be intermediate files — `llms-full-section.txt` — for selective loading?
- Structured metadata: Should frontmatter conventions standardize beyond the basic title/description/date pattern?
- Freshness signals: How does an agent know when to re-fetch? Cache headers help, but a `last-updated` timestamp in the file itself is more reliable for agents that don’t inspect HTTP headers.
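On the versioning and freshness questions, a build-time content hash is cheap to add today without waiting for the spec to settle. A sketch; the helper names and truncation length are my own, not any standard:

```typescript
import { createHash } from "node:crypto";

// Hash the corpus at build time; publish the hash (e.g. in a header
// line) so agents that cached it can skip re-ingesting unchanged content.
function contentHash(corpus: string): string {
  return createHash("sha256").update(corpus, "utf8").digest("hex").slice(0, 16);
}

// Agent side: re-fetch only when there is no cached hash or it differs.
function needsRefetch(cachedHash: string | null, corpus: string): boolean {
  return cachedHash !== contentHash(corpus);
}
```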
For now, the baseline is clear: put an llms.txt at your root, generate an llms-full.txt at build time, add a permissions block, and make your content available to the next generation of consumers. The cost is an afternoon. The upside compounds.