Voice Is a Layer, Not a Setting

TL;DR

Five writing skills with embedded voice instructions = five drifting definitions; the same person sounds like different writers within months.
The fix is four independent layers: mode detection → voice + quality → format → publish. Voice lives in one place, called by everything else.
A single correction to the centralized voice layer propagates instantly across blog posts, Slack threads, emails, and strategy docs — no hunting across five skills.
Mode detection runs before a single word is written, resolving context from a five-signal hierarchy (explicit override, recipient, role, channel, intent keywords) with no manual selection required.
Every major tool treats voice as a setting. Separate the layers, centralize the voice, and the drift problem disappears.

The person is the constant. The mode is the variable. The medium is irrelevant.

If you have five writing skills and each one defines voice separately, you have five competing voice definitions. Over time they drift. A blog post and a Slack thread about the same topic come out sounding like different people wrote them. Not because the agent changed — because the voice instructions were never in the same place.

The fix is architecture, not better prompts. Voice is a layer. It belongs in one place, called by everything else.

The Problem With Embedded Voice

Every AI writing tool faces the same temptation: put the voice instructions where the writing happens. The blog skill says “write in a practitioner voice, evidence-based, no hedging.” The email skill says “write professionally, direct, data-specific.” The Slack skill says “keep it short, action-oriented.”

Three skills, three definitions of “professional.” None of them wrong. All of them slightly different. The drift is imperceptible at first — a slightly different sentence rhythm here, a slightly different threshold for hedging there. After a few months of iterating each skill independently, the same person sounds like three different writers depending on which skill ran.

This is not a voice problem. It is an architecture problem.

Four Layers

The fix is separating concerns that were bundled together:

Layer 1: MODE DETECTION
  What voice variant to use — casual, professional, leadership,
  field, publishing, or builder. Resolved from context before
  writing begins. Never manual.

Layer 2: VOICE + QUALITY
  The universal standards that apply regardless of mode.
  Cliché guard. Citation rules. Quality checklist. Anti-patterns.
  One definition. Called by everything.

Layer 3: FORMAT
  Structure, length, frontmatter, conventions.
  Blog format. Slack format. Email format. Strategy doc format.
  Each content type has its own format layer.
  Format knows nothing about voice.

Layer 4: PUBLISH
  Upload, verify, RAG optimization.
  Always a separate explicit step.
  Never bundled into format.

The calling skill provides the format layer. It calls the voice layer. The voice layer calls mode detection. The result: the voice is consistent across every content type because it lives in one place, not five.
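
The call chain above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: every name here (`Request`, `Voiced`, `detect_mode`, `apply_voice`, `slack_skill`, `blog_skill`) is hypothetical, mode detection is stubbed to read only the channel, and the voice pass is a placeholder that just normalizes whitespace. Layer 4 is left out because publish runs as its own step.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str     # raw content to be written up
    channel: str  # hypothetical context signal, e.g. "slack" or "blog"


@dataclass
class Voiced:
    mode: str
    text: str


def detect_mode(request: Request) -> str:
    # Layer 1 (stubbed): only the channel signal is read here.
    return "publishing" if request.channel == "blog" else "professional"


def apply_voice(request: Request) -> Voiced:
    # Layer 2: the one shared voice definition; it calls mode detection itself.
    mode = detect_mode(request)
    cleaned = " ".join(request.text.split())  # placeholder for the real voice pass
    return Voiced(mode=mode, text=cleaned)


def slack_skill(request: Request) -> str:
    # Layer 3: structure only. The skill never defines voice, it calls the layer.
    return apply_voice(request).text


def blog_skill(request: Request) -> str:
    # Layer 3 for a different content type, same voice layer underneath.
    body = apply_voice(request).text
    return f"---\ncategory: draft\n---\n\n{body}\n"
```

Both skills go through `apply_voice`, so a change to that one function reaches Slack and blog output alike; neither skill carries a voice definition of its own.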

Layer 1: Mode Detection

Mode detection runs before a single word is written. A five-signal priority hierarchy resolves the correct voice variant from context:

1. Explicit override: “keep it casual” or “exec tone” wins immediately
2. Recipient override: per-person config for people who always get a specific mode
3. Role mapping: looks up the recipient in a contacts registry, maps relationship (peer, manager, customer, close colleague) to mode
4. Channel detection: Slack public channel → professional; email to external domain → field; blog post → publishing
5. Intent keywords: “ping him,” “heads up” → casual; “endorsement request” → leadership; “write a post” → publishing

Default: professional.

The agent never asks which mode to use. The signal is already there — recipient, channel, intent. The hierarchy reads it.

What makes this maintainable: the detection rules live in a YAML config file, not in code. Adding a new recipient override is a one-line edit. Adjusting a keyword mapping takes ten seconds. No code change is needed when the context changes.
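
A rough sketch of how the five-signal hierarchy could resolve a mode from data rather than code. The `CONFIG` dict stands in for whatever the YAML file would load into; its key names, the example addresses, the role-to-mode pairs, and the mapping of “exec tone” to leadership are all assumptions made for illustration. Only the channel and intent-keyword mappings come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config, shaped like the result of loading a YAML file.
CONFIG = {
    # Assumption: "exec tone" maps to leadership; the source does not say.
    "explicit_overrides": {"keep it casual": "casual", "exec tone": "leadership"},
    "recipient_overrides": {"dana@example.com": "casual"},
    "contacts": {"sam@example.com": "manager"},
    # Assumption: these role-to-mode pairs are illustrative only.
    "role_to_mode": {"close colleague": "casual", "peer": "professional",
                     "manager": "leadership", "customer": "field"},
    "channel_to_mode": {"slack_public": "professional",
                        "email_external": "field", "blog": "publishing"},
    "intent_keywords": {"ping him": "casual", "heads up": "casual",
                        "endorsement request": "leadership",
                        "write a post": "publishing"},
    "default": "professional",
}


@dataclass
class Context:
    instruction: str = ""
    recipient: Optional[str] = None
    channel: Optional[str] = None


def resolve_mode(ctx: Context, config: dict = CONFIG) -> str:
    text = ctx.instruction.lower()

    # 1. Explicit override wins immediately.
    for phrase, mode in config["explicit_overrides"].items():
        if phrase in text:
            return mode

    # 2. Per-person override.
    if ctx.recipient in config["recipient_overrides"]:
        return config["recipient_overrides"][ctx.recipient]

    # 3. Role mapping through the contacts registry.
    role = config["contacts"].get(ctx.recipient)
    if role in config["role_to_mode"]:
        return config["role_to_mode"][role]

    # 4. Channel detection.
    if ctx.channel in config["channel_to_mode"]:
        return config["channel_to_mode"][ctx.channel]

    # 5. Intent keywords.
    for phrase, mode in config["intent_keywords"].items():
        if phrase in text:
            return mode

    # No signal matched: fall back to the default mode.
    return config["default"]
```

Because the function only reads tables, adding a recipient override means adding one entry to `recipient_overrides`; `resolve_mode` itself stays untouched.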

Layer 2: Voice + Quality

This is the layer most tools skip. Every writing skill embeds its own voice definition. The four-layer architecture pulls that definition out and centralizes it.

The voice layer owns:

The cliché guard — a universal banned-phrase list that runs on every piece of output regardless of mode or format. “Robust,” “seamless,” “comprehensive,” “game-changing” — banned everywhere, always, because they are placeholders for the specific thing the writer actually means. The guard does not restrict expression. It forces specificity.

The never_say lists — mode-specific bans that load with the engram. Casual mode bans “I hope this note finds you well.” Leadership mode bans “either way, no worries if not.” Publishing mode bans credential framing. The bans are decisions, not style preferences — they encode what the writer has explicitly rejected in real output.

The quality checklist — conditions that must be met before output returns: opens with outcome not setup; every falsifiable claim has a source or “in my experience” label; no credential framing; has a “so what”; ends on action not opt-out.

Citation rules — inline links for every factual claim, “in my experience” for unlinkable observations. Not footnotes. Not optional.

Because this layer is centralized, a correction made in one place propagates everywhere. When a phrase such as “robust” is added to the cliché guard, it is banned in blog posts, Slack threads, emails, and strategy docs simultaneously. No hunting across five skills to update five separate voice definitions.
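
To make the two kinds of ban concrete, here is a small sketch of a universal cliché guard combined with mode-specific never_say lists. The phrases are the ones quoted above; the names `CLICHE_GUARD`, `NEVER_SAY`, and `banned_phrases`, and the choice of whole-word, case-insensitive matching, are assumptions rather than the actual implementation.

```python
import re

# Universal bans: checked in every mode and every format.
CLICHE_GUARD = ["robust", "seamless", "comprehensive", "game-changing"]

# Mode-specific bans, keyed by mode name.
NEVER_SAY = {
    "casual": ["i hope this note finds you well"],
    "leadership": ["either way, no worries if not"],
}


def banned_phrases(text: str, mode: str) -> list[str]:
    """Return every banned phrase found in text for the given mode."""
    candidates = CLICHE_GUARD + NEVER_SAY.get(mode, [])
    hits = []
    for phrase in candidates:
        # Word boundaries keep "robust" from matching inside "robustness".
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits
```

Appending one string to `CLICHE_GUARD` changes the result for every mode at once, which is the propagation described above. A real guard would likely also need to decide how to handle inflected forms such as “robustness”; this sketch deliberately lets them through.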

Layer 3: Format

Format is what changes by content type. A blog post needs frontmatter, a filename convention, a category, a length target. A Slack thread needs a hook, a body, a close. An email needs subject, greeting, body, action. A strategy doc needs thesis, evidence, what’s missing, so what.

Format skills are pluggable. Any format skill can call the voice layer. Blog format + publishing voice. Slack format + casual voice. Strategy doc format + leadership voice. The combination is arbitrary because the layers are independent.

This is the same principle behind separation of concerns in software architecture. The format skill does not know about voice. The voice layer does not know about format. Both apply — simultaneously, independently.

Layer 4: Publish

Publishing is always a separate explicit step. Never bundled into format.

The format skill produces a draft. When the draft is ready, a publish step handles the mechanics: RAG optimization for AI-readable structure, filename validation, upload, verification. One publish skill handles any content type and any destination, because publish is format-agnostic.

Why separate? Because “format” and “ready to publish” are different states. A draft can be formatted correctly and still need review. Separating the layers makes that review natural — the format skill delivers a draft, the author reviews, the publish step runs when ready.
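
One way to keep “formatted” and “ready to publish” as distinct states is to make the publish step refuse anything the author has not approved. The sketch below assumes a hypothetical `Draft` record, an `approved` flag, a `.md` filename rule, and an `upload` callable supplied by the caller; RAG optimization and post-upload verification are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    filename: str
    body: str
    approved: bool = False  # set by the author after review, never by the format skill


def publish(draft: Draft, upload) -> str:
    """Layer 4: runs only on an approved draft; `upload` is any callable destination."""
    if not draft.approved:
        raise ValueError("draft has not been reviewed")
    # Assumption: filename validation here is just an extension check.
    if not draft.filename.endswith(".md"):
        raise ValueError(f"unexpected filename: {draft.filename}")
    # The destination is injected, so publish stays format-agnostic.
    return upload(draft.filename, draft.body)
```

A format skill would return a `Draft` with `approved` left at `False`; only the author flips it, and only then does `publish` touch the destination.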

What the Market Offers

Every major tool treats voice as a setting, not a layer:

Approach	What it does	What’s missing
Custom GPT / Claude Styles	Single voice profile from samples	No mode switching. A DM, an exec email, and a blog post all get the same voice.
Per-skill voice encoding	Voice defined inside each writing skill	5 skills = 5 definitions = drift
Engram builder (native)	Extracts one profile from message corpus	Single mode. No auto-detection. No never_say.
Brand voice guides	Organizational standards	Not machine-readable. Not enforced at write time.

Nobody has separated mode detection, voice quality, format, and publish into independent layers with clean interfaces between them. The closest analog is what Google did for visual identity with DESIGN.md — a single machine-readable source of truth for brand standards, called by any agent building UI. The writing equivalent is a centralized voice layer, called by any skill producing written output.

The Consistency Principle in Practice

What changes by mode: casual is shorter and warmer. Professional is strategic and data-specific. Leadership is personal and confident. Field is customer-obsessed. Publishing is universal and evidence-based. Builder is precise and structured.

What never changes: evidence-backed claims. No clichés. No hedging. Specific over vague. Peer voice, not trainer voice.

A blog post and a Slack thread about the same topic should feel like the same person wrote them. One is longer and more structured. The other is shorter and more direct. But the thinking, the specificity, the conviction, and the anti-patterns are identical — because those properties live in the voice layer, not in the blog skill or the Slack skill.

The person is the constant. The mode is the variable. The medium is irrelevant.

This is the final post in a four-part series on building mode-specific voice profiles for AI agents. The series starts here.