
The Right Model for the Right Job

Everyone building with LLMs starts the same way: pick the best model, get it working, ship it. That’s the right call early on — you want the model’s ceiling, not its floor, when you’re still figuring out what works.

But once something is working, the question shifts. Not “does it work?” but “does it work well enough to use?” And for most real applications, “well enough” means fast enough, cheap enough, and reliable enough to become a habit rather than a novelty.

That’s where model selection becomes an architectural decision, not just a default.


The Default Problem

The default is always the strongest model available. That makes sense for exploration — you want maximum capability when you don’t yet know what the task requires. But it creates a pattern where every step in an agent pipeline runs on a model built for tasks far harder than what’s actually being asked of it.

Consider what most agent steps actually do:

  - Classify intent or route a request to the right tool
  - Extract structured fields from a message
  - Format retrieved content into a reply
  - Acknowledge or confirm a simple action

None of these require deep multi-step reasoning. They require speed, instruction-following, and consistent output formatting. A frontier reasoning model applied to a formatting task is like using a surgeon to take a blood pressure reading — qualified, but not the right fit.


The Insight: Profile First, Right-Size Second

The practical approach is two phases.

Phase 1 — Build with the best model. Get the pipeline working end-to-end. Don’t optimize yet. The most capable model will paper over prompt gaps and architectural awkwardness. Let it.

Phase 2 — Profile where time actually goes. Once it works, instrument the pipeline. You’ll find the latency is rarely distributed the way you expect. Common findings:

  - Sequential LLM round-trips dominate, and each one adds latency proportional to the model’s size
  - Retrieval and other network calls are cheap compared to the model calls they feed
  - The slowest step is rarely the one you assumed

Once you know where the time goes, you can ask: what does this step actually need?
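A minimal way to do this instrumentation, using only the standard library — the step names and the `time.sleep` stand-ins are placeholders for real retrieval and model calls:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(step: str):
    """Record the wall-clock duration of a pipeline step under its name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

# Hypothetical usage: wrap each step of the pipeline.
with timed("kb_retrieval"):
    time.sleep(0.05)  # stand-in for the real retrieval call
with timed("llm_call"):
    time.sleep(0.10)  # stand-in for the real model call

slowest = max(timings, key=timings.get)
print(f"slowest step: {slowest} ({timings[slowest]:.2f}s)")
```

Logging per-step durations like this is usually enough to see which calls dominate before reaching for a full tracing setup.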


Three Tiers, Not One

A useful mental model: think of the task’s cognitive load, not its business importance.

| Task type | What it needs | Model tier |
| --- | --- | --- |
| Multi-step reasoning, ambiguous inputs | Broad world knowledge, long context, nuanced judgment | Frontier (Opus, GPT-4) |
| Synthesis from retrieved content | Instruction-following, consistent formatting | Mid-tier (Sonnet) |
| Classification, routing, extraction | Speed, structured output, low latency | Fast (Haiku) |
| Acknowledgement, simple confirmation | Minimal — almost template-level | Fast or hardcoded |

The key insight: retrieval-augmented generation shifts cognitive load from the model to the retrieval system. When you’ve pre-fetched relevant context and injected it into the prompt, the model’s job is no longer to know the answer — it’s to format it. That’s a fundamentally different task with a fundamentally different model requirement.
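The tiering above can be expressed as a simple routing map. This is an illustrative sketch, not a real client — the tier names and model identifiers are placeholders:

```python
# Illustrative mapping from task type to model tier; the model IDs
# are placeholders, not specific model versions.
MODEL_TIERS = {
    "reasoning": "frontier-model",     # multi-step reasoning, ambiguous inputs
    "synthesis": "mid-tier-model",     # formatting retrieved content
    "extraction": "fast-model",        # classification, routing, extraction
    "acknowledgement": "fast-model",   # near-template replies
}

def pick_model(task_type: str) -> str:
    """Fall back to the frontier tier when the task type is unknown."""
    return MODEL_TIERS.get(task_type, MODEL_TIERS["reasoning"])

print(pick_model("extraction"))
```

Defaulting unknown task types to the frontier tier keeps the failure mode safe: an unclassified step gets the most capable model, not the cheapest one.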


The Pipeline Example

The natural architecture for a fast agent is one that pre-fetches context in parallel and collapses reasoning into a single LLM call:

message arrives

[parallel]
  → KB retrieval / semantic cache lookup
  → (optional) thread history fetch

single LLM call:
  - context pre-loaded in message
  - model formats and posts reply in one turn
  - no second round-trip needed

This matters because every LLM round-trip adds latency proportional to the model’s size. With a fast model and pre-fetched context, a question-answering agent can respond in under 10 seconds where a naive two-call architecture with a frontier model would take 25–30 seconds.
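The shape of that architecture can be sketched with `asyncio` — the retrieval, history, and model functions here are hypothetical stand-ins with simulated latency, not a real API:

```python
import asyncio

# Hypothetical stand-ins for real retrieval and model clients.
async def kb_retrieval(query: str) -> str:
    await asyncio.sleep(0.05)  # simulated network latency
    return f"docs for: {query}"

async def thread_history(thread_id: str) -> str:
    await asyncio.sleep(0.05)
    return f"history for: {thread_id}"

async def llm_call(prompt: str) -> str:
    await asyncio.sleep(0.1)
    return f"reply based on [{prompt}]"

async def answer(query: str, thread_id: str) -> str:
    # Pre-fetch context in parallel, then make a single LLM call
    # with everything inlined: no second round-trip.
    docs, history = await asyncio.gather(
        kb_retrieval(query), thread_history(thread_id)
    )
    return await llm_call(f"{docs}\n{history}\n{query}")

reply = asyncio.run(answer("how do refunds work?", "T-123"))
print(reply)
```

Because the two fetches overlap, their cost is the slower of the two rather than the sum, and the single model call is the only serial LLM latency left.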

The trade-off is prompt precision. Smaller models are less forgiving of ambiguous instructions — you have to be explicit about what you want. The system prompt needs to be clearer. Tool descriptions need to be tighter. But this is usually a 30-minute problem, not a fundamental limitation.


When to Use the Bigger Model

This isn’t an argument for always using the smallest model. The right call depends on what the step actually requires.

Use the frontier model when:

  - The step involves multi-step reasoning or genuinely ambiguous inputs
  - The model must draw on broad knowledge rather than pre-fetched context
  - Extra latency is an acceptable price for the quality gain

Use the faster model when:

  - The step is classification, routing, extraction, or formatting
  - Relevant context is already retrieved and injected into the prompt
  - Latency directly determines whether people use the product at all


Cost Is a Byproduct, Not the Goal

Cost reduction usually follows naturally from right-sizing, but it shouldn’t be the primary motivation. The real goal is building something people actually use.

A bot that responds in 10 seconds gets used. One that takes 30 seconds becomes something people route around. The product outcome is different — not just cheaper, but better.

Similarly, a faster model that responds reliably is more valuable than a slower, smarter model that occasionally produces slightly better output. Consistency and speed are underrated product qualities in AI tooling.


The Prompt Corollary

Switching to a faster model forces better prompt engineering. This is a feature, not a bug.

Frontier models can reason their way around a vague instruction. Faster models follow instructions literally. When you move from “figure it out” to “do exactly this,” you’re forced to make your intent explicit — which usually improves reliability across all models, including the frontier ones.
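A hypothetical before-and-after illustrates the shift — both prompts are invented for this example, not taken from any real system:

```python
# "Figure it out": a frontier model fills in the gaps, a fast model guesses.
VAGUE = "Summarize the ticket and reply appropriately."

# "Do exactly this": explicit structure that any tier can follow reliably.
EXPLICIT = (
    "Summarize the ticket in exactly two sentences. "
    "Then reply with: (1) an acknowledgement, (2) the next step, "
    "(3) a link to the relevant doc if one was retrieved. "
    "Output plain text only, no markdown."
)

print(EXPLICIT)
```

The explicit version costs a few more tokens but removes the interpretation step, which is exactly the step smaller models are worst at.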

The discipline of writing for a smaller model is the discipline of knowing exactly what you want. That clarity pays off regardless of which model you end up using.


Where to Start

If you’re building an agent today:

  1. Start with the frontier model. Validate the use case. Don’t optimize early.
  2. Instrument every step once it works. Log timestamps at tool calls, LLM calls, and network calls separately.
  3. Profile the bottleneck. It’s almost never where you expect.
  4. Ask “what does this step need?” for each slow step. Reasoning? Or formatting?
  5. Move steps that need formatting, not reasoning, to a faster model. Tighten the prompt. Test.
  6. Keep the frontier model for genuinely hard steps — or as a configurable override.

The goal isn’t the cheapest pipeline. It’s the fastest, most reliable one that delivers the quality your use case requires. Model selection is one lever among many — but it’s often the most underused one.


The underlying principle — match the model’s capability to the task’s cognitive load — holds regardless of which models, frameworks, or providers you’re working with. The pattern is architectural, not vendor-specific.

