When Sonnet Isn't Enough

5 min read · February 3, 2026

Tags: agents, models, prompt-engineering

Day 3 was quieter — a Saturday, mostly spent refining agents and adding utility capabilities. But it surfaced a lesson that would shape the entire architecture: model selection matters more than prompt quality for complex synthesis tasks.

[Figure: Model quality comparison, Opus vs Sonnet across task types]

The Problem

My first agents all ran on Sonnet (Claude's mid-tier model). It's fast, cheap, and handles most coding tasks beautifully. But when I tested the morning briefing agent, the results were... flat.

The agent could read calendar data. It could pull Jira tickets. It could search Confluence. But when it came time to synthesize — to connect a standup discussion to a sprint risk to a meeting prep item — the output read like a list, not a briefing.

It would produce:

```
Meeting: Sprint Review

Related tickets: PROJ-123, PROJ-456
Recent Confluence: Sprint 4 Goals page
```

When what I needed was:

```
Meeting: Sprint Review

PROJ-123 (auth migration) is blocked — discussed in Tuesday's standup but no update since. Blocker: waiting on platform team response.
Sprint velocity is 15% below target (32/38 points). The auth migration accounts for 8 points of the gap.
Prep: Review the risk register before this meeting. The platform dependency was flagged 2 sprints ago and hasn't moved.
```

The difference isn't about having more data — both outputs had access to the same information. It's about reasoning across sources to produce insight.

The Model Decision

I switched all agents to Opus (Claude's highest-tier model). The difference was immediate:

Synthesis quality: Opus connects dots across 3-4 data sources naturally
Instruction following: Opus handles complex multi-step instructions without skipping steps
Judgment calls: Opus knows when to flag something as a risk vs. just listing it as a fact

The trade-off is cost. Opus is roughly 5x more expensive than Sonnet per token. But for agents that run once or twice a day and need to produce trustworthy output, the cost delta is worth it.
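
To put a number on that trade-off, here is a back-of-the-envelope sketch. The 5x ratio and the once-or-twice-a-day cadence come from the paragraph above. The per-run Sonnet cost is a made-up placeholder, and the sketch assumes both models use about the same number of tokens for the same task, which may not hold in practice.

```python
# Rough cost comparison for a scheduled agent.
# ASSUMPTION: the per-run Sonnet cost is a placeholder; substitute your own measurement.
# ASSUMPTION: both models consume roughly the same number of tokens per run.
OPUS_TO_SONNET_RATIO = 5  # "roughly 5x" per token, per the text above


def monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Total spend for an agent that runs a fixed number of times per day."""
    return cost_per_run * runs_per_day * days


def opus_premium(sonnet_cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Extra monthly spend from running the same workload on Opus instead of Sonnet."""
    sonnet = monthly_cost(sonnet_cost_per_run, runs_per_day, days)
    opus = monthly_cost(sonnet_cost_per_run * OPUS_TO_SONNET_RATIO, runs_per_day, days)
    return opus - sonnet


if __name__ == "__main__":
    # Hypothetical: a briefing that costs $0.50 per run on Sonnet, run twice a day.
    print(f"Extra cost per month on Opus: ${opus_premium(0.50, 2):.2f}")
```

The premium scales linearly with run count, which is why it is easiest to justify for agents that run only once or twice a day.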

Utility Agents

I also added two non-TPM agents on Day 3:

documentation-engineer: For creating and organizing technical docs, API references, and guides
project-manager: For project planning, risk assessment, and timeline management

These are general-purpose — not tied to my specific TPM workflows. Think of them as "expert modes" I can invoke when needed.

Prompt Engineering Lessons

While testing agents, I learned a few prompt patterns that dramatically improved output quality:

1. Give calibration points, not just instructions.

Bad: "Pull the relevant Jira tickets for each meeting."

Good: "Expect 3-8 relevant tickets per meeting. If you find fewer than 2, broaden your JQL query. If you find more than 10, you're probably too broad."
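
The same calibration points can also serve as a sanity check on whatever the agent returns. This is a minimal sketch, not part of my actual setup: the function name and the "borderline" label are hypothetical, and the thresholds are copied from the example prompt above.

```python
# Calibration thresholds taken from the example prompt above.
EXPECTED_MIN, EXPECTED_MAX = 3, 8  # "expect 3-8 relevant tickets per meeting"
TOO_FEW, TOO_MANY = 2, 10          # fewer than 2: broaden; more than 10: too broad


def calibration_check(ticket_count: int) -> str:
    """Classify a result count against the calibration points in the prompt."""
    if ticket_count < TOO_FEW:
        return "broaden"      # query is probably too narrow
    if ticket_count > TOO_MANY:
        return "narrow"       # query is probably too broad
    if EXPECTED_MIN <= ticket_count <= EXPECTED_MAX:
        return "ok"
    # 2, 9, or 10 tickets: not flagged by the prompt, but outside the expected range.
    return "borderline"


if __name__ == "__main__":
    for count in (1, 5, 9, 12):
        print(count, calibration_check(count))
```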

2. Define what "done" looks like with concrete examples.

Including a sample output section in the agent's instructions — even just a skeleton — improved format consistency by roughly 100%.
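
As an illustration, a skeleton can be appended to the instructions programmatically. Everything here is an assumption made for the sketch: the placeholder fields mirror the briefing example earlier in this post, and the function name and section heading are hypothetical rather than taken from my agent files.

```python
# Hypothetical skeleton; the fields mirror the briefing example earlier in this post.
SAMPLE_OUTPUT = """\
Meeting: <meeting title>

<ticket> (<short description>) is <status>. Blocker: <blocker or "none">.
Sprint velocity: <summary against target>.
Prep: <one concrete action before the meeting>."""


def build_instructions(task: str, sample_output: str = SAMPLE_OUTPUT) -> str:
    """Append a 'what done looks like' section to the agent's task description."""
    return (
        f"{task.strip()}\n\n"
        "## Sample output\n\n"
        "Match this structure. The placeholders show the shape, not the content.\n\n"
        f"{sample_output}\n"
    )


if __name__ == "__main__":
    print(build_instructions("Prepare a briefing for each meeting on today's calendar."))
```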

3. Separate "gather" from "synthesize" steps.

Agents produce better analysis when data collection and interpretation are explicitly separate phases. "First, pull all the data. Then, analyze it." Sounds obvious, but without this separation, the agent tends to start interpreting before it has the full picture.
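
The two-phase idea can be sketched in code as well. This is an assumed structure, not my actual implementation: the stub fetchers stand in for real calendar, Jira, and Confluence calls, and all names are hypothetical. The point is that the synthesis prompt is only built once every source has returned.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Gathered:
    """Raw results keyed by source name; no interpretation happens here."""
    by_source: dict[str, list[str]] = field(default_factory=dict)


def gather(sources: dict[str, Callable[[], list[str]]]) -> Gathered:
    """Phase 1: call every source and store the results untouched."""
    return Gathered({name: fetch() for name, fetch in sources.items()})


def build_synthesis_prompt(data: Gathered) -> str:
    """Phase 2: only after all data is in, ask for analysis across sources."""
    sections = []
    for name, items in data.by_source.items():
        lines = "\n".join(f"- {item}" for item in items) or "- (nothing found)"
        sections.append(f"[{name}]\n{lines}")
    return (
        "All data has been collected. Do not fetch anything further.\n\n"
        + "\n\n".join(sections)
        + "\n\nNow analyze: connect items across sources and flag risks."
    )


if __name__ == "__main__":
    # Stub fetchers stand in for real calendar/Jira/Confluence calls.
    sources = {
        "calendar": lambda: ["Sprint Review"],
        "jira": lambda: ["PROJ-123", "PROJ-456"],
        "confluence": lambda: ["Sprint 4 Goals page"],
    }
    print(build_synthesis_prompt(gather(sources)))
```

Keeping the raw results in their own structure also makes it easy to log exactly what the model saw when a briefing comes out wrong.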

The Cost Pause

Day 3 cost $68, lower than Day 2 because I ran fewer agent dispatches. But I noticed a pattern: every test run costs $5-15 in API calls, and iterating on prompt quality means running the agent 3-5 times to evaluate each change.

This is the hidden cost of agent development. The code is free — it's just markdown files. But testing is expensive because each run touches real APIs and processes real data.
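
The arithmetic behind that is worth spelling out. The sketch below uses only the figures quoted above ($5-15 per run, 3-5 runs per change, a $68 day); the function names are hypothetical.

```python
def iteration_cost_range(
    runs: tuple[int, int] = (3, 5),
    cost_per_run: tuple[float, float] = (5.0, 15.0),
) -> tuple[float, float]:
    """Return the (low, high) dollar cost of evaluating one prompt change."""
    return runs[0] * cost_per_run[0], runs[1] * cost_per_run[1]


def changes_within_budget(budget: float) -> tuple[int, int]:
    """Return (worst case, best case) count of prompt changes a budget covers."""
    low, high = iteration_cost_range()
    return int(budget // high), int(budget // low)


if __name__ == "__main__":
    low, high = iteration_cost_range()
    print(f"One prompt change: ${low:.0f} to ${high:.0f}")
    worst, best = changes_within_budget(68)
    print(f"A $68 day covers {worst} to {best} fully evaluated changes")
```

At the cheap end, a day's spend covers a handful of refinements; at the expensive end, it does not cover even one fully evaluated change.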

Key Takeaways

Model quality trumps prompt quality for synthesis tasks — you can't prompt your way to reasoning the model isn't capable of
Use the expensive model for high-judgment tasks, and plan to right-size later once you know which tasks are mechanical
Give agents calibration points ("expect 3-8 results") rather than just instructions
Separate data collection from analysis in agent workflows
Budget for iteration — each prompt refinement costs a real test run