Most teams start their eval suite with the happy path: write a scenario for the textbook version of each feature, verify it passes, move on. Six months later, they have 200 scenarios, 100% passing, and a production incident every other sprint.
The happy path is the path you thought of. Regressions live everywhere else.
A good eval suite covers three categories, not one:
The canonical version of each task. These should always pass. If they don't, something is seriously broken. Don't over-index on these — they're necessary but not sufficient.
Variations on the happy path that expose brittleness. Misspellings, ambiguous inputs, incomplete information, uncommon phrasings. If your agent handles "I want a refund" but fails on "can u plz cancel my order and get my money back" — that's a regression waiting to happen.
The most important category: scenarios written specifically to prevent bugs from recurring. Every time you fix a production bug, write a scenario that would have caught it. This is your institutional memory.
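The second category is cheap to write: clone the canonical scenario and vary only the input. A sketch, using the same YAML shape as the examples below (the keys are borrowed from them; treat the exact schema as illustrative):

```yaml
scenarios:
  # One canonical refund scenario...
  - name: refund_canonical
    input: "I want a refund for my order"
    assert:
      - tool_called: "lookup_order"

  # ...and variations that change only the input
  - name: refund_slang
    input: "can u plz cancel my order and get my money back"
    assert:
      - tool_called: "lookup_order"

  - name: refund_missing_order_id
    input: "money back please"
    assert:
      # With no order to look up, the agent should ask, not guess
      - not_contains: "Your refund has been processed"
```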
The best eval suite is a graveyard of production bugs. Every scenario should have a story.
A well-written eval scenario has four properties: a single intent, a realistic input, behavioral assertions, and a story behind it.
```yaml
scenarios:
  # Good: single intent, realistic input, behavioral assertions
  - name: refund_ambiguous_phrasing
    input: "hey can u help me cancel my thing from last week and get money back"
    assert:
      - tool_called: "lookup_order"
      - contains: "refund"
      - not_contains: "I don't understand"
      - sentiment: "helpful"

  # Also good: regression scenario for a past bug
  - name: regression_no_hallucinate_order
    input: "I want a refund for order #99999"
    tags: [regression, bug-2024-11-12]
    assert:
      - tool_called: "lookup_order"
      - tool_arg: { fn: "lookup_order", key: "order_id", value: "99999" }
      # Bug was: agent hallucinated a response when order wasn't found
      - not_contains: "Your refund has been processed"
```
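Scenarios like these only pay off if they run on every change. In CI, that might look like the following GitHub Actions job; note that the `agentjig` CLI name and its flags here are assumptions for illustration, not documented commands:

```yaml
# .github/workflows/evals.yml — hypothetical wiring, assuming an
# `agentjig` CLI that runs every scenario file under evals/
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Fail the build if any scenario fails; the regression tag
      # lets you report those failures separately
      - run: agentjig run evals/ --fail-on-error --report-tag regression
```

Running on every pull request is what turns the regression category into a guarantee: a bug fixed once stays fixed.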
How many scenarios do you need? Fewer than you think. A focused suite of 30 high-quality scenarios will catch more regressions than a sprawling suite of 300 happy-path clones.
Weight the suite toward edge cases and regressions rather than canonical cases, and expect the regression category to only grow: every production bug becomes a permanent eval case.
A side effect of a well-written eval suite: it documents your agent's expected behavior better than any README. When someone new joins your team, the eval suite tells them: here is every case the agent should handle, and here is exactly what correct behavior looks like.
Write your eval cases as if they're the spec. Because they are.
Agent Jig makes YAML eval cases first-class — write them in your repo, run them in CI, add new ones every time you fix a bug. Start your eval suite free.
YAML-based, CLI-runnable, CI-integrated. Free plan to get started.