Most teams start their eval suite with the happy path: write a scenario for the textbook version of each feature, verify it passes, move on. Six months later, they have 200 scenarios, 100% passing, and a production incident every other sprint.
The happy path is the path you thought of. Regressions live everywhere else.
A good eval suite covers three categories, not one:
The canonical version of each task. These should always pass. If they don't, something is seriously broken. Don't over-index on these — they're necessary but not sufficient.
Variations on the happy path that expose brittleness. Misspellings, ambiguous inputs, incomplete information, uncommon phrasings. If your agent handles "I want a refund" but fails on "can u plz cancel my order and get my money back" — that's a regression waiting to happen.
The most important category: scenarios written specifically to prevent bugs from recurring. Every time you fix a production bug, write a scenario that would have caught it. This is your institutional memory.
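The second category is cheap to write: clone the canonical scenario and vary only the input. A sketch, using the same YAML shape as the examples below (the keys are borrowed from them; treat the exact schema as illustrative):

```yaml
scenarios:
  # One canonical refund scenario...
  - name: refund_canonical
    input: "I want a refund for my order"
    assert:
      - tool_called: "lookup_order"

  # ...and variations that change only the input
  - name: refund_slang
    input: "can u plz cancel my order and get my money back"
    assert:
      - tool_called: "lookup_order"

  - name: refund_missing_order_id
    input: "money back please"
    assert:
      # With no order to look up, the agent should ask, not guess
      - not_contains: "Your refund has been processed"
```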
The best eval suite is a graveyard of production bugs. Every scenario should have a story.
A well-written eval scenario has four properties: a single intent, a realistic input, behavioral assertions, and a story behind it.
```yaml
scenarios:
  # Good: single intent, realistic input, behavioral assertions
  - name: refund_ambiguous_phrasing
    input: "hey can u help me cancel my thing from last week and get money back"
    assert:
      - tool_called: "lookup_order"
      - contains: "refund"
      - not_contains: "I don't understand"
      - sentiment: "helpful"

  # Also good: regression scenario for a past bug
  - name: regression_no_hallucinate_order
    input: "I want a refund for order #99999"
    tags: [regression, bug-2024-11-12]
    assert:
      - tool_called: "lookup_order"
      - tool_arg: { fn: "lookup_order", key: "order_id", value: "99999" }
      # Bug was: agent hallucinated a response when order wasn't found
      - not_contains: "Your refund has been processed"
```
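Scenarios like these only pay off if they run on every change. In CI, that might look like the following GitHub Actions job; note that the `agentjig` CLI name and its flags here are assumptions for illustration, not documented commands:

```yaml
# .github/workflows/evals.yml — hypothetical wiring, assuming an
# `agentjig` CLI that runs every scenario file under evals/
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Fail the build if any scenario fails; the regression tag
      # lets you report those failures separately
      - run: agentjig run evals/ --fail-on-error --report-tag regression
```

Running on every pull request is what turns the regression category into a guarantee: a bug fixed once stays fixed.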
How many scenarios do you need? Fewer than you think. A focused suite of 30 high-quality scenarios will catch more regressions than a sprawling suite of 300 happy-path clones.
Weight the suite toward edge cases and regressions rather than canonical cases, and expect the regression category to only grow: every production bug becomes a permanent eval case.
A side effect of a well-written eval suite: it documents your agent's expected behavior better than any README. When someone new joins your team, the eval suite tells them: here is every case the agent should handle, and here is exactly what correct behavior looks like.
Write your eval cases as if they're the spec. Because they are.
Agent Jig makes YAML eval cases first-class — write them in your repo, run them in CI, add new ones every time you fix a bug. Start your eval suite free.
YAML-based, CLI-runnable, CI-integrated. Free plan to get started.