AI agent evaluation 101: why you cannot test an agent like a chatbot

The first instinct when testing an AI agent is to treat it like a chatbot: send it a message, look at the reply, decide if it's good. This works fine for a demo. It fails catastrophically at scale.

Here's why — and what you should do instead.

Chatbots respond. Agents decide.

A chatbot's job is to produce a response. You can evaluate that response directly: was it helpful? Accurate? Grammatically correct? The output is the product.

An agent's job is to complete a task. It might call five different tools, branch based on what it finds, maintain state across multiple steps, and produce a final result that depends on the entire execution path — not just the last message.

When you test an agent with "does the output look good?", you're evaluating the final sentence of a 10-step process. You're missing 90% of what actually happened.

What actually needs to be evaluated

When evaluating an AI agent, you need to assess several dimensions at once:

  - Tool selection: did it call the right tools, and avoid calling the wrong ones?
  - Tool arguments: did it pass the correct values (the right order ID, for example)?
  - Execution path: did it reach the goal in a reasonable number of steps, without loops or dead ends?
  - Final output: is the answer accurate, complete, and appropriately worded?
  - Behavior: did it stay on task, respect constraints, and strike the right tone?

A chatbot eval can get away with judging just the output. An agent eval needs to judge the entire trace.
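
Concretely, "the trace" is the full record of an execution: every tool call with its arguments and result, plus the final message. As a rough sketch (the field names here are assumptions for illustration, not any particular framework's API):

import dataclasses
from typing import Any

@dataclasses.dataclass
class ToolCall:
    tool: str                  # which tool the agent invoked
    args: dict[str, Any]       # the arguments it passed
    result: Any                # what the tool returned

@dataclasses.dataclass
class Trace:
    input: str                 # the user's original request
    steps: list[ToolCall]      # every tool call, in order
    output: str                # the agent's final message

An agent eval asserts against the whole Trace; a chatbot eval only ever looks at output.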

Why replay tests break

One common approach is to save "golden conversations" and replay them. If the new version produces the same outputs, it passes. This works until the agent gets smarter.

A better agent might solve the same problem with fewer tool calls. Replay tests will mark it as failing.

The problem is that replay tests evaluate exact outputs, not correct behavior. An agent that achieves the same goal via a different (and possibly better) path will fail replay tests. You end up penalizing improvement.

The right mental model: evaluate assertions about behavior, not exact outputs. "Did the agent call `lookup_order` at some point during this task?" is a better assertion than "did the agent's second message exactly match this string?"
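
For example, assuming a trace dict with a list of tool-call steps and a list of messages (names chosen for illustration, not a specific framework), the two styles of check look like this:

# Brittle: pins the agent to one exact transcript.
def exact_replay_check(trace: dict, golden_messages: list[str]) -> bool:
    return trace["messages"] == golden_messages

# Robust: asserts a behavior any correct execution should exhibit.
def tool_called(trace: dict, name: str) -> bool:
    return any(step["tool"] == name for step in trace["steps"])

A smarter agent that finishes the task in fewer messages still passes tool_called(trace, "lookup_order") but fails the exact replay check.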

The assertion-based approach

Agent evals should be composed of typed assertions against the execution trace:

scenarios:
  - name: refund_request
    input: "I want a refund for order #12345"
    assert:
      # Tool-level assertions
      - tool_called: "lookup_order"
      - tool_arg: { fn: "lookup_order", key: "order_id", value: "12345" }
      # Output assertions  
      - contains: "refund"
      - not_contains: "I cannot help with that"
      # Behavioral assertions
      - sentiment: "empathetic"
      - steps_max: 4
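
A minimal checker for a scenario like this might look like the sketch below. This is not Agent Jig's implementation; it assumes a trace dict with "steps" (tool calls carrying "tool" and "args") and "output" (the final message), and it punts the sentiment assertion to an LLM judge, which is the usual way to handle subjective checks.

import yaml  # pip install pyyaml

def check_assertion(assertion: dict, trace: dict) -> bool:
    kind, expected = next(iter(assertion.items()))
    if kind == "tool_called":
        return any(step["tool"] == expected for step in trace["steps"])
    if kind == "tool_arg":
        return any(
            step["tool"] == expected["fn"]
            and str(step["args"].get(expected["key"])) == str(expected["value"])
            for step in trace["steps"]
        )
    if kind == "contains":
        return expected.lower() in trace["output"].lower()
    if kind == "not_contains":
        return expected.lower() not in trace["output"].lower()
    if kind == "steps_max":
        return len(trace["steps"]) <= expected
    if kind == "sentiment":
        # Subjective checks usually go to an LLM judge; not implemented here.
        raise NotImplementedError("wire up an LLM judge for sentiment assertions")
    raise ValueError(f"unknown assertion type: {kind}")

def run_scenario(scenario: dict, run_agent) -> dict:
    # run_agent is assumed to execute your agent and return a trace dict
    # with "steps" and "output" as described above.
    trace = run_agent(scenario["input"])
    results = {str(a): check_assertion(a, trace) for a in scenario["assert"]}
    return {"name": scenario["name"], "passed": all(results.values()), "results": results}

with open("scenarios.yaml") as f:  # hypothetical scenario file like the one above
    scenarios = yaml.safe_load(f)["scenarios"]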

These assertions are:

  - Typed: each one has a specific kind (tool_called, contains, steps_max), so a failure points at the exact behavior that regressed
  - Behavioral: they describe what a correct execution does, not what one particular transcript said
  - Path-tolerant: an agent that reaches the goal via a different (and possibly better) route still passes
  - Stable: they hold across the small variations that come with non-deterministic agents

Scenarios, not sessions

The unit of an agent eval is a scenario, not a session. A scenario is a single, isolated test case: one input, one expected set of behaviors, one pass/fail outcome.

This matters because agents are non-deterministic. Running the same scenario twice might produce slightly different traces — but the important assertions (was the right tool called? was the output appropriate?) should be stable. If your assertions are so narrow that they fail on non-determinism, they're testing the wrong thing.
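
One practical guard is to run each scenario several times and require the assertions you care about to hold on every run. A rough sketch, where run stands in for whatever executes one scenario and reports per-assertion results:

from typing import Callable

def assertion_stability(scenario: dict, run: Callable[[dict], dict], runs: int = 5) -> dict[str, float]:
    # `run` executes one scenario and returns {"results": {assertion: bool, ...}}.
    tallies: dict[str, int] = {}
    for _ in range(runs):
        for name, passed in run(scenario)["results"].items():
            tallies[name] = tallies.get(name, 0) + int(passed)
    # Pass rate per assertion: 1.0 means stable; anything lower is either a real
    # regression or an assertion pinned to incidental details of one trace.
    return {name: count / runs for name, count in tallies.items()}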

Where to start

If you're evaluating an AI agent for the first time, start here:

  1. Identify 5 representative user tasks (refund, lookup, escalate, inform, disambiguate)
  2. For each task, write one input and 2–3 assertions about what a correct execution looks like
  3. Run them against your current agent and treat the results as your baseline (a simple baseline sketch follows this list)
  4. Add new scenarios whenever you find a bug in production
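
For step 3, the baseline can be as simple as a JSON file of per-scenario pass/fail results that later runs are compared against. A minimal sketch (file name and result shape are assumptions):

import json

def save_baseline(results: dict[str, bool], path: str = "baseline.json") -> None:
    # results maps scenario name -> passed, e.g. {"refund_request": True, ...}
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

def regressions(results: dict[str, bool], path: str = "baseline.json") -> list[str]:
    # Scenarios that passed at baseline but fail in the current run.
    with open(path) as f:
        baseline = json.load(f)
    return [name for name, passed in baseline.items() if passed and not results.get(name, False)]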

You don't need 500 scenarios to start. You need 5 good ones. The number grows as your agent grows.

Agent Jig runs YAML-defined scenarios against your agent and scores every assertion. Start with 5 scenarios and add CI integration in one step. Free plan, no credit card.
