Define eval scenarios in YAML. Run them in CI. Get pass/fail scores and regression diffs on every deploy.
The Problem
Teams update a prompt or swap a model and push to production — then find out three days later their agent is worse.
Your model provider upgrades their backend. Your agent's tool-calling rate drops 18%. You find out via a support ticket from your biggest customer.
You improve the system prompt for one persona. The fix works — but now edge cases that were handled fine are silently failing for 4% of users.
Without a locked eval baseline, you have no objective answer to "is this version better or worse than the one from last Tuesday?" Just vibes.
How Agent Jig Works
A jig holds a workpiece in place so every cut is repeatable. Agent Jig does the same for your evals.
Write eval cases as human-readable YAML. Input, expected output, assertions — all in one place. Review them in PRs like any other code.
One CLI command runs your full eval suite. Works in GitHub Actions, CircleCI, and any shell. Fails the build when your agent regresses (see the CI sketch below).
Each scenario produces a deterministic pass or fail. No "it depends." Score trends over time — see exactly when and why quality changed.
Every run is compared against your locked baseline. See which scenarios regressed, by how much, and what changed in the agent's output.
Wraps any agent — LangChain, LlamaIndex, Crew, custom Python, Node. If it takes input and returns output, Agent Jig can evaluate it.
Lock a known-good eval run as your baseline. All future runs are diffed against it. Ship with confidence: green means no regressions.
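For example, the single-command CI run might look like this in a GitHub Actions workflow. This is a minimal sketch under assumptions: the agent-jig command, its run subcommand, the --baseline flag, and the file names are illustrative, not documented syntax.

    # Hypothetical GitHub Actions job: run the suite and fail the build on regression.
    # The agent-jig CLI name, subcommand, and flags are assumptions for illustration.
    on: [pull_request]
    jobs:
      evals:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # Install the CLI however it is distributed (pip, npm, ...) before this step.
          - name: Run eval suite and diff against the locked baseline
            run: agent-jig run scenarios.yaml --baseline baseline.json

A CircleCI job or a plain shell script would run the same command; the process exit code is what fails the build.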
Quick Start
No infrastructure to provision. No SDK to learn. Just YAML and a CLI command.
Define test scenarios: the input to your agent, what you expect back, and which assertions to apply. Start with 5 scenarios. Grow from there.
Point Agent Jig at your agent's endpoint or local process (a config sketch follows the example below). It feeds each scenario in, captures the response, and scores every assertion.
Lock the passing run as your baseline. Add one step to your CI pipeline. Now every PR shows a regression diff before it merges.
scenarios:
  - name: refund_request
    input: "I want a refund for order #12345"
    assert:
      - contains: "refund"
      - tool_called: "lookup_order"
  - name: escalate_angry_customer
    input: "This is the worst service ever"
    assert:
      - sentiment: "empathetic"
      - tool_called: "create_ticket"
      - not_contains: "I cannot help"
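To cover steps two and three, the same file might also declare where the agent lives and which run is the baseline. A sketch under assumed key names: agent, endpoint, command, and baseline are hypothetical, not a documented schema.

    # Hypothetical additions to the scenarios file; key names are illustrative assumptions.
    agent:
      endpoint: "http://localhost:8000/chat"   # HTTP agent: each scenario input is sent here
      # command: "python my_agent.py"          # or a local process started for each run
    baseline: "baselines/last-good-run.json"   # the locked run that future results are diffed against

With a target and a locked baseline declared, the CI step shown earlier is all a PR needs to surface a regression diff before it merges.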
From the Blog
Agents make decisions, call tools, and branch. Unit tests don't work. Here's what does.
Most eval cases are too narrow. Here's a framework for writing scenarios that surface real-world failures.
The most dangerous regressions are the ones you don't see coming. How to catch them before users do.
Free plan. First eval in 5 minutes. No infrastructure required.
Run your first eval free