Define eval scenarios in YAML. Run them in CI. Get pass/fail scores and regression diffs on every deploy.
The Problem
Teams update a prompt or swap a model and push to production — then find out three days later their agent is worse.
Your model provider upgrades their backend. Your agent's tool-calling rate drops 18%. You find out via a support ticket from your biggest customer.
You improve the system prompt for one persona. The fix works — but now edge cases that were handled fine are silently failing for 4% of users.
Without a locked eval baseline, you have no objective answer to "is this version better or worse than the one from last Tuesday?" Just vibes.
How Agent Jig Works
A jig holds a workpiece in place so every cut is repeatable. Agent Jig does the same for your evals.
Write eval cases as human-readable YAML. Input, expected output, assertions — all in one place. Review them in PRs like any other code.
One CLI command runs your full eval suite. Works in GitHub Actions, CircleCI, and any shell. Fails the build when your agent regresses (see the CI sketch below).
Each scenario produces a deterministic pass or fail. No "it depends." Score trends over time — see exactly when and why quality changed.
Every run is compared against your locked baseline. See which scenarios regressed, by how much, and what changed in the agent's output.
Wraps any agent — LangChain, LlamaIndex, Crew, custom Python, Node. If it takes input and returns output, Agent Jig can evaluate it.
Lock a known-good eval run as your baseline. All future runs are diffed against it. Ship with confidence: green means no regressions.
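For example, the single-command CI run might look like this in a GitHub Actions workflow. This is a minimal sketch under assumptions: the agent-jig command, its run subcommand, the --baseline flag, and the file names are illustrative, not documented syntax.

    # Hypothetical GitHub Actions job: run the suite and fail the build on regression.
    # The agent-jig CLI name, subcommand, and flags are assumptions for illustration.
    on: [pull_request]
    jobs:
      evals:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # Install the CLI however it is distributed (pip, npm, ...) before this step.
          - name: Run eval suite and diff against the locked baseline
            run: agent-jig run scenarios.yaml --baseline baseline.json

A CircleCI job or a plain shell script would run the same command; the process exit code is what fails the build.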
Quick Start
No infrastructure to provision. No SDK to learn. Just YAML and a CLI command.
Define test scenarios: the input to your agent, what you expect back, and which assertions to apply. Start with 5 scenarios. Grow from there.
Point Agent Jig at your agent's endpoint or local process (a config sketch follows the example below). It feeds each scenario in, captures the response, and scores every assertion.
Lock the passing run as your baseline. Add one step to your CI pipeline. Now every PR shows a regression diff before it merges.
scenarios:
  - name: refund_request
    input: "I want a refund for order #12345"
    assert:
      - contains: "refund"
      - tool_called: "lookup_order"
  - name: escalate_angry_customer
    input: "This is the worst service ever"
    assert:
      - sentiment: "empathetic"
      - tool_called: "create_ticket"
      - not_contains: "I cannot help"
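To cover steps two and three, the same file might also declare where the agent lives and which run is the baseline. A sketch under assumed key names: agent, endpoint, command, and baseline are hypothetical, not a documented schema.

    # Hypothetical additions to the scenarios file; key names are illustrative assumptions.
    agent:
      endpoint: "http://localhost:8000/chat"   # HTTP agent: each scenario input is sent here
      # command: "python my_agent.py"          # or a local process started for each run
    baseline: "baselines/last-good-run.json"   # the locked run that future results are diffed against

With a target and a locked baseline declared, the CI step shown earlier is all a PR needs to surface a regression diff before it merges.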
From the Blog
Agents make decisions, call tools, and branch. Unit tests don't work. Here's what does.
Most eval cases are too narrow. Here's a framework for writing scenarios that surface real-world failures.
The most dangerous regressions are the ones you don't see coming. How to catch them before users do.
Free plan. First eval in 5 minutes. No infrastructure required.
Run your first eval free