AI Agent Evals for Beginners: Testing Non-Deterministic AI

You read a post like this. You nod along. Then you hit a wall. The same wall I hit. “How do you actually ‘test’ an AI agent when it gives you a slightly different answer every single time?”

A normal unit test checks add(2, 2) == 4. It is 4 today, 4 tomorrow, 4 forever. But an agent? Ask it the same question twice and you get two different sentences. So how does any test pass “the same way every time”?

Here’s the short answer, and the whole point of this post: you stop testing for the exact output, and you start testing for properties that must be true no matter how the agent phrases it. Once that clicks, evals stop being scary. Let me show you with real examples.

The necessary mental shift

Say you build a tiny agent that answers a customer’s refund question. You ask it: “Can I get a refund on order 1234?”

Run one: “Yes, you’re eligible for a refund on order #1234. It will be processed in 3 to 5 days.”

Run two: “Good news, order 1234 qualifies for a refund and you’ll see it within a week.”

A traditional assert output == "..." test is useless here. The strings differ. But look closer. Both answers share the things you actually care about:

They said the customer is eligible (the decision is “approve”).
They mentioned the correct order number, 1234.
They did not leak another customer’s data.
They gave a timeframe.

Those are properties. They’re true in both runs even though the wording changed. An eval is just a test that checks properties instead of exact text. That’s the whole trick. The non-determinism is in the phrasing, not in the facts you require.

So when people say a test “passes the same every time,” they don’t mean the output is identical. They mean the property holds every time. “Did it approve the refund?” is True in both runs, so the test is green in both runs.
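To make that concrete, here is a minimal sketch that runs the same property checks over the two replies above. The helper name check_reply_properties and its keyword and regex checks are my own illustrative assumptions, not a standard API, and matching on prose like this is crude (it would treat "ineligible" as an approval). It is only meant to show that both wordings satisfy the same properties.

```python
import re

RUN_ONE = "Yes, you're eligible for a refund on order #1234. It will be processed in 3 to 5 days."
RUN_TWO = "Good news, order 1234 qualifies for a refund and you'll see it within a week."


def check_reply_properties(reply: str, order_id: int) -> dict:
    """Return one True/False result per property (hypothetical helper)."""
    text = reply.lower()
    return {
        # The reply mentions the order number we asked about.
        "mentions_order": re.search(rf"\b{order_id}\b", text) is not None,
        # The reply signals approval (crude keyword check, for illustration only).
        "approves": any(word in text for word in ("eligible", "qualifies", "approved")),
        # The reply gives some kind of timeframe.
        "gives_timeframe": re.search(r"\b(days?|weeks?)\b", text) is not None,
    }


# Different sentences, same properties: both replies pass every check.
for reply in (RUN_ONE, RUN_TWO):
    results = check_reply_properties(reply, order_id=1234)
    assert all(results.values()), results
```

Checking prose with keywords gets brittle fast, which is why asking the agent for structured fields is the cleaner route.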

Force the agent to give you something checkable

The first practical move: don’t make your agent reply in free-flowing prose if you want to test it cleanly. Have it return structured output (JSON) alongside any human-friendly text. This is the single biggest thing that makes agents testable.

Instead of hoping a sentence contains the word “approve,” you ask the model to fill in a shape:

{
  "decision": "approve",
  "order_id": 1234,
  "reason": "Within 30-day return window",
  "customer_message": "Good news, order 1234 qualifies for a refund..."
}

Now you have hard fields to assert against. The customer_message can vary all it wants. The decision field cannot. You test the field, not the prose.
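Before asserting on those fields, it helps to confirm the reply really is the shape you asked for. Below is a small sketch using only the standard library. The function name parse_agent_output is hypothetical, and the allowed decision values are the three used in this post's examples. Your own agent may need different fields.

```python
import json

ALLOWED_DECISIONS = {"approve", "deny", "needs_review"}


def parse_agent_output(raw: str) -> dict:
    """Parse and validate the agent's JSON reply (hypothetical helper)."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    decision = data.get("decision")
    if not (isinstance(decision, str) and decision in ALLOWED_DECISIONS):
        raise ValueError(f"unexpected decision: {decision!r}")

    order_id = data.get("order_id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(order_id, bool) or not isinstance(order_id, int):
        raise ValueError("order_id must be an integer")

    if not isinstance(data.get("customer_message"), str):
        raise ValueError("customer_message must be a string")
    return data


raw_reply = """
{
  "decision": "approve",
  "order_id": 1234,
  "reason": "Within 30-day return window",
  "customer_message": "Good news, order 1234 qualifies for a refund..."
}
"""
result = parse_agent_output(raw_reply)
print(result["decision"], result["order_id"])
```

A reply that fails to parse is itself a useful eval failure: count it as a miss rather than letting the test crash.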

Level 1: assertion tests (the cheap ones you run constantly)

Hamel Husain lays out three levels of evaluation. Level 1 is plain assertions, and it’s where you should start. No fancy framework, no AI judging anything. Just code that checks the properties above. These are fast and free, so you run them on every change.

Here is a dummy example in pytest. Pretend run_agent() calls your agent and returns that JSON dict:

import re

def test_refund_is_approved_for_valid_order():
    result = run_agent("Can I get a refund on order 1234?")

    # Property 1: the decision is a valid value, not garbage
    assert result["decision"] in {"approve", "deny", "needs_review"}

    # Property 2: for this in-window order, it should approve
    assert result["decision"] == "approve"

    # Property 3: it references the right order
    assert result["order_id"] == 1234

    # Property 4: it never leaks an internal UUID into the customer message
    # (IGNORECASE so uppercase hex is caught too)
    assert not re.search(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}",
        result["customer_message"],
        re.IGNORECASE,
    )

Notice what is happening. Not one of these asserts checks the exact sentence. They check facts and safety rules. Run this 100 times and the wording changes 100 times, but a correctly behaving agent passes all 100. That’s your “same result every time.”

This is also where you encode bugs you’ve already seen. Agent leaked a UUID once in production? Write the regex assert so it can never happen again silently. Your eval suite becomes a memory of every mistake the agent has made.
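If you want to see the “100 runs, 100 wordings, 100 passes” idea without calling a real model, here is a sketch with a stand-in agent. Everything in it is an assumption for illustration: fake_run_agent just picks a random sentence while keeping the structured fields fixed, which is how a well-behaved agent should look from the test's point of view.

```python
import random
import re

PHRASINGS = [
    "Yes, you're eligible for a refund on order #1234.",
    "Good news, order 1234 qualifies for a refund.",
    "Order 1234 is within the return window, so a refund is approved.",
]


def fake_run_agent(query: str, rng: random.Random) -> dict:
    """Stand-in for run_agent(): the wording varies, the structured fields do not."""
    return {
        "decision": "approve",
        "order_id": 1234,
        "customer_message": rng.choice(PHRASINGS),
    }


def satisfies_properties(result: dict) -> bool:
    """The same kind of checks as the pytest example, returned as one boolean."""
    uuid_like = re.search(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}",
        result["customer_message"],
        re.IGNORECASE,
    )
    return (
        result["decision"] == "approve"
        and result["order_id"] == 1234
        and uuid_like is None
    )


rng = random.Random(0)  # seeded only so this sketch is repeatable
results = [fake_run_agent("Can I get a refund on order 1234?", rng) for _ in range(100)]
distinct_messages = {r["customer_message"] for r in results}
pass_count = sum(satisfies_properties(r) for r in results)
print(f"{len(distinct_messages)} distinct wordings, {pass_count}/100 runs passed")
```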

You will not get 100%, and that is fine

Here is the part that trips up people coming from normal testing. Sometimes the agent will fail an eval. Maybe 3 times out of 100 it denies a refund it should have approved. With normal unit tests, one red test means “broken, stop everything.” With evals, you think in pass rates.

You do not ask “did it pass?” You ask “what percentage passed, and did that percentage get worse than last time?” A 97% pass rate that drops to 89% after you tweaked a prompt is the signal. The number is the test, not a single run.

So the workflow looks like this:

Build a small dataset of test cases (start with 10 to 20 by hand).
Run the agent against all of them.
Count how many satisfy your asserts.
Track that score over time. When it drops, you broke something.

test_cases = [
    {"query": "Refund on order 1234?", "expected_decision": "approve"},
    {"query": "I want my money back for 9999", "expected_decision": "deny"},
    {"query": "where is my stuff",          "expected_decision": "needs_review"},
    # ...add more as you find edge cases
]

passes = 0
for case in test_cases:
    result = run_agent(case["query"])
    if result["decision"] == case["expected_decision"]:
        passes += 1

score = passes / len(test_cases)
print(f"Decision accuracy: {score:.0%}")  # e.g. "Decision accuracy: 95%"

That’s a real eval. It’s also just a for loop. Don’t let the jargon convince you it’s more complicated than this to start.
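Tracking the score over time can be just as small. Here is one possible sketch that stores the last good score in a JSON file and flags a drop. The function name check_for_regression, the file name, and the 0.05 tolerance are all assumptions you would tune. Note that with only 10 to 20 cases, a single flipped case moves the score by 5 to 10 points, so a tight tolerance will be noisy.

```python
import json
import tempfile
from pathlib import Path


def check_for_regression(score: float, baseline_path: Path, tolerance: float = 0.05) -> bool:
    """Return True if `score` fell more than `tolerance` below the stored baseline.

    The first run has nothing to compare against, so it just records the score.
    The baseline is only overwritten when the new score is higher.
    """
    if baseline_path.exists():
        baseline = json.loads(baseline_path.read_text())["score"]
        if score < baseline - tolerance:
            return True  # regression: keep the old baseline so the drop stays visible
        if score <= baseline:
            return False  # within tolerance: keep the higher baseline
    baseline_path.write_text(json.dumps({"score": score}))
    return False


# Usage with a throwaway directory; in a real project the file would live in the repo.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_baseline.json"
    first = check_for_regression(0.97, path)   # no baseline yet, records 0.97
    second = check_for_regression(0.89, path)  # more than 0.05 below 0.97, flagged
    print(first, second)
```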

[Figure: the three levels of AI agent evals: assertion tests, LLM-as-a-judge and human review, and A/B testing in production]

Level 2: when there is no clean assert (LLM-as-a-judge)

Asserts are great for things you can pin down: a decision, an order number, a JSON shape, a leaked secret. But what about “was the reply polite?” or “did it actually answer the question?” You cannot regex your way to “polite.”

This is where LLM-as-a-judge comes in. You use a second, capable model to grade the output against a rubric you write. It is still checking a property. You have just handed the judging to a model because the property is fuzzy.

A dummy judge prompt looks like this:

# {order_facts} is a placeholder for the order data the agent was given
judge_prompt = """
You are grading a customer support reply. Score PASS or FAIL.

Order facts the agent had access to:
---
{order_facts}
---

Reply to grade:
---
{reply}
---

It PASSES only if ALL of these are true:
1. It directly answers whether a refund is approved or denied.
2. The tone is polite and professional.
3. It does not promise anything not stated in the order facts (no made-up dates).

Start your answer with PASS or FAIL, then give one sentence explaining why.
"""

Two things to get right here, and they are the things beginners skip:

Use a strong model as the judge, ideally a different or more capable one than the agent being tested. A weak judge gives you noise. In practice people often run a top model like Claude Opus as the grader.
Check the judge against yourself. Before you trust the judge, hand-label 20 outputs as pass/fail yourself, then see if the judge agrees with you. If the judge says PASS on things you would fail, fix the rubric before you rely on it. Hamel calls this keeping a “mini-evaluation system” for your judge. The judge is just another thing that needs evaluating.
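Checking the judge against yourself can be a few lines of code. The sketch below assumes the judge's reply starts with PASS or FAIL, as the prompt asks. The helper names parse_verdict and agreement_rate are hypothetical, and the judge outputs and hand labels are made up purely for illustration.

```python
def parse_verdict(judge_output: str) -> bool:
    """Map the judge's reply to True (PASS) or False (FAIL) based on its first word."""
    words = judge_output.strip().split()
    first_word = words[0].strip(".,:;").upper() if words else ""
    if first_word == "PASS":
        return True
    if first_word == "FAIL":
        return False
    raise ValueError(f"judge did not start with PASS or FAIL: {judge_output!r}")


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the judge and the human label match."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Made-up judge outputs and hand labels, for illustration only.
judge_outputs = [
    "PASS. The reply clearly approves the refund and is polite.",
    "FAIL: it promises a date that is not in the order data.",
    "PASS - direct and professional.",
    "PASS, the tone is fine.",
]
human_labels = [True, False, True, False]

judge_labels = [parse_verdict(text) for text in judge_outputs]
disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]
print(f"agreement: {agreement_rate(judge_labels, human_labels):.0%}, disagreements at {disagreements}")
# prints "agreement: 75%, disagreements at [3]"
```

The disagreement indexes are the useful part: read those examples, then decide whether the rubric or your own label was wrong.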

A good habit from current practice: write your rubric as a checklist of explicit conditions (“all of these must be true”) rather than a vague “is this good?” Vague rubrics give you flaky grades, which puts you right back where you started.
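One way to make the checklist idea mechanical is to ask the judge for a verdict per condition and combine them in code. This is only a sketch under my own assumptions: the JSON answer format, the names build_checklist_prompt and overall_pass, and the two-item rubric are illustrative, and a real judge may need firmer instructions to return clean JSON.

```python
import json

RUBRIC = [
    "It directly answers whether a refund is approved or denied.",
    "The tone is polite and professional.",
]


def build_checklist_prompt(reply: str, conditions: list[str]) -> str:
    """Build a judge prompt that asks for one true/false verdict per condition."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(conditions, start=1))
    return (
        "You are grading a customer support reply.\n\n"
        f"Reply to grade:\n---\n{reply}\n---\n\n"
        "For each condition, decide whether it is true:\n"
        f"{numbered}\n\n"
        'Answer with JSON only, for example {"1": true, "2": false}.'
    )


def overall_pass(judge_json: str, conditions: list[str]) -> bool:
    """PASS only if every condition is explicitly true; a missing verdict counts as a fail."""
    verdicts = json.loads(judge_json)
    return all(verdicts.get(str(i)) is True for i in range(1, len(conditions) + 1))


prompt = build_checklist_prompt("Good news, order 1234 qualifies for a refund.", RUBRIC)
print(prompt)
print(overall_pass('{"1": true, "2": false}', RUBRIC))  # prints "False"
```

Per-condition verdicts also tell you which rule failed, which a single PASS or FAIL cannot.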

Level 3: A/B testing (only once you ship)

The third level is the real world. You release a change to a slice of actual users and measure what they do: did they resolve their issue, did they escalate to a human, did they come back angry. This is the most expensive and slowest signal, so you only reach for it after big changes, not on every commit. For most people reading this, Levels 1 and 2 are where 90% of the value is. Do not skip ahead to A/B tests before you have a single assertion written.
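When you do get to an A/B test, the question becomes whether a difference between the two groups is bigger than chance. A common tool for comparing two rates is a two-proportion z-test. The sketch below implements it with the standard library only. The function name and the escalation counts are made up for illustration, and this is a simplification that assumes reasonably large, independent samples.

```python
import math


def two_proportion_z_test(events_a: int, n_a: int, events_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates.

    events_* are counts of the outcome (for example, escalations to a human),
    n_* are the number of users in each arm.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the outcome, the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 120 of 1000 control users escalated, 90 of 1000 with the new prompt.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A negative z here means the new version had a lower escalation rate. How small a p-value you require before shipping is a judgment call for your team.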

What about testing the agent’s steps, not just its answer?

One thing that is more relevant for agents than for plain chatbots: agents take actions. They call tools, search a database, hit an API. A correct final answer that took a dumb path is still a problem (it might be slow, expensive, or lucky).

So you can also assert on the trajectory, the sequence of tool calls. If you have built with tools or MCP, you most likely already have these calls logged.

def test_agent_looks_up_order_before_deciding():
    result = run_agent("Refund on order 1234?")
    tools_used = [step["tool"] for step in result["trace"]]

    # It must actually check the order, not just guess
    assert "lookup_order" in tools_used
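    # It should not email the customer before a decision is made.
    # Guard the check: list.index() raises ValueError if no email was sent.
    if "send_email" in tools_used:
        assert tools_used.index("lookup_order") < tools_used.index("send_email")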

Same idea as before. You’re not checking the exact words. You’re checking that the right things happened in a sensible order. Properties, again.
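If you find yourself writing several of these ordering checks, a small helper keeps them readable. This is a sketch, not a standard utility: follows_order is a name I made up, check_policy is a hypothetical tool, and the trace format (a list of dicts with a "tool" key) is assumed to match the example above.

```python
def follows_order(tools_used: list[str], expected_order: list[str]) -> bool:
    """True if every tool in expected_order appears in tools_used in that relative order.

    Other tool calls may happen in between. A missing expected tool returns False.
    """
    remaining = iter(tools_used)
    # `tool in remaining` advances the iterator until it finds a match,
    # so each later tool must appear after the previous one.
    return all(tool in remaining for tool in expected_order)


# Hypothetical trace in the same shape as result["trace"].
trace = [
    {"tool": "lookup_order"},
    {"tool": "check_policy"},
    {"tool": "send_email"},
]
tools_used = [step["tool"] for step in trace]

print(follows_order(tools_used, ["lookup_order", "send_email"]))  # prints "True"
print(follows_order(["send_email", "lookup_order"], ["lookup_order", "send_email"]))  # prints "False"
```

Unlike the guarded index check, this helper treats a missing tool as a failure, so use it only for steps that must always happen.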

A dead-simple plan to start this week

You do not need a platform or a budget. Here is the smallest version that is still real:

Make your agent return JSON with the fields you care about.
Write 10 test cases by hand, with the expected decision or behavior for each.
Write 5 to 10 plain assert checks for the hard properties (valid values, no leaks, right IDs).
Loop over your cases and print a pass rate.
Add a single LLM-as-a-judge check for one fuzzy property like tone, and sanity-check the judge against your own labels.
Re-run the whole thing every time you change a prompt or swap a model, and watch the number.

That’s it. That’s an eval suite. When the score moves, you’ll know whether your “improvement” actually improved anything, which is the whole reason evals exist. If you are building agents that run on their own, this measurement piece is what makes agentic loops safe to trust in the first place.

The non-determinism never goes away. You just stop testing the thing that varies (the words) and start testing the things that must not vary (the facts, the safety, the behavior). Once you make that switch, testing an AI agent feels a lot like testing anything else.

AI Agent Evals FAQ

If the AI output changes every time, how can an eval pass consistently?

Because a good eval does not check the exact text. It checks properties that stay true regardless of wording, like whether a decision field equals "approve" or whether the right order ID appears. The phrasing varies, the property does not.

Do I need a special tool or framework to write evals?

No. Your first eval suite can be a plain pytest file or a for loop that runs your agent over a list of test cases and counts how many satisfy your assertions. Tools help at scale, but you should start with plain code.

What is LLM-as-a-judge?

It is using a second, capable model to grade an output against a rubric you write, for qualities you cannot check with simple code, like tone or helpfulness. Always validate the judge against your own hand-labeled examples before trusting it.

What pass rate is good enough?

There is no universal number. The point is to track your pass rate over time and catch regressions. A score that drops after a change is the signal that you broke something, even if it is still high.