Quality & Observability

AgentFlow ships a full QA stack: model-free unit testing, LLM evaluation with 10 built-in criteria, LLM-as-user simulation, and observability hooks that feed execution data to external monitoring systems.

Unit Testing — model-free, fast

Unit tests run without any LLM calls. TestAgent replays predefined responses in order, so your graph logic, routing, and tool wiring are exercised at full speed.

TestAgent

Drop TestAgent into any node to replace the real Agent:

from agentflow.qa.testing import TestAgent
from agentflow.core.state import Message

test_agent = TestAgent(responses=[
    Message.text_message("I'll check the weather.", role="assistant"),
    Message.text_message("It's 22°C and sunny.",    role="assistant"),
])

compiled.override_node("MAIN", test_agent)

TestAgent returns the responses in order. Once the list is exhausted, it keeps returning the last entry.
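
The replay behaviour described above can be pictured with a small stand-alone sketch. This is not TestAgent's implementation: `ReplayQueue` and `next_response` are hypothetical names, and rejecting an empty list is an assumption of the sketch.

```python
class ReplayQueue:
    """Illustrative stand-in: returns items in order, then repeats the last one."""

    def __init__(self, responses):
        if not responses:
            # Assumption of this sketch; TestAgent may treat an empty list differently.
            raise ValueError("at least one response is required")
        self._responses = list(responses)
        self._index = 0

    def next_response(self):
        # Clamp the index so calls past the end keep returning the final entry.
        item = self._responses[min(self._index, len(self._responses) - 1)]
        self._index += 1
        return item


queue = ReplayQueue(["I'll check the weather.", "It's 22°C and sunny."])
replies = [queue.next_response() for _ in range(3)]
# The second and third replies are both the last canned response.
```

In practice this means a test that takes more turns than you scripted does not fail on the extra turn; it sees the final response again.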

QuickTest

One-liner helpers for common scenarios:

from agentflow.qa.testing import QuickTest, TestAgent
from agentflow.core.state import Message

# single-turn: send one message, assert on the response
result = await QuickTest.single_turn(
    compiled,
    user_message="What is 2+2?",
    agent=TestAgent(responses=[Message.text_message("4", role="assistant")]),
)
result.assert_text_contains("4")

# multi-turn: list of user messages, returns list of TestResult
results = await QuickTest.multi_turn(
    compiled,
    messages=["Hello", "What's my name?"],
    agent=test_agent,
)

# tool-call scenario: assert the agent requested a specific tool
result = await QuickTest.tool_call(
    compiled,
    user_message="Search for AgentFlow docs",
    agent=test_agent,
)
result.assert_tool_called("search")

MockToolRegistry

from agentflow.qa.testing import MockToolRegistry

registry = MockToolRegistry()
registry.register("search", returns={"results": ["doc1", "doc2"]})
registry.register("calculator", returns={"value": 42})

# ... run graph with registry attached ...

registry.assert_called("search", times=1)
registry.assert_called_with("calculator", expression="6*7")

TestResult — fluent assertions

result.assert_text_contains("Paris")
result.assert_text_not_contains("error")
result.assert_tool_called("search")
result.assert_tool_called_with("search", query="capital of France")
result.assert_message_count(3)
result.assert_no_errors()

Running tests

# via CLI (reads agentflow.json for config)
agentflow test
agentflow test --coverage

# via pytest directly
cd your-project
pytest tests/

Evaluation — quality measurement

Evaluation runs your agent against a set of cases, scores each response against one or more criteria, and collects the results in a report.

Defining cases

from agentflow.qa.evaluation import EvalCase, EvalSet, EvalSetBuilder

# build manually
eval_set = EvalSet(cases=[
    EvalCase(
        query="What is the capital of France?",
        expected_response="Paris",
        expected_tools=["search"],
    ),
    EvalCase(
        query="What is 6 × 7?",
        expected_response="42",
    ),
])

# or with the builder
eval_set = (
    EvalSetBuilder()
    .add("What is the capital of France?", expected="Paris", tools=["search"])
    .add("What is 6 × 7?", expected="42")
    .build()
)

Running evaluation

from agentflow.qa.evaluation import AgentEvaluator, EvalConfig, EvalPresets

evaluator = AgentEvaluator(
    compiled_graph=compiled,
    config=EvalConfig(
        criteria=EvalPresets.response_quality(),   # built-in preset
        reporters=["html", "json"],
        output_dir="eval-results/",
        threshold=0.8,          # fail if mean score < 0.8
    ),
)

report = await evaluator.arun(eval_set)
print(report.mean_score)

Or with QuickEval for one-liners:

from agentflow.qa.evaluation import QuickEval

report = await QuickEval.run(compiled, eval_set, threshold=0.75)
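
The threshold comparison itself is simple arithmetic. A minimal sketch, assuming the mean score is the plain average of per-case scores (the document does not say how AgentFlow aggregates them); `passes_threshold` and the sample scores are hypothetical:

```python
from statistics import mean


def passes_threshold(case_scores, threshold):
    """Return True when the mean of the per-case scores meets the threshold."""
    return mean(case_scores) >= threshold


scores = [1.0, 0.9, 0.6]  # hypothetical per-case scores
passed = passes_threshold(scores, 0.8)      # mean is about 0.83, so this passes
failed = passes_threshold([0.7, 0.8], 0.8)  # mean is 0.75, so this fails
```

One low-scoring case can be masked by strong ones, so it is worth reading the per-case results as well as the mean.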

EvalPresets

Preset	Criteria included
`EvalPresets.tool_usage()`	`tool_name_match_score`, `tool_trajectory_avg_score`
`EvalPresets.response_quality()`	`response_match`, `factual_accuracy`, `hallucination`
`EvalPresets.quick_check()`	`rouge_match`, `response_match`
`EvalPresets.comprehensive()`	All 10 criteria

The 10 built-in criteria

No-LLM criteria — deterministic, no extra API cost:

Criterion	What it measures
`tool_name_match_score`	Whether the expected tools were called
`tool_trajectory_avg_score`	Average score across the full tool call sequence
`node_order`	Whether nodes fired in the expected order
`rouge_match`	N-gram overlap between response and expected text
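
To make "n-gram overlap" concrete, here is a ROUGE-1-style score over unigrams. It is only a sketch of the general idea: the function name is hypothetical, tokenization is a plain whitespace split, and the `rouge_match` criterion may use a different ROUGE variant or tokenizer.

```python
from collections import Counter


def unigram_overlap_f1(response: str, expected: str) -> float:
    """ROUGE-1-style F1 over lowercase, whitespace-separated tokens."""
    resp = Counter(response.lower().split())
    exp = Counter(expected.lower().split())
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((resp & exp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


same_words = unigram_overlap_f1(
    "The capital of France is Paris", "Paris is the capital of France"
)
no_overlap = unigram_overlap_f1("Berlin", "Paris")
```

Unigram overlap ignores word order, so the first pair scores as a perfect match. It also cannot tell a paraphrase from a wrong answer, which is where the LLM-judge criteria come in.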

LLM-judge criteria — semantic, requires a judge model:

Criterion	What it measures
`response_match`	Semantic equivalence to expected response
`llm_judge`	Custom rubric evaluated by an LLM
`factual_accuracy`	Factual correctness against a reference
`hallucination`	Presence of fabricated facts
`safety`	Harmful or policy-violating content
`simulation_goals`	Goal achievement across a simulated conversation

Custom criterion

from agentflow.qa.evaluation.criteria import BaseCriterion

class MyLengthCriterion(BaseCriterion):
    async def score(self, trajectory, response) -> float:
        words = len(response.text.split())
        return 1.0 if 50 <= words <= 200 else 0.0

Reports

Reporter	Output
`ConsoleReporter`	Printed table; good for local runs
`HTMLReporter`	Interactive dashboard with per-case drill-down
`JSONReporter`	Machine-readable; pipe into CI or dashboards
`JUnitXMLReporter`	JUnit XML; compatible with GitHub Actions, Jenkins

Custom reporter — extend BaseReporter:

from agentflow.qa.evaluation.reporters import BaseReporter

class SlackReporter(BaseReporter):
    async def generate(self, report, output_dir):
        # post_to_slack is your own helper, not part of AgentFlow
        await post_to_slack(f"Mean score: {report.mean_score:.2f}")

CLI

agentflow eval                          # discovers *_eval.py / eval_*.py
agentflow eval evals/qa_eval.py         # specific file
agentflow eval --parallel               # concurrent case execution
agentflow eval --max-concurrency 4
agentflow eval --threshold 0.8          # exit 1 if below threshold
agentflow eval --open                   # open HTML report in browser
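
The two discovery patterns can be checked against your own file names with the standard library. This sketch only illustrates the glob patterns named above; whether the CLI searches subdirectories or applies other filters is not covered here, and `is_eval_file` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

EVAL_PATTERNS = ("*_eval.py", "eval_*.py")


def is_eval_file(filename: str) -> bool:
    """True if the file name matches either discovery pattern (case-sensitive)."""
    return any(fnmatchcase(filename, pattern) for pattern in EVAL_PATTERNS)


candidates = ["qa_eval.py", "eval_support.py", "helpers.py", "evaluate.py"]
discovered = [name for name in candidates if is_eval_file(name)]
```

Note that `evaluate.py` matches neither pattern, because the underscore is part of both.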

User Simulation — LLM as user

UserSimulator drives a multi-turn conversation with your agent using an LLM to generate realistic user messages. Define goals; the simulator checks whether each goal is achieved across the full transcript.

ConversationScenario

from agentflow.qa.evaluation import ConversationScenario

scenario = ConversationScenario(
    scenario_id="support-order-late",
    starting_prompt="My order hasn't arrived and it's been two weeks.",
    conversation_plan=[
        "Ask about the order status",
        "Provide order number when asked",
        "Request a refund if not resolved",
    ],
    goals=[
        "Agent acknowledges the delay",
        "Agent offers a resolution (refund or reship)",
        "Conversation ends politely",
    ],
    max_turns=8,
)

UserSimulator

from agentflow.qa.evaluation import UserSimulator, UserSimulatorConfig

simulator = UserSimulator(
    compiled_graph=compiled,
    config=UserSimulatorConfig(
        simulator_model="gpt-4o",
        thread_id_prefix="sim-run-",
    ),
)

result = await simulator.run(scenario)
print(result.goals_achieved)    # list of goal strings that passed
print(result.score)             # fraction of goals achieved
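
As a worked example of "fraction of goals achieved", the arithmetic looks like this. The helper is hypothetical, and returning 0.0 for a scenario with no goals is an assumption of the sketch.

```python
def goal_score(goals, goals_achieved):
    """Fraction of the scenario's goals that were judged achieved."""
    if not goals:
        return 0.0  # assumption: a scenario without goals scores zero
    achieved = set(goals_achieved)
    return sum(goal in achieved for goal in goals) / len(goals)


goals = [
    "Agent acknowledges the delay",
    "Agent offers a resolution (refund or reship)",
    "Conversation ends politely",
]
# Two of the three goals passed, so the score is 2/3.
score = goal_score(goals, goals[:2])
```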

BatchSimulator — concurrent scenarios

from agentflow.qa.evaluation import BatchSimulator

batch = BatchSimulator(compiled_graph=compiled, max_concurrency=4)
results = await batch.run([scenario1, scenario2, scenario3])

Each scenario runs under its own thread_id, and at most max_concurrency scenarios run at the same time.
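
Bounded concurrency of this kind is commonly built on `asyncio.Semaphore`. The sketch below shows that pattern in isolation; it is not BatchSimulator's code, and `run_batch`, `run_one`, and the index-based thread_id scheme are hypothetical.

```python
import asyncio


async def run_batch(scenario_ids, run_one, max_concurrency=4, prefix="sim-run-"):
    """Call run_one(scenario_id, thread_id) for each scenario, at most
    max_concurrency at a time, and return results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(index, scenario_id):
        thread_id = f"{prefix}{index}"  # one distinct thread_id per scenario
        async with semaphore:
            return await run_one(scenario_id, thread_id)

    return await asyncio.gather(
        *(guarded(i, s) for i, s in enumerate(scenario_ids))
    )


async def _demo():
    active = 0
    peak = 0

    async def fake_run(scenario_id, thread_id):
        # Stand-in for a real simulation; tracks how many runs overlap.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return scenario_id, thread_id

    results = await run_batch(["a", "b", "c", "d", "e"], fake_run, max_concurrency=2)
    return results, peak


results, peak = asyncio.run(_demo())
```

Because every run has its own thread_id, concurrent scenarios do not share conversation state, which is what makes running them in parallel safe.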

SimulationGoalsCriterion

Use SimulationGoalsCriterion when you want to include simulation results in a standard EvalReport. It scores goal achievement across the full transcript — not just the last message.

from agentflow.qa.evaluation import EvalConfig
from agentflow.qa.evaluation.criteria import SimulationGoalsCriterion

# Give this criterion its own EvalConfig: it scores the full transcript,
# so it cannot share a criteria list with per-message criteria
config = EvalConfig(
    criteria=[SimulationGoalsCriterion(simulator_model="gpt-4o-mini")],
)

Note: SimulationGoalsCriterion must not be mixed with per-message criteria in the same EvalConfig. Run it in a dedicated evaluation pass.

Exposing scenarios to the CLI

Define a module-level list named SCENARIOS, or a get_scenarios() function, in your eval file:

# evals/simulation_eval.py
SCENARIOS = [scenario1, scenario2]

# or build the list when the file is loaded
def get_scenarios():
    return load_scenarios_from_db()   # your own loader, not part of AgentFlow

agentflow eval evals/simulation_eval.py
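
A loader that honours this convention could look roughly like the following. It is a sketch under assumptions: `load_scenarios` is a hypothetical name, and preferring SCENARIOS over get_scenarios() when both exist is a choice made here, not documented CLI behaviour.

```python
import types


def load_scenarios(module):
    """Prefer a SCENARIOS list; otherwise call get_scenarios() if defined."""
    scenarios = getattr(module, "SCENARIOS", None)
    if scenarios is not None:
        return list(scenarios)
    factory = getattr(module, "get_scenarios", None)
    if callable(factory):
        return list(factory())
    return []


# Two stand-in "eval files", built in memory for the example.
static_module = types.ModuleType("static_eval")
static_module.SCENARIOS = ["scenario1", "scenario2"]

dynamic_module = types.ModuleType("dynamic_eval")
dynamic_module.get_scenarios = lambda: ["scenario3"]

static_result = load_scenarios(static_module)
dynamic_result = load_scenarios(dynamic_module)
```

The function form is the better fit when scenarios come from a database or are generated, since nothing is loaded until the file is actually evaluated.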

Wiring into observability

Eval and testing tell you whether your agent is correct in controlled conditions. For live production monitoring, two extension points feed execution data to external systems:

GraphLifecycleHook — fires on graph start/end, each state update, checkpoint, interrupt, and error. Use it to open OpenTelemetry spans, record Datadog/Prometheus metrics, or redact PII before state is persisted. Full details in Agents, Tools & Control.
BasePublisher — emits an EventModel on every execution event (node start, LLM call, tool call, state update, completion) to Kafka, Redis, RabbitMQ, or a custom backend. Full details in Serving Agents.
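
As an illustration of the PII redaction use case, the function below masks email addresses in a state dictionary. It is a stand-alone sketch: the flat dict state shape and the `redact_emails` name are assumptions, and how to call it from a GraphLifecycleHook is left to the guide referenced above.

```python
import re

# Deliberately simple pattern; real PII detection usually needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(state: dict) -> dict:
    """Return a copy of a flat state dict with emails masked in string values."""
    return {
        key: EMAIL_RE.sub("[REDACTED_EMAIL]", value) if isinstance(value, str) else value
        for key, value in state.items()
    }


state = {"last_message": "Reach me at jane.doe@example.com", "turn": 3}
clean = redact_emails(state)
```

Returning a copy rather than mutating in place keeps the live state intact for the rest of the run while only the persisted version is masked.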

Go deeper

Guide	Link
Write your first eval	How-To: Evaluation
Custom evaluation criteria	Extensibility
Lifecycle hooks and callbacks	Agents, Tools & Control
Publisher backends	Serving Agents
