Quality & Observability

AgentFlow ships a full QA stack: model-free unit testing, LLM evaluation with 10 built-in criteria, LLM-as-user simulation, and observability hooks that wire into any external monitoring system.


Unit Testing — model-free, fast

Unit tests run without any LLM calls. TestAgent returns predefined responses in order, so your graph logic, routing, and tool wiring are exercised at full speed.

TestAgent

Drop TestAgent into any node to replace the real Agent:

from agentflow.qa.testing import TestAgent
from agentflow.core.state import Message

test_agent = TestAgent(responses=[
    Message.text_message("I'll check the weather.", role="assistant"),
    Message.text_message("It's 22°C and sunny.", role="assistant"),
])

compiled.override_node("MAIN", test_agent)

TestAgent returns the responses in order. Once the list is exhausted, it keeps returning the last entry.
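The ordering rule described above can be illustrated without AgentFlow. The `ScriptedResponder` class below is a hypothetical stand-in, not part of the library; it is only a sketch of "return responses in order, then repeat the last one".

```python
class ScriptedResponder:
    """Hypothetical stand-in that mimics the ordering described for TestAgent."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("at least one response is required")
        self._responses = list(responses)
        self._index = 0

    def next_response(self):
        # Clamp the index so calls past the end keep returning the last entry.
        position = min(self._index, len(self._responses) - 1)
        self._index += 1
        return self._responses[position]


responder = ScriptedResponder(["I'll check the weather.", "It's 22°C and sunny."])
replies = [responder.next_response() for _ in range(3)]
```

A practical consequence: a test that drives more turns than it scripted will not fail on exhaustion, so asserting on the message count is a useful extra check.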

QuickTest

One-liner helpers for common scenarios:

from agentflow.qa.testing import QuickTest, TestAgent
from agentflow.core.state import Message

# single-turn: send one message, assert on the response
result = await QuickTest.single_turn(
    compiled,
    user_message="What is 2+2?",
    agent=TestAgent(responses=[Message.text_message("4", role="assistant")]),
)
result.assert_text_contains("4")

# multi-turn: list of user messages, returns list of TestResult
results = await QuickTest.multi_turn(
    compiled,
    messages=["Hello", "What's my name?"],
    agent=test_agent,
)

# tool-call scenario: assert the agent requested a specific tool
# (the scripted responses must include a request for the "search" tool for this to pass)
result = await QuickTest.tool_call(
    compiled,
    user_message="Search for AgentFlow docs",
    agent=test_agent,
)
result.assert_tool_called("search")

MockToolRegistry

Register mock tool implementations and assert they were called:

from agentflow.qa.testing import MockToolRegistry

registry = MockToolRegistry()
registry.register("search", returns={"results": ["doc1", "doc2"]})
registry.register("calculator", returns={"value": 42})

# ... run graph with registry attached ...

registry.assert_called("search", times=1)
registry.assert_called_with("calculator", expression="6*7")
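For readers who know the standard library's `unittest.mock`, the register-then-assert pattern above is analogous to the following sketch. It uses only `unittest.mock` and makes no claim about how MockToolRegistry is implemented; the tool names and the `expression` argument are copied from the example above, while the `query` argument is an illustrative assumption.

```python
from unittest.mock import Mock

# Stand-ins for the two mocked tools, each returning a canned value.
search = Mock(return_value={"results": ["doc1", "doc2"]})
calculator = Mock(return_value={"value": 42})

# Simulate what a graph run would do: call each tool once.
search_output = search(query="AgentFlow docs")
calc_output = calculator(expression="6*7")

# Equivalent checks: how many times a tool ran, and with which arguments.
assert search.call_count == 1
calculator.assert_called_with(expression="6*7")
```

The design idea is the same in both cases: the mock records every call, so the test can assert on behavior after the run instead of patching assertions into the tool itself.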

TestResult — fluent assertions

result.assert_text_contains("Paris")
result.assert_text_not_contains("error")
result.assert_tool_called("search")
result.assert_tool_called_with("search", query="capital of France")
result.assert_message_count(3)
result.assert_no_errors()

Running tests

# via CLI (reads agentflow.json for config)
agentflow test
agentflow test --coverage

# via pytest directly
cd your-project
pytest tests/

Evaluation — quality measurement

Evaluation runs your agent against a set of cases and scores each response against one or more criteria. Results go to a report.

Defining cases

from agentflow.qa.evaluation import EvalCase, EvalSet, EvalSetBuilder

# build manually
eval_set = EvalSet(cases=[
    EvalCase(
        query="What is the capital of France?",
        expected_response="Paris",
        expected_tools=["search"],
    ),
    EvalCase(
        query="What is 6 × 7?",
        expected_response="42",
    ),
])

# or with the builder
eval_set = (
    EvalSetBuilder()
    .add("What is the capital of France?", expected="Paris", tools=["search"])
    .add("What is 6 × 7?", expected="42")
    .build()
)

Running evaluation

from agentflow.qa.evaluation import AgentEvaluator, EvalConfig, EvalPresets

evaluator = AgentEvaluator(
    compiled_graph=compiled,
    config=EvalConfig(
        criteria=EvalPresets.response_quality(),  # built-in preset
        reporters=["html", "json"],
        output_dir="eval-results/",
        threshold=0.8,  # fail if mean score < 0.8
    ),
)

report = await evaluator.arun(eval_set)
print(report.mean_score)

Or with QuickEval for one-liners:

from agentflow.qa.evaluation import QuickEval

report = await QuickEval.run(compiled, eval_set, threshold=0.75)
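The threshold in the examples above is described as failing the run when the mean score falls below it. A minimal sketch of that gate, assuming the mean is taken over a flat list of per-case scores (the source does not say how AgentFlow aggregates across criteria, and `passes_threshold` is a hypothetical helper):

```python
from statistics import mean


def passes_threshold(scores, threshold):
    """Return True when the mean score meets the threshold, False otherwise."""
    return mean(scores) >= threshold


# Illustrative per-case scores, not real evaluation output.
passing_run = passes_threshold([0.9, 0.7, 0.85], threshold=0.8)
failing_run = passes_threshold([0.9, 0.6, 0.85], threshold=0.8)
```

Note that a single weak case can hide behind a good mean, so it is worth reading the per-case scores in the report as well as the aggregate.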

EvalPresets

| Preset | Criteria included |
| --- | --- |
| EvalPresets.tool_usage() | tool_name_match_score, tool_trajectory_avg_score |
| EvalPresets.response_quality() | response_match, factual_accuracy, hallucination |
| EvalPresets.quick_check() | rouge_match, response_match |
| EvalPresets.comprehensive() | All 10 criteria |

The 10 built-in criteria

No-LLM criteria — deterministic, no extra API cost:

| Criterion | What it measures |
| --- | --- |
| tool_name_match_score | Whether the expected tools were called |
| tool_trajectory_avg_score | Average score across the full tool call sequence |
| node_order | Whether nodes fired in the expected order |
| rouge_match | N-gram overlap between response and expected text |
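To make "N-gram overlap" concrete, here is a small ROUGE-1 style sketch that scores unigram overlap as an F1 value. It is an illustration of the general technique only; the tokenization (lowercased whitespace split) and the choice of F1 are assumptions, and AgentFlow's rouge_match may use different n-gram sizes or normalization.

```python
from collections import Counter


def unigram_f1(response: str, expected: str) -> float:
    """ROUGE-1 style F1 over lowercased, whitespace-separated tokens."""
    resp_tokens = response.lower().split()
    exp_tokens = expected.lower().split()
    if not resp_tokens or not exp_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(resp_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


same_words = unigram_f1("The capital of France is Paris", "Paris is the capital of France")
partial = unigram_f1("Paris", "Paris is the capital")
```

Because unigram overlap ignores word order, the reordered sentence scores as a perfect match. That is why overlap metrics are cheap and deterministic but cannot replace the semantic LLM-judge criteria.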

LLM-judge criteria — semantic, requires a judge model:

| Criterion | What it measures |
| --- | --- |
| response_match | Semantic equivalence to expected response |
| llm_judge | Custom rubric evaluated by an LLM |
| factual_accuracy | Factual correctness against a reference |
| hallucination | Presence of fabricated facts |
| safety | Harmful or policy-violating content |
| simulation_goals | Goal achievement across a simulated conversation |

Custom criterion

from agentflow.qa.evaluation.criteria import BaseCriterion

class MyLengthCriterion(BaseCriterion):
    async def score(self, trajectory, response) -> float:
        words = len(response.text.split())
        return 1.0 if 50 <= words <= 200 else 0.0
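The scoring rule above can be checked in isolation before it is plugged into an evaluation run. The sketch below copies the rule into a plain class and feeds it a hypothetical `FakeResponse` stand-in that exposes only `.text`; neither name is part of AgentFlow, and the real `trajectory` argument is replaced by `None` because the rule does not use it.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing only the .text attribute the rule reads."""
    text: str


class LengthCriterion:
    """Same scoring rule as MyLengthCriterion, without the AgentFlow base class."""

    async def score(self, trajectory, response) -> float:
        words = len(response.text.split())
        return 1.0 if 50 <= words <= 200 else 0.0


async def main():
    criterion = LengthCriterion()
    too_short = await criterion.score(None, FakeResponse("too short"))
    in_range = await criterion.score(None, FakeResponse("word " * 50))
    too_long = await criterion.score(None, FakeResponse("word " * 201))
    return too_short, in_range, too_long


scores = asyncio.run(main())
```

Testing the boundaries (50 and 201 words here) catches off-by-one mistakes in the range check, which are easy to miss once the criterion is buried inside a report.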

Reports

| Reporter | Output |
| --- | --- |
| ConsoleReporter | Printed table; good for local runs |
| HTMLReporter | Interactive dashboard with per-case drill-down |
| JSONReporter | Machine-readable; pipe into CI or dashboards |
| JUnitXMLReporter | JUnit XML; compatible with GitHub Actions, Jenkins |
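For orientation, the sketch below shows what a JUnit-style report for evaluation cases might look like, built with the standard library's `xml.etree.ElementTree`. The suite name, the case names, and the rule "a case fails when its score is below the threshold" are assumptions for illustration; the exact layout JUnitXMLReporter emits is not specified here.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(case_scores, threshold):
    """Build a JUnit-style <testsuite> string from {case_name: score} pairs."""
    failed = {name for name, score in case_scores.items() if score < threshold}
    suite = ET.Element(
        "testsuite",
        name="agent-eval",
        tests=str(len(case_scores)),
        failures=str(len(failed)),
    )
    for name, score in case_scores.items():
        case = ET.SubElement(suite, "testcase", name=name)
        if name in failed:
            # CI systems treat a <failure> child as a failed test case.
            failure = ET.SubElement(case, "failure")
            failure.set("message", f"score {score:.2f} is below {threshold:.2f}")
    return ET.tostring(suite, encoding="unicode")


xml_text = to_junit_xml(
    {"capital-of-france": 0.95, "six-times-seven": 0.40},
    threshold=0.8,
)
```

Mapping each evaluation case to a `testcase` element is what lets CI dashboards show per-case pass/fail results alongside ordinary unit tests.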

Custom reporter — extend BaseReporter:

from agentflow.qa.evaluation.reporters import BaseReporter

class SlackReporter(BaseReporter):
    async def generate(self, report, output_dir):
        # post_to_slack is your own helper; it is not provided by AgentFlow
        await post_to_slack(f"Mean score: {report.mean_score:.2f}")

CLI

agentflow eval                            # discovers *_eval.py / eval_*.py
agentflow eval evals/qa_eval.py           # specific file
agentflow eval --parallel                 # concurrent case execution
agentflow eval --max-concurrency 4
agentflow eval --threshold 0.8            # exit 1 if below threshold
agentflow eval --open                     # open HTML report in browser

User Simulation — LLM as user

UserSimulator drives a multi-turn conversation with your agent using an LLM to generate realistic user messages. Define goals; the simulator checks whether each goal is achieved across the full transcript.

ConversationScenario

from agentflow.qa.evaluation import ConversationScenario

scenario = ConversationScenario(
    scenario_id="support-order-late",
    starting_prompt="My order hasn't arrived and it's been two weeks.",
    conversation_plan=[
        "Ask about the order status",
        "Provide order number when asked",
        "Request a refund if not resolved",
    ],
    goals=[
        "Agent acknowledges the delay",
        "Agent offers a resolution (refund or reship)",
        "Conversation ends politely",
    ],
    max_turns=8,
)

UserSimulator

from agentflow.qa.evaluation import UserSimulator, UserSimulatorConfig

simulator = UserSimulator(
    compiled_graph=compiled,
    config=UserSimulatorConfig(
        simulator_model="gpt-4o",
        thread_id_prefix="sim-run-",
    ),
)

result = await simulator.run(scenario)
print(result.goals_achieved) # list of goal strings that passed
print(result.score) # fraction of goals achieved
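The relationship between the two printed values, as the comments describe it, is that the score is the fraction of scenario goals that were achieved. A minimal sketch of that arithmetic, using a hypothetical `goal_score` helper and the goal strings from the scenario above (returning 0.0 for an empty goal list is an assumption):

```python
def goal_score(goals, goals_achieved):
    """Fraction of scenario goals that were achieved (0.0 when there are no goals)."""
    if not goals:
        return 0.0
    achieved = set(goals_achieved)
    return sum(1 for goal in goals if goal in achieved) / len(goals)


goals = [
    "Agent acknowledges the delay",
    "Agent offers a resolution (refund or reship)",
    "Conversation ends politely",
]
# Pretend the first two goals passed and the last one did not.
score = goal_score(goals, goals[:2])
```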

BatchSimulator — concurrent scenarios

from agentflow.qa.evaluation import BatchSimulator

batch = BatchSimulator(compiled_graph=compiled, max_concurrency=4)
results = await batch.run([scenario1, scenario2, scenario3])

Each scenario runs under its own isolated thread_id, and at most max_concurrency scenarios run at the same time.
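Bounded concurrency with per-scenario isolation is commonly built on `asyncio.Semaphore`. The sketch below shows that general pattern; `run_batch`, `fake_run`, and the thread_id format are hypothetical, and nothing here is taken from BatchSimulator's actual implementation.

```python
import asyncio


async def run_batch(scenario_ids, run_one, max_concurrency=4, thread_id_prefix="sim-run-"):
    """Run scenarios concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(index, scenario_id):
        # Each scenario gets its own thread_id so conversation state is not shared.
        thread_id = f"{thread_id_prefix}{index}-{scenario_id}"
        async with semaphore:
            return await run_one(scenario_id, thread_id)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(i, s) for i, s in enumerate(scenario_ids)))


# Track how many fake scenarios are in flight to confirm the limit holds.
stats = {"active": 0, "peak": 0}


async def fake_run(scenario_id, thread_id):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # stands in for a simulated conversation
    stats["active"] -= 1
    return thread_id


thread_ids = asyncio.run(
    run_batch(["a", "b", "c", "d", "e", "f"], fake_run, max_concurrency=2)
)
```

The semaphore caps in-flight work without serializing it, which matters when each scenario makes several LLM calls and the provider enforces rate limits.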

SimulationGoalsCriterion

Use SimulationGoalsCriterion when you want to include simulation results in a standard EvalReport. It scores goal achievement across the full transcript — not just the last message.

from agentflow.qa.evaluation import EvalConfig
from agentflow.qa.evaluation.criteria import SimulationGoalsCriterion

# Use a dedicated EvalConfig for this criterion: it scores the full transcript,
# so it cannot share a config with per-message criteria.
config = EvalConfig(
    criteria=[SimulationGoalsCriterion(simulator_model="gpt-4o-mini")],
)

Note: SimulationGoalsCriterion must not be mixed with per-message criteria in the same EvalConfig. Run it in a dedicated evaluation pass.

Exposing scenarios to the CLI

Name a list SCENARIOS or define get_scenarios() in your eval file:

# evals/simulation_eval.py
SCENARIOS = [scenario1, scenario2]

# or
def get_scenarios():
    return load_scenarios_from_db()

agentflow eval evals/simulation_eval.py
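The lookup convention above (a SCENARIOS list or a get_scenarios() function) can be sketched as a small resolver. This is an illustration of the convention, not the CLI's actual loader: `resolve_scenarios` is hypothetical, and preferring SCENARIOS when both names exist is an assumption the source does not confirm.

```python
import types


def resolve_scenarios(module):
    """Return scenarios from a module: SCENARIOS first, else get_scenarios()."""
    scenarios = getattr(module, "SCENARIOS", None)
    if scenarios is not None:
        return list(scenarios)
    factory = getattr(module, "get_scenarios", None)
    if callable(factory):
        return list(factory())
    raise LookupError(f"{module.__name__} defines neither SCENARIOS nor get_scenarios()")


# Build throwaway modules in memory instead of reading eval files from disk.
list_module = types.ModuleType("list_eval")
list_module.SCENARIOS = ["scenario1", "scenario2"]

func_module = types.ModuleType("func_eval")
func_module.get_scenarios = lambda: ["scenario3"]

found = resolve_scenarios(list_module) + resolve_scenarios(func_module)
```

The function form is useful when scenarios come from a database or are generated at run time, since nothing is loaded until the resolver calls it.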

Wiring into observability

Eval and testing tell you whether your agent is correct in controlled conditions. For live production monitoring, two extension points feed execution data to external systems:

  • GraphLifecycleHook — fires on graph start/end, each state update, checkpoint, interrupt, and error. Use it to open OpenTelemetry spans, record Datadog/Prometheus metrics, or redact PII before state is persisted. Full details in Agents, Tools & Control.
  • BasePublisher — emits an EventModel on every execution event (node start, LLM call, tool call, state update, completion) to Kafka, Redis, RabbitMQ, or a custom backend. Full details in Serving Agents.

Go deeper

| Guide | Link |
| --- | --- |
| Write your first eval | How-To: Evaluation |
| Custom evaluation criteria | Extensibility |
| Lifecycle hooks and callbacks | Agents, Tools & Control |
| Publisher backends | Serving Agents |