Testing and evaluation

AgentFlow has two complementary quality layers:

  • Testing verifies graph mechanics, state updates, tool calls, and error handling without live models.
  • Evaluation measures behavior quality, trajectory matching, safety, and regression cases.

Testing helpers

Use agentflow.qa.testing for fast, deterministic tests.

| Helper | Purpose |
| --- | --- |
| TestAgent | Model-free agent double that returns predefined responses or tool calls. |
| QuickTest | Quick single-turn, multi-turn, tool, and graph tests. |
| TestResult | Fluent assertions for text, tool calls, messages, and errors. |
| TestContext | Context manager for creating graphs, agents, stores, and mocks. |
| MockToolRegistry | Register sync or async mock tools and assert calls. |
| MockMCPClient | Mock MCP tool listing and calls. |
| InMemoryStore | Deterministic memory store for tests. |

Prefer these helpers over live provider calls in unit tests.
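The test-double pattern behind these helpers can be shown in a few lines. This is an illustrative, self-contained sketch, not the real `agentflow.qa.testing` API: the `TestAgent` and `TestResult` classes below are hypothetical stand-ins that only demonstrate scripted responses and fluent assertions.

```python
from dataclasses import dataclass, field

@dataclass
class TestResult:
    # Hypothetical stand-in for a fluent assertion object.
    text: str
    tool_calls: list = field(default_factory=list)

    def assert_contains(self, fragment: str) -> "TestResult":
        assert fragment in self.text, f"{fragment!r} not in {self.text!r}"
        return self  # returning self lets assertions chain

    def assert_tool_called(self, name: str) -> "TestResult":
        assert name in self.tool_calls, f"tool {name!r} was not called"
        return self

class TestAgent:
    """Model-free double: returns scripted responses instead of calling a live model."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def run(self, prompt: str) -> TestResult:
        step = next(self._responses)
        return TestResult(text=step.get("text", ""),
                          tool_calls=step.get("tool_calls", []))

# Deterministic: no network, no model, same result on every run.
agent = TestAgent([{"text": "The weather is sunny.",
                    "tool_calls": ["get_weather"]}])
result = agent.run("What's the weather?")
result.assert_contains("sunny").assert_tool_called("get_weather")
```

Because the responses are scripted, such a test is fast and repeatable, which is the point of keeping unit tests model-free.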

Evaluation framework

Use agentflow.qa.evaluation for datasets, criteria, and reports.

Core pieces include:

| Area | Examples |
| --- | --- |
| Datasets | EvalCase, EvalSet, EvalSetBuilder |
| Runners | AgentEvaluator, QuickEval, run_eval |
| Criteria | Response, trajectory, tool, safety, factual accuracy, hallucination, rubric, LLM judge |
| Reports | Console, JSON, HTML, JUnit XML |

Evaluation is the better fit for behavior quality, safety policy, and workflow regressions that are hard to express as ordinary unit assertions.
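The dataset/criteria/runner split can be sketched as follows. This is a minimal illustration of the idea, not the real `agentflow.qa.evaluation` API: the `EvalCase` and `run_eval` names are borrowed from the table above but their shapes here are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    # Hypothetical dataset entry: a prompt plus what a good answer mentions.
    prompt: str
    expected: str

def contains_expected(output: str, case: EvalCase) -> bool:
    """A minimal response criterion: the expected text appears in the output."""
    return case.expected.lower() in output.lower()

def run_eval(agent: Callable[[str], str],
             cases: list[EvalCase],
             criterion: Callable[[str, EvalCase], bool]) -> float:
    """Run every case through the agent and return the pass rate."""
    passed = sum(criterion(agent(c.prompt), c) for c in cases)
    return passed / len(cases)

# A stub "agent" keeps the example deterministic.
def stub_agent(prompt: str) -> str:
    return "Paris is the capital of France."

cases = [EvalCase("Capital of France?", "Paris"),
         EvalCase("Capital of Spain?", "Madrid")]
rate = run_eval(stub_agent, cases, contains_expected)  # 0.5: one of two passes
```

Real criteria (safety, rubric, LLM judge) plug into the same slot as `contains_expected`, which is why eval results aggregate naturally into reports.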

Trajectory checks

Trajectory evaluation records node and tool execution, then compares it with expected paths. Use this when the answer text is less important than whether the graph used the right route, tool, or handoff.
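The record-then-compare idea can be sketched like this. The recorder and toy graph below are illustrative assumptions, not the real agentflow trajectory API; the point is that the assertion targets the path, not the answer text.

```python
class TrajectoryRecorder:
    # Hypothetical callback that records each node as it executes.
    def __init__(self):
        self.path: list[str] = []

    def on_node(self, name: str) -> None:
        self.path.append(name)

def run_graph(recorder: TrajectoryRecorder, use_tool: bool) -> str:
    # A toy two-branch graph: router -> (search_tool | nothing) -> respond.
    recorder.on_node("router")
    if use_tool:
        recorder.on_node("search_tool")
    recorder.on_node("respond")
    return "done"

recorder = TrajectoryRecorder()
run_graph(recorder, use_tool=True)

# The check cares about the route taken, not the output string.
expected = ["router", "search_tool", "respond"]
assert recorder.path == expected
```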

Rules

| Rule | Why it matters |
| --- | --- |
| Keep unit tests model-free | Fast tests should be deterministic. |
| Isolate live provider tests | They are slower, cost money, and can be flaky. |
| Use evals for behavior quality | Evals can compare outputs, paths, safety, and rubrics. |
| Compile once when collecting trajectories | Callback state should not be lost between cases. |
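The compile-once rule can be sketched as building the graph and attaching its callback a single time, then resetting only per-case state between cases. Everything below (`compile_graph`, `reset`) is a hypothetical illustration of the pattern, not the real agentflow API.

```python
class TrajectoryRecorder:
    # Hypothetical callback object holding per-case state.
    def __init__(self):
        self.path: list[str] = []

    def on_node(self, name: str) -> None:
        self.path.append(name)

    def reset(self) -> None:
        self.path.clear()

def compile_graph(recorder: TrajectoryRecorder):
    """Pretend compilation: returns a runnable wired to the recorder."""
    def run(use_tool: bool) -> None:
        recorder.on_node("router")
        if use_tool:
            recorder.on_node("tool")
        recorder.on_node("respond")
    return run

recorder = TrajectoryRecorder()
graph = compile_graph(recorder)      # compiled once, callback attached once

trajectories = []
for use_tool in (True, False):       # many eval cases, one compiled graph
    recorder.reset()                 # clear per-case state, keep the wiring
    graph(use_tool)
    trajectories.append(list(recorder.path))
```

Recompiling inside the loop would discard the recorder wiring along with its state; resetting keeps the callback attached while isolating cases from each other.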