
Run evaluations

The agentflow eval command discovers evaluation files in your project, runs all cases under a single async event loop, and writes an HTML and a JSON report. Reports are on by default, so no flags are required.

Prerequisites

Your project must have been initialised with agentflow init. Eval files live in the evals/ directory, which is generated when you choose the Production setup during agentflow init.

Quick start

From the folder that contains agentflow.json:

agentflow eval

This scans evals/ for files matching *_eval.py or eval_*.py, collects every case from every file into a flat pool, runs them, and writes reports to eval_reports/:

eval_reports/
  weather-agent-regression_20260513_142301.html
  weather-agent-regression_20260513_142301.json
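The file-name matching described above can be sketched in a few lines of Python. This is only an illustration of the two patterns, not the CLI's actual discovery code: the helper names are hypothetical, and scanning subdirectories recursively is an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# The two patterns named in the text above.
EVAL_PATTERNS = ("*_eval.py", "eval_*.py")


def is_eval_file(name: str) -> bool:
    """Return True if a file name matches one of the eval patterns."""
    return any(fnmatchcase(name, pattern) for pattern in EVAL_PATTERNS)


def discover_eval_files(root: Path) -> list[Path]:
    """Hypothetical helper: collect matching files under root, sorted by path.

    Recursing into subdirectories is an assumption made for this sketch.
    """
    return sorted(p for p in root.rglob("*.py") if is_eval_file(p.name))
```

Under these patterns, weather_agents_eval.py and eval_booking.py match, while helpers.py and confeval.py do not.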

Run a specific file or directory

# One file
agentflow eval evals/weather_agents_eval.py

# A subdirectory
agentflow eval evals/regression/

When a file is given, only that file runs. When a directory is given, all matching files are discovered. Results from all files are merged into a single combined report.

Run in parallel

By default all cases run sequentially. Pass --parallel to run them concurrently:

agentflow eval --parallel
agentflow eval --parallel --max-concurrency 8

How it works: all cases from all files are collected first into a single flat pool. One asyncio event loop runs the entire pool under a single semaphore capped at --max-concurrency. Cases complete out of order — that is expected.

[  1/50] weather_agents_eval.py::weather_london      PASSED   1.23s
[  3/50] booking_eval.py::book_flight_london         PASSED   2.10s
[  2/50] weather_agents_eval.py::weather_new_york    PASSED   0.98s
...
[ 50/50] ...

Results: 47/50 passed (94.0%)
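The flat-pool model can be illustrated with plain asyncio. The sketch below is not agentflow's implementation: run_case is a stand-in that sleeps instead of calling an agent, and run_pool is a hypothetical name. It shows one event loop, one semaphore that caps concurrency, and results printed in completion order.

```python
import asyncio
import random


async def run_case(case_id: str) -> tuple[str, bool]:
    """Stand-in for running one eval case; sleeps for a random short time."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return case_id, True


async def run_pool(case_ids: list[str], max_concurrency: int = 4) -> list[tuple[str, bool]]:
    """Run every case in one pool, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case_id: str) -> tuple[str, bool]:
        # The semaphore is the only thing limiting how many cases run at once.
        async with semaphore:
            return await run_case(case_id)

    tasks = [asyncio.create_task(guarded(case_id)) for case_id in case_ids]
    results: list[tuple[str, bool]] = []
    # as_completed yields tasks as they finish, so the order is not the input order.
    for finished in asyncio.as_completed(tasks):
        case_id, passed = await finished
        results.append((case_id, passed))
        status = "PASSED" if passed else "FAILED"
        print(f"[{len(results):>3}/{len(case_ids)}] {case_id} {status}")
    return results
```

Calling asyncio.run(run_pool(ids, max_concurrency=8)) mirrors the idea behind --parallel --max-concurrency 8: the counter in the printed prefix here reflects completion order, which is why lines can appear out of sequence.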

You can also enable parallel by default in agentflow.json (see Configure defaults).

Reports

Every run produces two files in eval_reports/ (or the directory set by --output):

| File | Contents |
| --- | --- |
| <eval-id>_<timestamp>.html | Visual dashboard: summary cards, criterion bars, per-case details |
| <eval-id>_<timestamp>.json | Machine-readable results for CI tooling or custom analysis |

Console output is always printed as cases complete. The HTML/JSON files are written after all cases finish.
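For CI tooling that needs the most recent JSON report, the file name alone is enough to pick it. The sketch below assumes the timestamp has the form YYYYMMDD_HHMMSS, inferred from the example file names on this page; the helper names are hypothetical, and the layout of the JSON itself is not covered here.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed name shape: <eval-id>_<YYYYMMDD>_<HHMMSS>.json (inferred from the examples).
REPORT_NAME = re.compile(r"^(?P<eval_id>.+)_(?P<stamp>\d{8}_\d{6})\.json$")


def parse_report_name(name: str) -> Optional[tuple[str, datetime]]:
    """Split a report file name into (eval_id, timestamp), or None if it does not match."""
    match = REPORT_NAME.match(name)
    if match is None:
        return None
    return match["eval_id"], datetime.strptime(match["stamp"], "%Y%m%d_%H%M%S")


def latest_report(report_dir: Path, eval_id: str) -> Optional[Path]:
    """Return the newest JSON report for eval_id in report_dir, or None if there is none."""
    candidates = []
    for path in report_dir.glob("*.json"):
        parsed = parse_report_name(path.name)
        if parsed is not None and parsed[0] == eval_id:
            candidates.append((parsed[1], path))
    return max(candidates)[1] if candidates else None
```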

Open the report automatically

agentflow eval --open

Skip file output

agentflow eval --no-report

Only console output is produced. Useful for fast local feedback.

Set a pass-rate threshold

agentflow eval --threshold 0.8

The command exits with a non-zero code if the overall pass rate is below the threshold. Useful in CI to gate merges on eval quality.
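The rule stated here is simple arithmetic, sketched below with hypothetical function names. Treating an empty run as a 0.0 pass rate is an assumption of the sketch, not documented behaviour.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of cases that passed; 0.0 for an empty run (an assumption)."""
    return passed / total if total else 0.0


def exit_code(passed: int, total: int, threshold: float) -> int:
    """Return 0 when the pass rate meets the threshold, 1 when it is below it."""
    return 0 if pass_rate(passed, total) >= threshold else 1
```

With the sample run shown earlier (47 of 50 passed, a rate of 0.94), --threshold 0.8 would give a zero exit code and --threshold 0.95 a non-zero one.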

Write reports to a custom directory

agentflow eval --output ci/reports

Configure defaults in agentflow.json

Add an evaluation section to agentflow.json to set project-level defaults. CLI flags always take precedence.

{
  "agent": "graph.agent:app",
  "evaluation": {
    "directory": "evals",
    "output_dir": "eval_reports",
    "threshold": 0.75,
    "parallel": false,
    "max_concurrency": 4,
    "timestamp_files": true
  }
}

| Field | Description |
| --- | --- |
| directory | Directory scanned when no TARGET argument is given |
| output_dir | Directory where report files are written |
| threshold | Minimum pass rate required for a zero exit code |
| parallel | Run all cases from all files in a flat parallel pool |
| max_concurrency | Maximum cases running at once when parallel is true |
| timestamp_files | Append timestamp to filenames so runs do not overwrite each other |
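How CLI flags take precedence over the evaluation section can be pictured as a dictionary merge. This is a minimal sketch, assuming that a flag left unset on the command line is represented as None; resolve_settings is a hypothetical name, not part of agentflow.

```python
import json


def resolve_settings(file_config: dict, cli_flags: dict) -> dict:
    """Start from the "evaluation" section, then apply only the flags that were passed."""
    resolved = dict(file_config.get("evaluation", {}))
    resolved.update({key: value for key, value in cli_flags.items() if value is not None})
    return resolved


# Values copied from the example agentflow.json above.
config = json.loads(
    '{"evaluation": {"threshold": 0.75, "parallel": false, "max_concurrency": 4}}'
)

# Simulates: agentflow eval --threshold 0.8 --max-concurrency 8
settings = resolve_settings(
    config, {"threshold": 0.8, "parallel": None, "max_concurrency": 8}
)
```

Here settings keeps parallel from the file (false) and takes threshold and max_concurrency from the flags.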

Enforce threshold in CI

# .github/workflows/ci.yml
- name: Run evaluations
  run: agentflow eval --parallel

Set threshold in the evaluation section of agentflow.json. If the pass rate drops below it, the command exits with a non-zero code and the step fails; no extra flags are needed.


Eval file protocols

An eval file is any *_eval.py or eval_*.py file. The CLI auto-detects which protocol you are using. Pick the one that fits your use case.

Summary

| Protocol | When to use |
| --- | --- |
| get_eval_set() | Standard: fixed prompt/response pairs |
| get_eval_config() / EVAL_CONFIG | Override criteria per file |
| EvalPresets | Recommended: one-line preset configs |
| Annotated functions -> EvalSet | Pytest-style discovery, multiple sets per file |
| get_scenarios() / SCENARIOS | User simulator: dynamic multi-turn conversations |
| confeval.py | Global criteria applied to all files that have no per-file config |

get_eval_set() — minimum required

When a file defines only get_eval_set(), the CLI loads the agent from agentflow.json, applies the default criteria (a 0.6 threshold on each), runs the evaluation, and writes reports. You only define the cases.

# evals/weather_agents_eval.py
from agentflow.qa.evaluation import EvalSet, EvalSetBuilder


def get_eval_set() -> EvalSet:
    return (
        EvalSetBuilder(name="weather-agent-regression")
        .add_tool_test(
            query="What is the weather in London?",
            tool_name="get_weather",
            tool_args={"location": "London"},
            expected_response="London",
            case_id="weather_london",
        )
        .add_tool_test(
            query="What is the weather in Tokyo?",
            tool_name="get_weather",
            tool_args={"location": "Tokyo"},
            expected_response="Tokyo",
            case_id="weather_tokyo",
        )
        .build()
    )

Default criteria (applied automatically when no criteria are configured anywhere):

| Criterion | Threshold | Match type |
| --- | --- | --- |
| response_match | 0.6 | ANY_ORDER |
| tool_name_match_score | 0.6 | ANY_ORDER |
| node_order | 0.6 | IN_ORDER |

get_eval_config() — per-file criteria with EvalPresets

Add this function when you want to specify which criteria to run and what thresholds to use. The recommended approach is EvalPresets — one-line preset configs covering the most common patterns.

from agentflow.qa.evaluation import EvalConfig, EvalSet, EvalSetBuilder
from agentflow.qa.evaluation.config.presets import EvalPresets


def get_eval_config() -> EvalConfig:
    return EvalPresets.tool_usage(threshold=0.6)


def get_eval_set() -> EvalSet:
    return (
        EvalSetBuilder(name="weather-agent-regression")
        .add_tool_test(
            query="What is the weather in London?",
            tool_name="get_weather",
            tool_args={"location": "London"},
            expected_response="London",
            case_id="weather_london",
        )
        .build()
    )

Available presets:

| Preset | What it checks |
| --- | --- |
| EvalPresets.response_quality(threshold) | LLM judge on response accuracy |
| EvalPresets.tool_usage(threshold) | Tool calls correct + response quality |
| EvalPresets.conversation_flow(threshold) | Multi-turn conversation evaluation |
| EvalPresets.comprehensive(threshold) | All of the above combined |
| EvalPresets.quick_check(threshold) | Fast ROUGE-based check, no LLM cost |

You can also combine presets:

from agentflow.qa.evaluation.config.presets import EvalPresets


def get_eval_config():
    return EvalPresets.combine(
        EvalPresets.tool_usage(threshold=0.7),
        EvalPresets.response_quality(threshold=0.6),
    )

EVAL_CONFIG — constant instead of function

Same effect as get_eval_config() but as a module-level constant. Useful when the config is static.

from agentflow.qa.evaluation.config.presets import EvalPresets

EVAL_CONFIG = EvalPresets.tool_usage(threshold=0.6)

confeval.py — global eval config

Place a file named exactly confeval.py in your project root (next to agentflow.json) to set a global default EvalConfig that applies to every eval file that does not define its own get_eval_config() or EVAL_CONFIG. If a file does provide its own config, that takes precedence and confeval.py is ignored for that file.

The file must expose either a module-level EVAL_CONFIG variable or a callable get_eval_config() that returns an EvalConfig.

# confeval.py  (project root, next to agentflow.json)
from agentflow.qa.evaluation import CriteriaConfig, CriterionConfig, EvalConfig


EVAL_CONFIG = EvalConfig(
    criteria=CriteriaConfig(
        tool_name_match=CriterionConfig.tool_name_match(threshold=1.0),
        # response_match=CriterionConfig.response_match(threshold=0.8),
        # hallucinations=CriterionConfig.hallucination(threshold=0.8),
        rouge_match=CriterionConfig.rouge_match(threshold=0.5),
    )
)

Or as a function:

# confeval.py
from agentflow.qa.evaluation import CriteriaConfig, CriterionConfig, EvalConfig


def get_eval_config() -> EvalConfig:
    return EvalConfig(
        criteria=CriteriaConfig(
            tool_name_match=CriterionConfig.tool_name_match(threshold=1.0),
            rouge_match=CriterionConfig.rouge_match(threshold=0.5),
        )
    )

If confeval.py is absent and a file has no per-file config, the built-in defaults apply (all criteria at 0.6 threshold).
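The fallback chain for criteria (per-file config, then confeval.py, then built-in defaults) can be sketched as a first-match lookup. Everything below is illustrative: the function names are hypothetical, BUILT_IN_DEFAULTS is a placeholder rather than a real EvalConfig, and checking EVAL_CONFIG before get_eval_config() when a module defines both is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Placeholder for the built-in defaults (all criteria at a 0.6 threshold).
BUILT_IN_DEFAULTS = {"default_threshold": 0.6}


def config_from(module: Optional[Any]) -> Optional[Any]:
    """Read EVAL_CONFIG, or call get_eval_config(), if the module provides one."""
    if module is None:
        return None
    if hasattr(module, "EVAL_CONFIG"):
        return module.EVAL_CONFIG
    getter = getattr(module, "get_eval_config", None)
    return getter() if callable(getter) else None


def resolve_eval_config(eval_module: Any, confeval_module: Optional[Any] = None) -> Any:
    """Per-file config wins, then confeval.py, then the built-in defaults."""
    for source in (eval_module, confeval_module):
        config = config_from(source)
        if config is not None:
            return config
    return BUILT_IN_DEFAULTS


# Stand-ins for imported modules; strings take the place of real config objects.
file_with_config = SimpleNamespace(EVAL_CONFIG="per-file config")
file_without_config = SimpleNamespace()
confeval = SimpleNamespace(get_eval_config=lambda: "global config")
```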


Annotated functions -> EvalSet — pytest-style discovery

Any module-level function with return type -> EvalSet is auto-discovered as an eval set. Useful when you want multiple named eval sets in one file.

from agentflow.qa.evaluation import EvalSet, EvalSetBuilder
from agentflow.qa.evaluation.config.presets import EvalPresets


def get_eval_config():
    return EvalPresets.tool_usage(threshold=0.6)


def weather_cases() -> EvalSet:
    return EvalSetBuilder(name="weather").add_tool_test(...).build()


def booking_cases() -> EvalSet:
    return EvalSetBuilder(name="booking").add_tool_test(...).build()

Both weather_cases and booking_cases are discovered and run. Their results appear as separate eval sets in the report.
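Discovery by return annotation can be illustrated with the standard inspect and typing modules. This is a sketch of the general technique, not agentflow's code: EvalSet here is a local stand-in class, load_module builds an in-memory module in place of importing a file, and all helper names are hypothetical.

```python
import inspect
import types
import typing


class EvalSet:
    """Stand-in for agentflow's EvalSet so the sketch runs without the library."""

    def __init__(self, name: str) -> None:
        self.name = name


def load_module(name: str, source: str) -> types.ModuleType:
    """Build an in-memory module from source text (stands in for importing a file)."""
    module = types.ModuleType(name)
    module.EvalSet = EvalSet  # make the annotation resolvable inside the source
    exec(source, module.__dict__)
    return module


def find_eval_set_functions(module: types.ModuleType) -> list[str]:
    """Names of functions defined in the module whose return annotation is EvalSet."""
    names = []
    for name, func in inspect.getmembers(module, inspect.isfunction):
        if func.__module__ != module.__name__:
            continue  # skip functions imported from other modules
        if typing.get_type_hints(func).get("return") is EvalSet:
            names.append(name)
    return names


DEMO_SOURCE = '''
def weather_cases() -> EvalSet:
    return EvalSet("weather")

def booking_cases() -> EvalSet:
    return EvalSet("booking")

def helper() -> str:
    return "not an eval set"
'''
```

Calling find_eval_set_functions(load_module("demo_eval", DEMO_SOURCE)) returns the two annotated functions and leaves out helper, since inspect.getmembers lists members alphabetically and only the EvalSet annotation qualifies.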


get_scenarios() — user simulator

Use this protocol when you want an LLM to drive a dynamic multi-turn conversation against your agent rather than using fixed prompt/response pairs.

You only define the scenarios. The CLI handles running the simulator, scoring goal achievement, and writing the report — identical to regular eval cases.

# evals/user_simulator_eval.py
from agentflow.qa.evaluation import ConversationScenario, UserSimulatorConfig


# Optional: override simulator model and settings for this file.
# If omitted, the CLI uses UserSimulatorConfig defaults (gemini-2.5-flash).
SIMULATOR_CONFIG = UserSimulatorConfig(
    model="gemini/gemini-2.5-flash",
    max_invocations=8,
    temperature=0.7,
)


def get_scenarios() -> list[ConversationScenario]:
    return [
        ConversationScenario(
            scenario_id="weather_travel_planning",
            description="User planning a trip wants weather info and packing advice",
            starting_prompt="Hi! I'm planning a trip to Paris this weekend.",
            conversation_plan=(
                "1. Ask about current weather in Paris\n"
                "2. Ask whether to bring a jacket\n"
                "3. Ask about outdoor sightseeing timing"
            ),
            goals=[
                "User receives weather information for Paris",
                "User gets clothing or packing advice",
                "User learns about outdoor activity timing",
            ],
            max_turns=8,
        ),
        ConversationScenario(
            scenario_id="flight_booking",
            description="User wants help finding a flight",
            starting_prompt="I need to fly from London to New York next Friday.",
            goals=[
                "User receives flight options",
                "User gets pricing information",
            ],
            max_turns=10,
        ),
    ]

How it works:

  1. The CLI detects get_scenarios() (or a SCENARIOS constant) and switches to simulator mode for that file.
  2. Each ConversationScenario becomes one eval case.
  3. The simulator drives up to max_turns turns, generating contextual user messages after each agent response.
  4. SimulationGoalsCriterion uses an LLM judge to score how many goals were achieved across the full conversation.
  5. Pass/fail and the report are produced the same way as regular eval cases.
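The steps above can be pictured as a small loop. The sketch below is a simplified model, not the real simulator: Scenario is a cut-down stand-in for ConversationScenario, and the agent, next_user_message, and judge_goal callables are hypothetical plug-in points where the real system would call the agent and LLMs. Scoring as the fraction of goals achieved is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Transcript = list[tuple[str, str]]  # (user message, agent reply) pairs


@dataclass
class Scenario:
    """Simplified stand-in for ConversationScenario."""

    scenario_id: str
    starting_prompt: str
    goals: list[str]
    max_turns: int = 10


def simulate(
    scenario: Scenario,
    agent: Callable[[str], str],
    next_user_message: Callable[[Transcript], Optional[str]],
    judge_goal: Callable[[str, Transcript], bool],
) -> tuple[float, Transcript]:
    """Drive up to max_turns exchanges, then score the fraction of goals achieved."""
    transcript: Transcript = []
    user_message: Optional[str] = scenario.starting_prompt
    for _ in range(scenario.max_turns):
        reply = agent(user_message)
        transcript.append((user_message, reply))
        # The simulated user reacts to the conversation so far; None ends it early.
        user_message = next_user_message(transcript)
        if user_message is None:
            break
    achieved = sum(judge_goal(goal, transcript) for goal in scenario.goals)
    return achieved / len(scenario.goals), transcript


# Deterministic stubs so the sketch runs without any LLM.
def echo_agent(message: str) -> str:
    return f"Answer about: {message}"


def scripted_user(transcript: Transcript) -> Optional[str]:
    follow_ups = ["Should I bring a jacket?"]
    index = len(transcript) - 1
    return follow_ups[index] if index < len(follow_ups) else None


def keyword_judge(goal: str, transcript: Transcript) -> bool:
    return any(goal in reply.lower() for _, reply in transcript)
```

With these stubs, a scenario whose goals are the keywords "weather" and "jacket" scores 1.0 after two turns, and 0.5 if max_turns is cut to 1, which is the same effect the max_turns advice in the troubleshooting section describes.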

ConversationScenario fields:

| Field | Required | Description |
| --- | --- | --- |
| scenario_id | Yes | Unique ID for the scenario (appears in the report) |
| description | No | Human-readable name shown in the report |
| starting_prompt | Yes | First user message to kick off the conversation |
| conversation_plan | No | Hints to the simulator about how to progress |
| goals | Yes | List of outcomes the user wants to achieve |
| max_turns | No | Maximum conversation turns (default: 10) |

SIMULATOR_CONFIG fields:

| Field | Default | Description |
| --- | --- | --- |
| model | gemini/gemini-2.5-flash | LLM used to generate user messages |
| max_invocations | 10 | Maximum turns per scenario |
| temperature | 0.7 | Temperature for user message generation |

Config priority

When multiple sources configure the same setting, this priority applies (highest first):

1. CLI flags          (--parallel, --max-concurrency, --threshold, --output)
2. agentflow.json "evaluation" section
3. Per-file config get_eval_config() / EVAL_CONFIG (inside each eval file)
4. confeval.py get_eval_config() / EVAL_CONFIG (project-root global fallback)
5. Built-in defaults (all criteria at 0.6 threshold)

Common scenarios

Fast local check, single file, open report:

agentflow eval evals/weather_agents_eval.py --open

Parallel run with 8 concurrent cases:

agentflow eval --parallel --max-concurrency 8

Strict CI gate at 80% pass rate:

{
  "evaluation": {
    "threshold": 0.8,
    "parallel": true,
    "max_concurrency": 8
  }
}

agentflow eval

Run only a regression suite in a subdirectory:

agentflow eval evals/regression/ --output reports/regression

Mix regular evals and user simulator in the same run:

evals/
  weather_agents_eval.py   ← get_eval_set() protocol
  user_simulator_eval.py   ← get_scenarios() protocol

agentflow eval --parallel

Both files are discovered, cases and scenarios are collected into the same flat pool, and results appear in a single merged report.


Common issues

"Eval directory 'evals/' not found"

  • Create an evals/ directory or pass a path explicitly: agentflow eval path/to/evals
  • Run agentflow init --prod to scaffold the standard project layout.

"No eval cases found"

  • Eval files must expose get_eval_set(), get_scenarios(), SCENARIOS, or functions annotated -> EvalSet.
  • File must match *_eval.py or eval_*.py. Rename it or pass it explicitly.

File skipped with warning

  • The file does not expose any recognised entry point. Add get_eval_set() or get_scenarios().

Exit code 1 even when all cases pass

  • Check if a threshold is set in agentflow.json or passed via --threshold. The exit code is 1 when the pass rate is below threshold or when any case fails.

Simulator scenarios always fail

  • Ensure the agent is reachable: either expose app in the eval file or set "agent" in agentflow.json.
  • Check that goals are specific enough for the LLM judge to verify. Vague goals like "have a conversation" will not score well.
  • Increase max_turns if the agent needs more exchanges to satisfy all goals.