Evaluation Reports
After every evaluation run, AgentFlow generates structured reports so you can understand what passed, what failed, and why. The console, JSON, and HTML reports are generated by default with no flags required; JUnit XML is opt-in.
Report types
| Format | File | Best for |
|---|---|---|
| HTML | eval_reports/<id>_<timestamp>.html | Human review — visual dashboard |
| JSON | eval_reports/<id>_<timestamp>.json | Programmatic analysis, dashboards |
| JUnit XML | eval_reports/<id>_<timestamp>.xml | CI systems (GitHub Actions test summary) |
| Console | stdout | Quick local feedback during development |
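Because file reports are written to one directory and carry a timestamp in their names, a post-processing script usually needs to find the newest one first. The following is a minimal sketch of that lookup, not part of AgentFlow: the helper name latest_report is hypothetical, and it picks the file by modification time so that it does not depend on the exact timestamp format.

```python
from pathlib import Path
from typing import Optional


def latest_report(output_dir: str, suffix: str = ".json") -> Optional[Path]:
    """Return the most recently modified report with the given suffix, or None.

    Hypothetical helper: it only assumes that reports live directly inside
    output_dir and end with the suffix shown in the table above.
    """
    candidates = [p for p in Path(output_dir).glob(f"*{suffix}") if p.is_file()]
    if not candidates:
        return None
    # Modification time avoids relying on how the timestamp is spelled in the name.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, latest_report("eval_reports", ".html") would return the newest HTML dashboard, or None if the directory holds no HTML files.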
HTML report
The HTML report is a self-contained dashboard that opens in any browser. It contains:
- Summary header — eval set name, total cases, passed/failed count, overall pass rate
- Criterion bar chart — a breakdown of scores for each active criterion across all cases
- Per-case detail — expandable sections showing the query, actual response, expected response, and score for each criterion per case
Open it manually:
open eval_reports/weather-agent-regression_20260519_142301.html
Or open automatically after the run:
agentflow eval --open
JSON report
The JSON report contains every piece of data from the run in a machine-readable format. You can consume it in post-processing scripts, ingest it into a metrics dashboard, or archive it for trend analysis.
Structure (simplified):
{
  "eval_set_id": "weather-agent",
  "eval_set_name": "Weather Agent Regression",
  "summary": {
    "total_cases": 5,
    "passed_cases": 4,
    "failed_cases": 1,
    "pass_rate": 0.8,
    "criteria_scores": {
      "tool_trajectory_avg_score": 1.0,
      "rouge_match": 0.72
    }
  },
  "results": [
    {
      "case_id": "london_weather",
      "passed": true,
      "criteria_results": {
        "tool_trajectory_avg_score": {"score": 1.0, "passed": true},
        "rouge_match": {"score": 0.65, "passed": true}
      },
      "actual_response": "The weather in London is sunny.",
      "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]
    }
  ]
}
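As an illustration of post-processing, the sketch below reads a JSON report and lists, for each failed case, the criteria that did not pass. It relies only on the keys shown in the simplified structure above (results, case_id, passed, criteria_results); the function names load_report and failed_criteria are hypothetical, and the real file may contain additional fields.

```python
import json
from pathlib import Path
from typing import Any


def load_report(path: str) -> dict[str, Any]:
    """Read a JSON report file into a plain dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def failed_criteria(report: dict[str, Any]) -> dict[str, list[str]]:
    """Map each failed case_id to the names of the criteria that did not pass."""
    failures: dict[str, list[str]] = {}
    for result in report.get("results", []):
        if result.get("passed", False):
            continue  # only failed cases are of interest here
        criteria = result.get("criteria_results", {})
        failures[result["case_id"]] = [
            name for name, cr in criteria.items() if not cr.get("passed", False)
        ]
    return failures
```

A script would typically call failed_criteria(load_report(path)) with the path of a report in eval_reports and print or log the resulting mapping.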
JUnit XML report
JUnit XML output lets CI systems (GitHub Actions, Jenkins, GitLab CI) display eval results as structured test results — with pass/fail summaries, per-case details, and failure messages in the test tab.
Enable it via ReporterConfig:
from agentflow.qa.evaluation import ReporterConfig, ReporterManager
manager = ReporterManager(
    ReporterConfig(
        output_dir="eval_reports",
        junit_xml=True,
        html=True,
        json_report=True,
    )
)
manager.run_all(report)
When you use the agentflow eval command, JUnit XML is off by default. To turn it on, enable it programmatically in your run() function; the XML file is only written once it is enabled.
GitHub Actions example using the JUnit XML output:
- name: Run evaluations
  run: agentflow eval
- name: Publish eval results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: eval_reports/*.xml
EvalReport object
When running programmatically, AgentEvaluator.evaluate() returns an EvalReport:
from agentflow.qa.evaluation import AgentEvaluator, EvalPresets
evaluator = AgentEvaluator(app, collector, config=EvalPresets.tool_usage())
report = await evaluator.evaluate(eval_set)
Summary
summary = report.summary
summary.total_cases # int — total number of cases
summary.passed_cases # int — cases where all criteria passed
summary.failed_cases # int — cases where at least one criterion failed
summary.pass_rate # float 0.0–1.0
summary.criteria_scores # dict[str, float] — average score per criterion
Per-case results
for result in report.results:
    print(result.case_id, "PASS" if result.passed else "FAIL")
    for criterion_key, cr in result.criteria_results.items():
        print(f"  {criterion_key}: {cr.score:.2f} ({'PASS' if cr.passed else 'FAIL'})")
Print to console
from agentflow.qa.evaluation import print_report
print_report(report)
ReporterConfig reference
ReporterConfig controls all automatic reporting. Pass it to ReporterManager when generating reports manually.
from agentflow.qa.evaluation import ReporterConfig
config = ReporterConfig(
    enabled=True,
    output_dir="eval_reports",
    console=True,
    json_report=True,
    html=True,
    junit_xml=False,
    verbose=True,
    include_details=True,
    include_trajectory=True,
    include_node_responses=True,
    include_actual_response=True,
    include_tool_call_details=True,
    timestamp_files=True,
)
| Field | Default | Description |
|---|---|---|
| enabled | True | Master switch; when False, no reporters run |
| output_dir | eval_reports | Directory for generated files |
| console | True | Print summary to stdout |
| json_report | True | Write JSON file |
| html | True | Write HTML file |
| junit_xml | False | Write JUnit XML file |
| verbose | True | Show all cases; when False, show only failures |
| include_details | True | Include per-criterion details in file reports |
| include_trajectory | True | Include tool call trajectory in JSON |
| include_node_responses | True | Include per-node intermediate data |
| include_actual_response | True | Include the agent's final response |
| include_tool_call_details | True | Include tool arguments and results |
| timestamp_files | True | Append timestamp so runs do not overwrite each other |
Running reporters manually
from agentflow.qa.evaluation import ReporterManager, ReporterConfig
manager = ReporterManager(
    ReporterConfig(
        output_dir="my_reports",
        html=True,
        json_report=True,
        console=False,
        timestamp_files=True,
    )
)
output = manager.run_all(report)
if output.html_path:
    print(f"HTML: {output.html_path}")
if output.json_path:
    print(f"JSON: {output.json_path}")
if output.has_errors:
    for name, err in output.errors:
        print(f"Reporter error [{name}]: {err}")
Combined reports from multiple eval files
When agentflow eval runs more than one eval file, results from all files are merged into a single combined report. The merged report has eval_set_id="combined_eval" and contains all cases from all files.
If you run multiple files manually, merge them the same way:
from agentflow.qa.evaluation import EvalReport
merged = EvalReport.create(
    eval_set_id="combined",
    eval_set_name="Combined Evaluation",
    results=[*report_a.results, *report_b.results],
)
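To make it concrete what a combined summary contains, here is a rough sketch that recomputes the summary fields from a merged list of per-case results. It works on plain dictionaries shaped like the entries in the JSON report and is only an illustration under that assumption; the function name summarize is hypothetical, and how EvalReport.create derives its own summary internally is not documented here.

```python
from collections import defaultdict
from typing import Any


def summarize(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Recompute summary fields for a merged list of per-case result dicts."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])

    # Collect every score per criterion so the average spans all merged cases.
    scores: dict[str, list[float]] = defaultdict(list)
    for r in results:
        for name, cr in r["criteria_results"].items():
            scores[name].append(cr["score"])

    return {
        "total_cases": total,
        "passed_cases": passed,
        "failed_cases": total - passed,
        # Assumption: an empty run is reported as 0.0 rather than raising.
        "pass_rate": passed / total if total else 0.0,
        "criteria_scores": {n: sum(v) / len(v) for n, v in scores.items()},
    }
```

Note that the per-criterion averages are taken over all merged cases, so a criterion that is only active in one eval file is averaged over that file's cases alone.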
Interpreting results
Pass rate
A case passes only when all enabled criteria meet their thresholds. The pass rate is passed_cases / total_cases.
- 1.0 (100%): every criterion passed for every case
- < 1.0: at least one case has at least one criterion below threshold
The agentflow eval CLI exits with code 1 when the pass rate is below the configured threshold, or below 1.0 when no threshold is configured. It exits with code 0 when every case passes, or when the configured threshold is met and no errors occurred.
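The exit-code rule can be sketched as a small function. This is an interpretation of the rule as described in this section, not the CLI's actual implementation: the names pass_rate, exit_code, threshold, and had_errors are hypothetical, and the default threshold of 1.0 is an assumption.

```python
def pass_rate(passed_cases: int, total_cases: int) -> float:
    """Pass rate as passed_cases / total_cases."""
    if total_cases == 0:
        return 0.0  # assumption: an empty run counts as 0.0
    return passed_cases / total_cases


def exit_code(rate: float, threshold: float = 1.0, had_errors: bool = False) -> int:
    """Return 0 when the rate meets the threshold and no errors occurred, else 1."""
    return 0 if (rate >= threshold and not had_errors) else 1
```

Under these assumptions, a run with 4 of 5 cases passing fails the default gate but would pass a gate configured at 0.75.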
Diagnosing failures
- Open the HTML report and look at the per-case criterion bars.
- Find which criterion scored lowest.
- If it is a trajectory criterion, the trajectory section shows which tool calls were expected vs actual.
- If it is an LLM criterion, the detail section shows the judge's reasoning (when include_details=True).
- Adjust the eval set (add missing tool expectations), tighten the prompt, or revise the threshold.
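The first two steps above can also be approximated in a script when many cases fail at once. The sketch below works on a JSON report loaded into a dictionary and counts how often each criterion is the lowest-scoring one among failed cases; the function names weakest_criterion and triage are hypothetical, and only the keys shown in the simplified JSON structure are assumed.

```python
from collections import Counter
from typing import Any


def weakest_criterion(result: dict[str, Any]) -> tuple[str, float]:
    """Return (criterion name, score) for the lowest-scoring criterion of one case."""
    criteria = result["criteria_results"]
    name = min(criteria, key=lambda key: criteria[key]["score"])
    return name, criteria[name]["score"]


def triage(report: dict[str, Any]) -> Counter:
    """Count how often each criterion is the weakest one among failed cases."""
    counts: Counter = Counter()
    for result in report["results"]:
        # Skip passing cases and cases that carry no criterion results.
        if result["passed"] or not result["criteria_results"]:
            continue
        name, _score = weakest_criterion(result)
        counts[name] += 1
    return counts
```

Calling triage(report).most_common() then gives a rough ordering of which criterion to investigate first.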
Next steps
- Eval sets — defining test cases
- Criteria reference — understanding each score
- Presets — choosing the right config
- How to run evaluations — CLI reference