Evaluation Reports

After every evaluation run, AgentFlow generates structured reports so you can understand what passed, what failed, and why. Console, JSON, and HTML reports are generated by default with no flags required; JUnit XML is opt-in.


Report types

| Format | File | Best for |
| --- | --- | --- |
| HTML | eval_reports/<id>_<timestamp>.html | Human review (visual dashboard) |
| JSON | eval_reports/<id>_<timestamp>.json | Programmatic analysis, dashboards |
| JUnit XML | eval_reports/<id>_<timestamp>.xml | CI systems (GitHub Actions test summary) |
| Console | stdout | Quick local feedback during development |

HTML report

The HTML report is a self-contained dashboard that opens in any browser. It contains:

  • Summary header — eval set name, total cases, passed/failed count, overall pass rate
  • Criterion bar chart — a breakdown of scores for each active criterion across all cases
  • Per-case detail — expandable sections showing the query, actual response, expected response, and score for each criterion per case

Open it manually:

open eval_reports/weather-agent-regression_20260519_142301.html

Or open automatically after the run:

agentflow eval --open

JSON report

The JSON report contains every piece of data from the run in a machine-readable format. You can consume it in post-processing scripts, ingest it into a metrics dashboard, or archive it for trend analysis.

Structure (simplified):

{
  "eval_set_id": "weather-agent",
  "eval_set_name": "Weather Agent Regression",
  "summary": {
    "total_cases": 5,
    "passed_cases": 4,
    "failed_cases": 1,
    "pass_rate": 0.8,
    "criteria_scores": {
      "tool_trajectory_avg_score": 1.0,
      "rouge_match": 0.72
    }
  },
  "results": [
    {
      "case_id": "london_weather",
      "passed": true,
      "criteria_results": {
        "tool_trajectory_avg_score": {"score": 1.0, "passed": true},
        "rouge_match": {"score": 0.65, "passed": true}
      },
      "actual_response": "The weather in London is sunny.",
      "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]
    }
  ]
}
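
As a rough sketch of post-processing, the snippet below reads a report shaped like the simplified structure above and lists the failing cases. The inline sample, the `paris_weather` case, and the `failed_cases` helper are illustrative assumptions; a real report may carry more fields than shown here.

```python
import json

# Inline sample mirroring the simplified structure above. A real script would
# json.load() a file from eval_reports/ instead. "paris_weather" is made up.
SAMPLE_REPORT = """
{
  "eval_set_id": "weather-agent",
  "summary": {"total_cases": 2, "passed_cases": 1, "failed_cases": 1, "pass_rate": 0.5},
  "results": [
    {"case_id": "london_weather", "passed": true,
     "criteria_results": {"rouge_match": {"score": 0.65, "passed": true}}},
    {"case_id": "paris_weather", "passed": false,
     "criteria_results": {"rouge_match": {"score": 0.31, "passed": false}}}
  ]
}
"""


def failed_cases(report: dict) -> list[str]:
    """Return the case_id of every result whose "passed" flag is false."""
    return [r["case_id"] for r in report["results"] if not r["passed"]]


report = json.loads(SAMPLE_REPORT)
print(f"pass rate: {report['summary']['pass_rate']:.0%}")
print("failed:", failed_cases(report))
```

The same function works on an archived report, which makes it easy to compare failing cases between two runs.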

JUnit XML report

JUnit XML output lets CI systems (GitHub Actions, Jenkins, GitLab CI) display eval results as structured test results — with pass/fail summaries, per-case details, and failure messages in the test tab.

Enable it via ReporterConfig:

from agentflow.qa.evaluation import ReporterConfig, ReporterManager

manager = ReporterManager(
    ReporterConfig(
        output_dir="eval_reports",
        junit_xml=True,
        html=True,
        json_report=True,
    )
)
# report is the EvalReport returned by AgentEvaluator.evaluate()
manager.run_all(report)

JUnit XML is off by default when you use the agentflow eval command. To get JUnit output from CLI runs, enable it programmatically in your run() function.

GitHub Actions example using the JUnit XML output:

- name: Run evaluations
  run: agentflow eval

- name: Publish eval results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: eval_reports/*.xml
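
To check locally what a CI system would pick up, a JUnit file can be read with the Python standard library. This is only a sketch: the sample uses the common JUnit layout (`testsuite`, `testcase`, `failure`), and the attribute names and failure message are assumptions rather than AgentFlow's confirmed output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the common JUnit layout. AgentFlow's exact
# attributes and failure messages may differ.
SAMPLE_JUNIT = """<testsuite name="weather-agent" tests="2" failures="1">
  <testcase name="london_weather"/>
  <testcase name="paris_weather">
    <failure message="rouge_match below threshold"/>
  </testcase>
</testsuite>"""


def junit_failures(xml_text: str) -> dict[str, str]:
    """Map each failing testcase name to its failure message."""
    root = ET.fromstring(xml_text)
    failures = {}
    # iter() walks the whole tree, so a <testsuites> wrapper is handled too.
    for case in root.iter("testcase"):
        failure = case.find("failure")
        if failure is not None:
            failures[case.get("name")] = failure.get("message", "")
    return failures


print(junit_failures(SAMPLE_JUNIT))
```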

EvalReport object

When running programmatically, AgentEvaluator.evaluate() returns an EvalReport:

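from agentflow.qa.evaluation import AgentEvaluator, EvalPresets

# app, collector, and eval_set are created elsewhere in your test setup
evaluator = AgentEvaluator(app, collector, config=EvalPresets.tool_usage())
# evaluate() is awaited, so call it from inside an async function
report = await evaluator.evaluate(eval_set)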

Summary

summary = report.summary

summary.total_cases # int — total number of cases
summary.passed_cases # int — cases where all criteria passed
summary.failed_cases # int — cases where at least one criterion failed
summary.pass_rate # float 0.0–1.0
summary.criteria_scores # dict[str, float] — average score per criterion
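
The relationship between these fields can be sketched by recomputing them from per-case results. The code below is a stand-in under stated assumptions, not AgentFlow's implementation: it uses plain dicts shaped like the JSON report's results, made-up case ids and scores, and it treats an empty run as a pass rate of 0.0.

```python
from statistics import mean

# Stand-in results shaped like the "results" list of the JSON report.
# Case ids and scores are made up for illustration.
results = [
    {"case_id": "a", "passed": True,
     "criteria_results": {"rouge_match": {"score": 0.75, "passed": True}}},
    {"case_id": "b", "passed": False,
     "criteria_results": {"rouge_match": {"score": 0.25, "passed": False}}},
]


def summarize(results: list[dict]) -> dict:
    """Recompute the summary fields from per-case results."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    # Collect every score seen for each criterion, then average them.
    scores: dict[str, list[float]] = {}
    for r in results:
        for key, cr in r["criteria_results"].items():
            scores.setdefault(key, []).append(cr["score"])
    return {
        "total_cases": total,
        "passed_cases": passed,
        "failed_cases": total - passed,
        # Assumption: an empty run reports 0.0 rather than raising.
        "pass_rate": passed / total if total else 0.0,
        "criteria_scores": {k: mean(v) for k, v in scores.items()},
    }


summary = summarize(results)
print(summary)
```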

Per-case results

for result in report.results:
    print(result.case_id, "PASS" if result.passed else "FAIL")

    for criterion_key, cr in result.criteria_results.items():
        print(f"  {criterion_key}: {cr.score:.2f} ({'PASS' if cr.passed else 'FAIL'})")

Or use print_report to print the report:

from agentflow.qa.evaluation import print_report

print_report(report)

ReporterConfig reference

ReporterConfig controls all automatic reporting. Pass it to ReporterManager when generating reports manually.

from agentflow.qa.evaluation import ReporterConfig

config = ReporterConfig(
    enabled=True,
    output_dir="eval_reports",
    console=True,
    json_report=True,
    html=True,
    junit_xml=False,
    verbose=True,
    include_details=True,
    include_trajectory=True,
    include_node_responses=True,
    include_actual_response=True,
    include_tool_call_details=True,
    timestamp_files=True,
)

| Field | Default | Description |
| --- | --- | --- |
| enabled | True | Master switch; when False, no reporters run |
| output_dir | eval_reports | Directory for generated files |
| console | True | Print summary to stdout |
| json_report | True | Write JSON file |
| html | True | Write HTML file |
| junit_xml | False | Write JUnit XML file |
| verbose | True | Show all cases; when False, show only failures |
| include_details | True | Include per-criterion details in file reports |
| include_trajectory | True | Include tool call trajectory in JSON |
| include_node_responses | True | Include per-node intermediate data |
| include_actual_response | True | Include the agent's final response |
| include_tool_call_details | True | Include tool arguments and results |
| timestamp_files | True | Append timestamp so runs do not overwrite each other |
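
With timestamped files, every run leaves a new file in the output directory, so scripts often need the most recent one. The helper below is a hypothetical sketch: `latest_report` is not part of AgentFlow, and it assumes the `<id>_<timestamp>` naming with a `YYYYMMDD_HHMMSS` timestamp, as in the example filename earlier on this page, so that names sort chronologically.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_report(output_dir: Path, eval_set_id: str, suffix: str = ".json") -> Optional[Path]:
    """Return the newest <id>_<timestamp><suffix> file, or None if none exist.

    Assumes the timestamp is YYYYMMDD_HHMMSS, so sorting by name is
    the same as sorting by time.
    """
    candidates = sorted(output_dir.glob(f"{eval_set_id}_*{suffix}"))
    return candidates[-1] if candidates else None


# Demo against a temporary directory with three fake report files.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for stamp in ("20260519_142301", "20260520_090000", "20260518_235959"):
        (out / f"weather-agent_{stamp}.json").write_text("{}")
    newest_name = latest_report(out, "weather-agent").name
    missing = latest_report(out, "other-agent")

print(newest_name)
```

Note that an id which is a prefix of another id followed by an underscore (for example `foo` and `foo_bar`) would match both sets of files, so a stricter pattern may be needed in that case.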

Running reporters manually

from agentflow.qa.evaluation import ReporterManager, ReporterConfig

manager = ReporterManager(
    ReporterConfig(
        output_dir="my_reports",
        html=True,
        json_report=True,
        console=False,
        timestamp_files=True,
    )
)

output = manager.run_all(report)

if output.html_path:
    print(f"HTML: {output.html_path}")
if output.json_path:
    print(f"JSON: {output.json_path}")
if output.has_errors:
    for name, err in output.errors:
        print(f"Reporter error [{name}]: {err}")

Combined reports from multiple eval files

When agentflow eval runs more than one eval file, results from all files are merged into a single combined report. The merged report has eval_set_id="combined_eval" and contains all cases from all files.

If you run multiple files manually, merge them the same way:

from agentflow.qa.evaluation import EvalReport

merged = EvalReport.create(
    eval_set_id="combined",
    eval_set_name="Combined Evaluation",
    results=[*report_a.results, *report_b.results],
)
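
One consequence of merging at the case level is worth spelling out: the merged pass rate is weighted by case count, so it generally differs from the average of the per-file pass rates. The arithmetic below uses made-up counts and assumes the merged summary is recomputed from the combined cases as passed_cases / total_cases.

```python
# Made-up pass counts for two hypothetical eval files.
counts_a = {"total_cases": 10, "passed_cases": 9}  # pass rate 0.90
counts_b = {"total_cases": 2, "passed_cases": 1}   # pass rate 0.50


def pass_rate(counts: dict) -> float:
    """passed_cases / total_cases for one set of counts."""
    return counts["passed_cases"] / counts["total_cases"]


# Naive approach: average the two per-file rates.
mean_of_rates = (pass_rate(counts_a) + pass_rate(counts_b)) / 2

# Merged approach: pool the cases first, then compute one rate.
merged = {
    "total_cases": counts_a["total_cases"] + counts_b["total_cases"],
    "passed_cases": counts_a["passed_cases"] + counts_b["passed_cases"],
}
merged_rate = pass_rate(merged)

print(f"mean of per-file rates: {mean_of_rates:.3f}")  # 0.700
print(f"merged pass rate:       {merged_rate:.3f}")    # 0.833
```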

Interpreting results

Pass rate

A case passes only when all enabled criteria meet their thresholds. The pass rate is passed_cases / total_cases.

  • 1.0 (100%) — every criterion passed for every case
  • < 1.0 — at least one case has at least one criterion below threshold

The agentflow eval CLI exits with code 1 when the pass rate is below the configured threshold, or below 1.0 when no threshold is configured. It exits with code 0 when the pass rate meets the threshold (a perfect 100% by default) and no errors occurred.
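
One reading of this exit-code rule, sketched as a plain function. The function name, its parameters, and the default threshold of 1.0 are assumptions made for illustration; the CLI's actual handling may differ.

```python
def exit_code(pass_rate: float, threshold: float = 1.0, had_errors: bool = False) -> int:
    """Return 0 when the run counts as successful, 1 otherwise.

    Assumption: with no configured threshold, a perfect pass rate is required.
    """
    if had_errors:
        return 1
    return 0 if pass_rate >= threshold else 1


print(exit_code(1.0))                   # perfect run, default threshold
print(exit_code(0.8))                   # below the default threshold
print(exit_code(0.8, threshold=0.75))   # meets a configured threshold
```

A CI script can apply the same check to the pass_rate field of the JSON report if it needs a gate that is independent of the CLI's exit status.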

Diagnosing failures

  1. Open the HTML report and look at the per-case criterion bars.
  2. Find which criterion scored lowest.
  3. If it is a trajectory criterion, the trajectory section shows which tool calls were expected vs actual.
  4. If it is an LLM criterion, the detail section shows the judge's reasoning (when include_details=True).
  5. Adjust the eval set (add missing tool expectations), tighten the prompt, or revise the threshold.
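
Steps 1 and 2 can also be done from the JSON report when a browser is not at hand. The sketch below finds the lowest-scoring criterion for each failing case; the sample data and the `weakest_criterion` helper are illustrative assumptions, shaped like the simplified JSON structure shown earlier.

```python
# Stand-in for the "results" list of a JSON report. Values are made up.
results = [
    {"case_id": "london_weather", "passed": True,
     "criteria_results": {
         "tool_trajectory_avg_score": {"score": 1.0, "passed": True},
         "rouge_match": {"score": 0.65, "passed": True}}},
    {"case_id": "paris_weather", "passed": False,
     "criteria_results": {
         "tool_trajectory_avg_score": {"score": 1.0, "passed": True},
         "rouge_match": {"score": 0.31, "passed": False}}},
]


def weakest_criterion(result: dict) -> tuple[str, float]:
    """Return (criterion_key, score) for the lowest-scoring criterion of a case."""
    key, cr = min(result["criteria_results"].items(), key=lambda kv: kv[1]["score"])
    return key, cr["score"]


# Report only the failing cases, each with its weakest criterion.
diagnosis = {r["case_id"]: weakest_criterion(r) for r in results if not r["passed"]}
for case_id, (key, score) in diagnosis.items():
    print(f"{case_id}: lowest criterion is {key} ({score:.2f})")
```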

Next steps