Evaluation Reports

After every evaluation run, AgentFlow generates structured reports so you can understand what passed, what failed, and why. Console, JSON, and HTML reports are generated by default, with no flags required; JUnit XML is opt-in.

Report types

Format	File	Best for
HTML	`eval_reports/<id>_<timestamp>.html`	Human review — visual dashboard
JSON	`eval_reports/<id>_<timestamp>.json`	Programmatic analysis, dashboards
JUnit XML	`eval_reports/<id>_<timestamp>.xml`	CI systems (GitHub Actions test summary)
Console	stdout	Quick local feedback during development
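Because file reports are named `<id>_<timestamp>`, a script can pick out the newest report for a given eval set. The sketch below is one way to do that. It assumes the timestamp looks like `20260519_142301` (inferred from the example file name on this page, not from a documented format); `parse_report_name` and `latest_report` are hypothetical helpers, not part of AgentFlow.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple

# Assumed timestamp layout, inferred from names such as
# "weather-agent-regression_20260519_142301.html".
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def parse_report_name(path: Path) -> Optional[Tuple[str, datetime]]:
    """Split '<id>_<timestamp>' into (id, datetime); return None if it does not match."""
    parts = path.stem.rsplit("_", 2)
    if len(parts) != 3:
        return None
    eval_id, date_part, time_part = parts
    try:
        stamp = datetime.strptime(f"{date_part}_{time_part}", TIMESTAMP_FORMAT)
    except ValueError:
        return None
    return eval_id, stamp


def latest_report(report_dir: str, eval_id: str, suffix: str = ".json") -> Optional[Path]:
    """Return the newest report file for eval_id with the given suffix, or None."""
    candidates = []
    for path in Path(report_dir).glob(f"*{suffix}"):
        parsed = parse_report_name(path)
        if parsed and parsed[0] == eval_id:
            candidates.append((parsed[1], path))
    if not candidates:
        return None
    # Compare on the parsed timestamp only.
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_report("eval_reports", "weather-agent-regression", ".html")` would return the most recent HTML report for that eval set, if one exists.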

HTML report

The HTML report is a self-contained dashboard that opens in any browser. It contains:

Summary header — eval set name, total cases, passed/failed count, overall pass rate
Criterion bar chart — a breakdown of scores for each active criterion across all cases
Per-case detail — expandable sections showing the query, actual response, expected response, and score for each criterion per case

Open it manually:

open eval_reports/weather-agent-regression_20260519_142301.html

Or open automatically after the run:

agentflow eval --open

JSON report

The JSON report contains every piece of data from the run in a machine-readable format. You can consume it in post-processing scripts, ingest it into a metrics dashboard, or archive it for trend analysis.

Structure (simplified):

{
  "eval_set_id": "weather-agent",
  "eval_set_name": "Weather Agent Regression",
  "summary": {
    "total_cases": 5,
    "passed_cases": 4,
    "failed_cases": 1,
    "pass_rate": 0.8,
    "criteria_scores": {
      "tool_trajectory_avg_score": 1.0,
      "rouge_match": 0.72
    }
  },
  "results": [
    {
      "case_id": "london_weather",
      "passed": true,
      "criteria_results": {
        "tool_trajectory_avg_score": {"score": 1.0, "passed": true},
        "rouge_match": {"score": 0.65, "passed": true}
      },
      "actual_response": "The weather in London is sunny.",
      "tool_calls": [{"name": "get_weather", "args": {"location": "London"}}]
    }
  ]
}
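As an illustration of post-processing, the sketch below reads a report in the simplified structure above and lists the criteria that failed for each failing case. It relies only on the fields shown in that structure. The `paris_weather` case and its scores are made up for the example, and `failing_criteria` is a hypothetical helper.

```python
import json

# Hypothetical report data in the simplified structure shown above; the failing
# case and its scores are invented for illustration.
SAMPLE_REPORT = """
{
  "eval_set_id": "weather-agent",
  "summary": {"total_cases": 2, "passed_cases": 1, "failed_cases": 1, "pass_rate": 0.5},
  "results": [
    {"case_id": "london_weather", "passed": true,
     "criteria_results": {"rouge_match": {"score": 0.65, "passed": true}}},
    {"case_id": "paris_weather", "passed": false,
     "criteria_results": {
       "tool_trajectory_avg_score": {"score": 1.0, "passed": true},
       "rouge_match": {"score": 0.31, "passed": false}}}
  ]
}
"""


def failing_criteria(report: dict) -> dict:
    """Map each failed case_id to the criteria that did not pass, with their scores."""
    failures = {}
    for result in report.get("results", []):
        if result.get("passed"):
            continue
        failures[result["case_id"]] = {
            name: cr["score"]
            for name, cr in result.get("criteria_results", {}).items()
            if not cr.get("passed")
        }
    return failures


report = json.loads(SAMPLE_REPORT)
for case_id, criteria in failing_criteria(report).items():
    print(case_id, criteria)
```

In a real script you would replace `SAMPLE_REPORT` with the contents of a file under `eval_reports/`.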

JUnit XML report

JUnit XML output lets CI systems (GitHub Actions, Jenkins, GitLab CI) display eval results as structured test results — with pass/fail summaries, per-case details, and failure messages in the test tab.

Enable it via ReporterConfig:

from agentflow.qa.evaluation import ReporterConfig, ReporterManager

manager = ReporterManager(
    ReporterConfig(
        output_dir="eval_reports",
        junit_xml=True,
        html=True,
        json_report=True,
    )
)
manager.run_all(report)

When you use the agentflow eval command, JUnit XML is off by default. To turn it on, enable it programmatically in your run() function.

GitHub Actions example using the JUnit XML output:

- name: Run evaluations
  run: agentflow eval

- name: Publish eval results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: eval_reports/*.xml

EvalReport object

When you run evaluations programmatically, AgentEvaluator.evaluate() returns an EvalReport:

from agentflow.qa.evaluation import AgentEvaluator, EvalPresets

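# app, collector, and eval_set are assumed to be defined elsewhere.
evaluator = AgentEvaluator(app, collector, config=EvalPresets.tool_usage())
# evaluate() is awaited, so run this inside an async function.
report = await evaluator.evaluate(eval_set)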

Summary

summary = report.summary

summary.total_cases       # int — total number of cases
summary.passed_cases      # int — cases where all criteria passed
summary.failed_cases      # int — cases where at least one criterion failed
summary.pass_rate         # float 0.0–1.0
summary.criteria_scores   # dict[str, float] — average score per criterion

Per-case results

for result in report.results:
    print(result.case_id, "PASS" if result.passed else "FAIL")

    for criterion_key, cr in result.criteria_results.items():
        print(f"  {criterion_key}: {cr.score:.2f} ({'PASS' if cr.passed else 'FAIL'})")

Print to console

from agentflow.qa.evaluation import print_report

print_report(report)

ReporterConfig reference

ReporterConfig controls all automatic reporting. Pass it to ReporterManager when generating reports manually.

from agentflow.qa.evaluation import ReporterConfig

config = ReporterConfig(
    enabled=True,
    output_dir="eval_reports",
    console=True,
    json_report=True,
    html=True,
    junit_xml=False,
    verbose=True,
    include_details=True,
    include_trajectory=True,
    include_node_responses=True,
    include_actual_response=True,
    include_tool_call_details=True,
    timestamp_files=True,
)

Field	Default	Description
`enabled`	`True`	Master switch — when `False`, no reporters run
`output_dir`	`eval_reports`	Directory for generated files
`console`	`True`	Print summary to stdout
`json_report`	`True`	Write JSON file
`html`	`True`	Write HTML file
`junit_xml`	`False`	Write JUnit XML file
`verbose`	`True`	Show all cases; when `False`, show only failures
`include_details`	`True`	Include per-criterion details in file reports
`include_trajectory`	`True`	Include tool call trajectory in JSON
`include_node_responses`	`True`	Include per-node intermediate data
`include_actual_response`	`True`	Include the agent's final response
`include_tool_call_details`	`True`	Include tool arguments and results
`timestamp_files`	`True`	Append timestamp so runs do not overwrite each other

Running reporters manually

from agentflow.qa.evaluation import ReporterManager, ReporterConfig

manager = ReporterManager(
    ReporterConfig(
        output_dir="my_reports",
        html=True,
        json_report=True,
        console=False,
        timestamp_files=True,
    )
)

output = manager.run_all(report)

if output.html_path:
    print(f"HTML: {output.html_path}")
if output.json_path:
    print(f"JSON: {output.json_path}")
if output.has_errors:
    for name, err in output.errors:
        print(f"Reporter error [{name}]: {err}")

Combined reports from multiple eval files

When agentflow eval runs more than one eval file, results from all files are merged into a single combined report. The merged report has eval_set_id="combined_eval" and contains all cases from all files.

If you run multiple files manually, merge them the same way:

from agentflow.qa.evaluation import EvalReport

merged = EvalReport.create(
    eval_set_id="combined",
    eval_set_name="Combined Evaluation",
    results=[*report_a.results, *report_b.results],
)
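If you only have the JSON files from earlier runs, a similar merge can be approximated on plain dictionaries. The sketch below is not AgentFlow's implementation: `merge_json_reports` is a hypothetical helper that concatenates the `results` lists and recomputes the summary fields shown in the JSON structure, assuming each criterion score is averaged over the cases that report it.

```python
from collections import defaultdict
from statistics import mean


def merge_json_reports(reports: list, eval_set_id: str = "combined_eval") -> dict:
    """Concatenate results from several report dicts and recompute the summary."""
    results = [result for report in reports for result in report.get("results", [])]
    passed = sum(1 for result in results if result.get("passed"))

    # Collect every score seen for each criterion so the average spans all files.
    scores = defaultdict(list)
    for result in results:
        for name, cr in result.get("criteria_results", {}).items():
            scores[name].append(cr["score"])

    return {
        "eval_set_id": eval_set_id,
        "summary": {
            "total_cases": len(results),
            "passed_cases": passed,
            "failed_cases": len(results) - passed,
            # Assumption: an empty merge reports a pass rate of 0.0.
            "pass_rate": passed / len(results) if results else 0.0,
            "criteria_scores": {name: mean(vals) for name, vals in scores.items()},
        },
        "results": results,
    }
```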

Interpreting results

Pass rate

A case passes only when all enabled criteria meet their thresholds. The pass rate is passed_cases / total_cases.

1.0 (100%) — every criterion passed for every case
< 1.0 — at least one case has at least one criterion below threshold
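The rule above can be written out in a few lines. This is a minimal sketch of the arithmetic, not AgentFlow's code; each case is represented as a hypothetical mapping from criterion name to whether that criterion met its threshold.

```python
def case_passed(criteria_passed: dict) -> bool:
    """A case passes only if every enabled criterion met its threshold."""
    return all(criteria_passed.values())


def pass_rate(cases: list) -> float:
    """Compute passed_cases / total_cases for a list of per-case criterion results."""
    if not cases:
        return 0.0  # Assumption: an empty run is reported as 0.0.
    passed_cases = sum(1 for case in cases if case_passed(case))
    return passed_cases / len(cases)


# Five cases; the last one misses a single criterion, so it fails as a whole.
cases = [
    {"tool_trajectory_avg_score": True, "rouge_match": True},
    {"tool_trajectory_avg_score": True, "rouge_match": True},
    {"tool_trajectory_avg_score": True, "rouge_match": True},
    {"tool_trajectory_avg_score": True, "rouge_match": True},
    {"tool_trajectory_avg_score": True, "rouge_match": False},
]
print(pass_rate(cases))  # 0.8
```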

The agentflow eval CLI exits with code 1 when the pass rate is below the configured threshold, or below 1.0 if no threshold is configured. It exits with code 0 when the pass rate meets that threshold and no errors occurred.
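One way to read the exit-code rule is sketched below. It assumes the threshold defaults to 1.0 when none is configured and that any error forces a non-zero exit; `eval_exit_code` is a hypothetical function for illustration and does not describe the CLI's internals.

```python
def eval_exit_code(pass_rate: float, threshold: float = 1.0, had_errors: bool = False) -> int:
    """Return 0 when the run meets the threshold without errors, otherwise 1."""
    # Assumption: with no configured threshold, only a 100% pass rate succeeds.
    if had_errors or pass_rate < threshold:
        return 1
    return 0


print(eval_exit_code(1.0))                 # 0
print(eval_exit_code(0.8))                 # 1
print(eval_exit_code(0.8, threshold=0.8))  # 0
```

In CI, a non-zero exit code is what fails the job, so lowering the threshold is the lever for tolerating a known flaky case while you fix it.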

Diagnosing failures

Open the HTML report and look at the per-case criterion bars.
Find which criterion scored lowest.
If it is a trajectory criterion, the trajectory section shows which tool calls were expected vs actual.
If it is an LLM criterion, the detail section shows the judge's reasoning (when include_details=True).
Adjust the eval set (add missing tool expectations), tighten the prompt, or revise the threshold.
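The first two steps can also be done from the JSON report when many cases fail at once. The sketch below counts, across failed cases only, which criterion scored lowest most often. It assumes the per-case structure shown in the JSON report section; the sample results are made up, and `lowest_criterion` and `triage` are hypothetical helpers.

```python
from collections import Counter


def lowest_criterion(result: dict) -> tuple:
    """Return (name, score) of the lowest-scoring criterion for one case result."""
    scores = {name: cr["score"] for name, cr in result["criteria_results"].items()}
    name = min(scores, key=scores.get)
    return name, scores[name]


def triage(results: list) -> Counter:
    """Count, over failed cases only, how often each criterion scored lowest."""
    return Counter(lowest_criterion(r)[0] for r in results if not r["passed"])


# Invented results for illustration.
results = [
    {"case_id": "a", "passed": False, "criteria_results": {
        "tool_trajectory_avg_score": {"score": 1.0, "passed": True},
        "rouge_match": {"score": 0.2, "passed": False}}},
    {"case_id": "b", "passed": False, "criteria_results": {
        "tool_trajectory_avg_score": {"score": 0.9, "passed": True},
        "rouge_match": {"score": 0.4, "passed": False}}},
    {"case_id": "c", "passed": False, "criteria_results": {
        "tool_trajectory_avg_score": {"score": 0.0, "passed": False},
        "rouge_match": {"score": 0.7, "passed": True}}},
    {"case_id": "d", "passed": True, "criteria_results": {
        "rouge_match": {"score": 0.9, "passed": True}}},
]

for criterion, count in triage(results).most_common():
    print(f"{criterion}: lowest in {count} failed case(s)")
```

The criterion at the top of this list is usually the best place to start reading per-case details.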

Next steps

Eval sets — defining test cases
Criteria reference — understanding each score
Presets — choosing the right config
How to run evaluations — CLI reference
