Skip to main content

Building Eval Sets

An EvalSet is a named collection of test cases (EvalCase). You build one using the fluent EvalSetBuilder API, load it from a JSON file, or create it inline.

EvalSetBuilder

Single-turn case

from agentflow.qa.evaluation import EvalSetBuilder

eval_set = (
EvalSetBuilder("customer-support")
.add_case(
query="How do I reset my password?",
expected="visit the account settings page",
case_id="reset_password",
)
.build()
)

query is the user's input. expected is the response the agent should produce (used by response-matching criteria). case_id is optional — auto-generated if omitted.

Case with expected tools

Provide expected_tools as a list of tool names or ToolCall objects:

from agentflow.qa.evaluation import EvalSetBuilder, ToolCall

eval_set = (
EvalSetBuilder("weather-agent")
.add_case(
query="Weather in London?",
expected="The weather in London is sunny.",
expected_tools=["get_weather"], # tool names only
)
.add_case(
query="Weather in Tokyo?",
expected="Raining in Tokyo.",
expected_tools=[
ToolCall(name="get_weather", args={"location": "Tokyo"}),
], # with expected args
)
.build()
)

When expected_tools contains ToolCall objects with args, trajectory criteria can optionally verify the arguments too (controlled by check_args in CriterionConfig).

Case with expected node order

Use expected_node_order to verify the graph visited nodes in a specific sequence:

.add_case(
query="Search and summarise the latest AI news",
expected="Here is a summary...",
expected_tools=["search_web"],
expected_node_order=["MAIN", "TOOL", "MAIN"],
)

Focused tool test

add_tool_test is a shortcut for cases where a single tool call is the primary concern:

.add_tool_test(
query="What is the weather in Berlin?",
tool_name="get_weather",
tool_args={"location": "Berlin"},
expected_response="Berlin", # optional; defaults to "Result from get_weather"
case_id="berlin_weather",
)

Multi-turn conversation

.add_multi_turn(
conversation=[
("Hello", "Hi! How can I help?"),
("What can you do?", "I can check weather, search the web, and more."),
("Check weather in Paris", "It is 18°C in Paris."),
],
expected_tools=["get_weather"],
case_id="multi_turn_weather",
)

The evaluator runs the agent for each turn and scores the final state.

Quick builder from pairs

from agentflow.qa.evaluation import EvalSetBuilder

eval_set = EvalSetBuilder.quick(
("Hello", "Hi!"),
("What is 2+2?", "4"),
("Capital of France?", "Paris"),
)

From a list of conversations

conversations = [
{"user": "Hello", "assistant": "Hi!"},
{"user": "Bye", "assistant": "Goodbye!"},
]
eval_set = EvalSetBuilder.from_conversations(conversations, name="smoke-tests")

Save and load

Save to JSON

# Build and save in one step
eval_set = EvalSetBuilder("weather-tests").add_case(...).save("evals/weather.json")

# Or save an existing eval set
eval_set.to_file("evals/weather.json")

Load from JSON

from agentflow.qa.evaluation import EvalSet

eval_set = EvalSet.from_file("evals/weather.json")

Load and modify

builder = EvalSetBuilder.from_file("evals/weather.json")
builder.add_case(query="Weather in Seoul?", expected="Seoul")
eval_set = builder.build()

EvalCase directly

You can construct EvalCase objects without the builder:

from agentflow.qa.evaluation import EvalCase, ToolCall

case = EvalCase.single_turn(
eval_id="london_weather",
user_query="Weather in London?",
expected_response="It is sunny in London.",
expected_tools=[ToolCall(name="get_weather", args={"location": "London"})],
expected_node_order=["MAIN", "TOOL", "MAIN"],
name="London weather check",
description="Verifies the weather tool is called for London queries",
)

Multi-turn:

case = EvalCase.multi_turn(
eval_id="multi_weather",
conversation=[
("Weather in London?", "Sunny."),
("And in Tokyo?", "Rainy."),
],
expected_tools=[ToolCall(name="get_weather")],
)

ToolCall reference

from agentflow.qa.evaluation import ToolCall

# Name only (args not checked)
ToolCall(name="get_weather")

# With args (args checked when check_args=True in CriterionConfig)
ToolCall(name="get_weather", args={"location": "London"})

ToolCall.args defaults to {}. Argument checking is off by default — enable it per criterion with CriterionConfig.trajectory(check_args=True).


EvalSet structure

AttributeTypeDescription
eval_set_idstrUUID auto-generated by the builder
namestrHuman-readable label shown in reports
eval_caseslist[EvalCase]The test cases
EvalCase attributeTypeDescription
eval_idstrUnique ID for this case
namestrOptional label shown in reports
descriptionstrOptional details shown in reports
expected_responsestrWhat the final response should say
expected_toolslist[ToolCall] | NoneTool calls expected during execution
expected_node_orderlist[str] | NoneNode visit sequence expected
invocationslist[Invocation]One per conversation turn

Next steps