
User Simulation

Standard evaluation uses fixed test cases: you define the query and the expected response. User simulation flips this — an LLM plays the role of a user and drives a real conversation with your agent, checking whether the agent achieves a set of stated goals.

This is the right tool when:

  • You want to test how the agent handles unpredictable conversation paths
  • The "correct" answer cannot be stated as a fixed string
  • You need to verify multi-turn behaviour at scale
  • You want to test edge cases without writing each one by hand

How it works

ConversationScenario
  ↓
UserSimulator (LLM as user) ←→ CompiledGraph (your agent)   [turn loop]
  ↓
Goal checking per turn (LLM)
  ↓
SimulationResult + SimulationGoalsCriterion scoring
  ↓
EvalCaseResult → HTML / JSON report

  1. UserSimulator starts with starting_prompt from the scenario.
  2. It sends the message to the agent graph and collects the response.
  3. It checks (using an LLM) which goals from the scenario have been achieved.
  4. If all goals are achieved, the simulation ends with completed=True.
  5. If not, it generates the next user message based on remaining goals.
  6. This repeats until all goals are achieved or max_turns is reached.
  7. SimulationGoalsCriterion scores the full conversation transcript against the stated goals.
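
The turn loop in steps 1 to 6 can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `LoopResult`, `run_turn_loop`, and the three callables (`agent`, `check_goals`, `next_user_message`) are hypothetical stand-ins for the agent graph and the two LLM calls, and the demo replaces them with keyword-matching stubs so the control flow can be followed without any model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    """Hypothetical stand-in for SimulationResult (not the library class)."""
    completed: bool
    turns: int
    goals_achieved: list[str]
    conversation: list[dict] = field(default_factory=list)


def run_turn_loop(
    starting_prompt: str,
    goals: list[str],
    max_turns: int,
    agent: Callable[[str], str],
    check_goals: Callable[[list[dict], list[str]], list[str]],
    next_user_message: Callable[[list[dict], list[str]], str],
) -> LoopResult:
    conversation: list[dict] = []
    achieved: list[str] = []
    user_message = starting_prompt

    for turn in range(1, max_turns + 1):
        # Steps 1-2: send the user message and collect the agent's reply.
        conversation.append({"role": "user", "content": user_message})
        conversation.append({"role": "assistant", "content": agent(user_message)})

        # Step 3: ask the goal checker which goals are satisfied so far.
        achieved = check_goals(conversation, goals)

        # Step 4: stop as soon as every goal is met.
        if len(achieved) == len(goals):
            return LoopResult(True, turn, achieved, conversation)

        # Step 5: otherwise build the next message from the remaining goals.
        remaining = [g for g in goals if g not in achieved]
        user_message = next_user_message(conversation, remaining)

    # Step 6: max_turns reached without meeting every goal.
    return LoopResult(False, max_turns, achieved, conversation)


# Stubs standing in for the agent and the two LLM calls.
def fake_agent(message: str) -> str:
    return f"Echo: {message}"


def keyword_checker(conversation: list[dict], goals: list[str]) -> list[str]:
    text = " ".join(t["content"].lower() for t in conversation)
    return [g for g in goals if g.lower() in text]


def ask_about(conversation: list[dict], remaining: list[str]) -> str:
    return f"Tell me about {remaining[0]}"


result = run_turn_loop(
    "Hi, what's the weather?", ["weather", "jacket"], 5,
    fake_agent, keyword_checker, ask_about,
)
print(result.completed, result.turns)  # True 2

capped = run_turn_loop(
    "Hi, what's the weather?", ["weather", "jacket"], 1,
    fake_agent, keyword_checker, ask_about,
)
print(capped.completed, capped.goals_achieved)  # False ['weather']
```

The second call shows the `max_turns` cap: the loop stops after one turn with only one of the two goals met, so `completed` stays `False`.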

The simplest way to run user simulations is via the agentflow eval CLI. You only write the scenarios; the CLI runs the simulator, scores the conversations, and produces the report, exactly as it does for regular eval cases.

Create an eval file with get_scenarios():

# evals/user_simulator_eval.py
from agentflow.qa.evaluation import ConversationScenario, UserSimulatorConfig


# Optional: override the simulator model for this file.
# If omitted, the CLI uses UserSimulatorConfig defaults (gemini-2.5-flash).
SIMULATOR_CONFIG = UserSimulatorConfig(
    model="gemini/gemini-2.5-flash",
    max_invocations=8,
    temperature=0.7,
)


def get_scenarios() -> list[ConversationScenario]:
    return [
        ConversationScenario(
            scenario_id="weather_travel_planning",
            description="User planning a trip wants weather info and packing advice",
            starting_prompt="Hi! I'm planning a trip to Paris this weekend.",
            conversation_plan=(
                "1. Ask about current weather in Paris\n"
                "2. Ask whether to bring a jacket\n"
                "3. Ask about outdoor sightseeing timing"
            ),
            goals=[
                "User receives weather information for Paris",
                "User gets clothing or packing advice",
                "User learns about outdoor activity timing",
            ],
            max_turns=8,
        ),
        ConversationScenario(
            scenario_id="flight_booking",
            description="User wants help finding a flight from London to New York",
            starting_prompt="I need to fly from London to New York next Friday.",
            goals=[
                "User receives flight options",
                "User gets pricing information",
            ],
            max_turns=10,
        ),
    ]

Run it:

agentflow eval evals/user_simulator_eval.py

# Or together with all other eval files in parallel
agentflow eval --parallel --max-concurrency 4

The CLI detects get_scenarios() (or a SCENARIOS module-level constant), runs each scenario, and produces the same HTML + JSON report as regular eval cases. Regular eval cases and simulation scenarios are mixed in the same flat pool and the same report.

What the CLI does internally:

  1. Detects get_scenarios() or SCENARIOS in the file
  2. Reads SIMULATOR_CONFIG if present, otherwise uses defaults
  3. Attaches SimulationGoalsCriterion (threshold 0.7) automatically
  4. Runs scenarios through BatchSimulator under the shared asyncio event loop
  5. Converts each SimulationResult to an EvalCaseResult — pass/fail based on goal score vs threshold
  6. Writes the report alongside all other eval cases
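
Steps 1, 2, and 5 can be illustrated with a small sketch. It is not the CLI's actual code: the helper names `discover_scenarios`, `read_simulator_config`, and `passes` are hypothetical, the precedence of `get_scenarios()` over `SCENARIOS` is an assumption, and so is the `>=` comparison against the threshold.

```python
import types


def discover_scenarios(module: types.ModuleType) -> list:
    """Return scenarios from get_scenarios() if defined, else from SCENARIOS.

    Assumption: the function form wins when both are present.
    """
    getter = getattr(module, "get_scenarios", None)
    if callable(getter):
        return list(getter())
    return list(getattr(module, "SCENARIOS", []))


def read_simulator_config(module: types.ModuleType, default=None):
    """Use the file's SIMULATOR_CONFIG if present, otherwise the default."""
    return getattr(module, "SIMULATOR_CONFIG", default)


def passes(goal_score: float, threshold: float = 0.7) -> bool:
    """Pass/fail from the goal score (assumption: score >= threshold passes)."""
    return goal_score >= threshold


# Demo with in-memory modules instead of real eval files.
with_constant = types.ModuleType("demo_constant_eval")
with_constant.SCENARIOS = ["refund", "recommendation"]

with_function = types.ModuleType("demo_function_eval")
with_function.get_scenarios = lambda: ["weather_travel_planning"]
with_function.SIMULATOR_CONFIG = {"temperature": 0.5}

print(discover_scenarios(with_constant))   # ['refund', 'recommendation']
print(discover_scenarios(with_function))   # ['weather_travel_planning']
print(read_simulator_config(with_constant, default="defaults"))  # defaults
print(passes(1.0), passes(0.5))            # True False
```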

ConversationScenario

from agentflow.qa.evaluation import ConversationScenario

scenario = ConversationScenario(
    scenario_id="travel_planning",
    description="User wants to plan a weekend trip and needs weather and flight info.",
    starting_prompt="I'm thinking of going somewhere warm this weekend.",
    conversation_plan=(
        "1. Ask about weather in potential destinations\n"
        "2. Narrow down to one destination\n"
        "3. Ask about flight options\n"
        "4. Confirm the plan"
    ),
    goals=[
        "Get weather info for at least one destination",
        "Receive flight or travel suggestions",
        "Have a concrete travel plan by end of conversation",
    ],
    max_turns=8,
)

| Field | Type | Description |
| --- | --- | --- |
| scenario_id | str | Unique identifier used in results and reports |
| description | str | What the user is trying to accomplish |
| starting_prompt | str | First message sent to the agent; if empty, the LLM generates one |
| conversation_plan | str | High-level flow description fed to the simulator LLM |
| goals | list[str] | What must be achieved for the simulation to count as complete |
| max_turns | int | Hard cap on conversation turns (default: 10) |
| metadata | dict | Arbitrary metadata passed through to results |

Writing good goals:

  • Be specific: "User gets the weather temperature for London" not "User learns about weather"
  • One idea per goal — the LLM judge checks each independently
  • Goals must be verifiable from the conversation transcript alone

UserSimulator

from agentflow.qa.evaluation import UserSimulator, UserSimulatorConfig

simulator = UserSimulator(
    model="gemini/gemini-2.5-flash",
    temperature=0.7,
    max_turns=10,
)

result = await simulator.run(graph, scenario)

Constructor parameters

| Parameter | Default | Description |
| --- | --- | --- |
| model | gemini/gemini-2.5-flash | LLM used to generate user messages and check goals |
| temperature | 0.7 | Generation temperature; higher values produce more varied user messages |
| max_turns | 10 | Default turn limit (overridden by scenario.max_turns) |
| config | None | Pass a UserSimulatorConfig instead of individual parameters |
| criteria | [] | List of BaseCriterion to run against the completed conversation |

Model support

| Model string | Provider |
| --- | --- |
| gemini/gemini-2.5-flash | Google GenAI |
| gemini-2.5-flash | Google GenAI |
| gpt-4o | OpenAI |
| gpt-4o-mini | OpenAI |

If the provider for the configured model fails, the simulator falls back to the other provider. If both fail, it emits a neutral user message and the conversation continues.
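
The fallback behaviour described here can be sketched as a small helper. This is an illustration under assumptions, not the simulator's code: `generate_with_fallback`, the provider callables, and the text of `NEUTRAL_MESSAGE` are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical placeholder; the real neutral message is not documented here.
NEUTRAL_MESSAGE = "Could you tell me more?"


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    neutral_message: str = NEUTRAL_MESSAGE,
) -> str:
    """Try each provider in order; return a neutral message if all fail."""
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except Exception:
            # Any provider error moves on to the next provider in the list.
            continue
    return neutral_message


# Stub providers standing in for real LLM clients.
def broken_provider(prompt: str) -> str:
    raise RuntimeError("provider unavailable")


def working_provider(prompt: str) -> str:
    return f"user reply to: {prompt}"


primary_down = generate_with_fallback("next turn", [broken_provider, working_provider])
both_down = generate_with_fallback("next turn", [broken_provider, broken_provider])
print(primary_down)  # user reply to: next turn
print(both_down)     # Could you tell me more?
```

Returning a neutral message instead of raising keeps the simulation running, at the cost of a turn that does not advance any goal.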

Via UserSimulatorConfig

from agentflow.qa.evaluation import UserSimulatorConfig, UserSimulator

config = UserSimulatorConfig(
    model="gemini/gemini-2.5-flash",
    max_invocations=12,
    temperature=0.5,
    thinking_enabled=False,
)

simulator = UserSimulator(config=config)

| Field | Default | Description |
| --- | --- | --- |
| model | gemini-2.5-flash | Simulator LLM |
| max_invocations | 10 | Maximum conversation turns |
| temperature | 0.7 | Generation temperature |
| thinking_enabled | False | Enable reasoning/thinking mode if supported |
| thinking_budget | 10240 | Token budget for thinking (when enabled) |

SimulationResult

simulator.run() returns a SimulationResult:

result = await simulator.run(graph, scenario)

print(result.completed)       # True if all goals achieved before max_turns
print(result.turns)           # Number of turns that ran
print(result.goals_achieved)  # List of goal strings that were met
print(result.error)           # None if no error, else error message

# Full conversation transcript
for turn in result.conversation:
    print(f"{turn['role'].upper()}: {turn['content']}")

# Criterion scores (if criteria were passed to UserSimulator)
print(result.criterion_scores)   # {"simulation_goals": 0.8}
print(result.criterion_details)  # {"simulation_goals": {"achieved_goals": [...], ...}}

| Attribute | Type | Description |
| --- | --- | --- |
| scenario_id | str | From the scenario |
| turns | int | Number of turns that ran |
| conversation | list[dict] | Full history: [{"role": "user"/"assistant", "content": "..."}] |
| goals_achieved | list[str] | Goals confirmed by the LLM goal-checker |
| completed | bool | True when all goals achieved before max_turns |
| error | str \| None | Error message if simulation failed mid-way |
| criterion_scores | dict[str, float] | Score per criterion (0.0–1.0) |
| criterion_details | dict[str, Any] | Full criterion output including reasoning |

SimulationGoalsCriterion

SimulationGoalsCriterion is an LLM-judge criterion designed specifically for use with UserSimulator. It receives the full conversation transcript and checks whether each goal was addressed at any point — not just in the final message.

The CLI attaches this criterion automatically when it detects get_scenarios(). For programmatic use:

from agentflow.qa.evaluation import (
    SimulationGoalsCriterion,
    CriterionConfig,
    UserSimulator,
)

judge = SimulationGoalsCriterion(
    config=CriterionConfig(
        threshold=0.7,
        judge_model="gemini-2.5-flash",
    )
)

simulator = UserSimulator(
    model="gemini/gemini-2.5-flash",
    criteria=[judge],
)

result = await simulator.run(graph, scenario)
# result.criterion_scores["simulation_goals"] → 0.67 (2 of 3 goals met)

Score: achieved_goals / total_goals

The criterion details include:

  • achieved_goals — list of goals confirmed as addressed
  • unachieved_goals — list of goals not addressed
  • reasoning — the judge's explanation
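
The score formula and its interaction with the threshold can be checked with a few lines of arithmetic. This is a sketch only: `goal_score` is a hypothetical helper, and returning 0.0 for a scenario with no goals is an assumption.

```python
def goal_score(achieved_goals: list[str], all_goals: list[str]) -> float:
    """Score = achieved_goals / total_goals, as stated above."""
    if not all_goals:
        return 0.0  # assumption: a scenario without goals scores 0
    return len(achieved_goals) / len(all_goals)


goals = [
    "Get weather info for at least one destination",
    "Receive flight or travel suggestions",
    "Have a concrete travel plan by end of conversation",
]
achieved = goals[:2]
unachieved = [g for g in goals if g not in achieved]

score = goal_score(achieved, goals)
print(f"{score:.2f}")   # 0.67
print(score >= 0.7)     # False
print(len(unachieved))  # 1
```

With three goals and a threshold of 0.7, meeting two goals (0.67) falls below the threshold, so such a scenario passes only when all three goals are met. Keep this in mind when choosing how many goals to list per scenario.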

Note: SimulationGoalsCriterion is designed exclusively for UserSimulator. Do not add it to a regular EvalConfig — in the standard AgentEvaluator flow, actual_response contains only the final response, not the full transcript, so the goal check would not see prior turns.


BatchSimulator

Run multiple scenarios concurrently with BatchSimulator. Each scenario gets its own isolated thread ID so checkpointer state never bleeds between runs.

from agentflow.qa.evaluation import (
    BatchSimulator,
    ConversationScenario,
    SimulationGoalsCriterion,
    CriterionConfig,
    UserSimulator,
)

judge = SimulationGoalsCriterion(config=CriterionConfig(threshold=0.7))
simulator = UserSimulator(model="gemini/gemini-2.5-flash", criteria=[judge])

batch = BatchSimulator(simulator=simulator, max_concurrency=5)

results = await batch.run_batch(graph, [scenario_a, scenario_b, scenario_c])

summary = batch.summary(results)
print(f"Completion rate: {summary['completion_rate']:.0%}")
print(f"Average turns: {summary['average_turns']:.1f}")

BatchSimulator parameters

| Parameter | Default | Description |
| --- | --- | --- |
| simulator | auto-created | Pre-configured UserSimulator; pass your own to include criteria |
| max_concurrency | 5 | Maximum scenarios running in parallel |
| **kwargs | | Forwarded to UserSimulator if no simulator is given |
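
The effect of `max_concurrency` and per-scenario thread IDs can be sketched with `asyncio.Semaphore`. This is one common way to bound concurrency, not necessarily how BatchSimulator does it: `run_bounded`, `fake_run`, and the `sim-<id>-<uuid>` thread ID format are hypothetical.

```python
import asyncio
import uuid


async def run_bounded(scenario_ids, run_one, max_concurrency=5):
    """Run run_one(scenario_id, thread_id) for every scenario, at most
    max_concurrency at a time, each with its own thread ID."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(scenario_id):
        # Hypothetical ID format; a fresh ID per run keeps state isolated.
        thread_id = f"sim-{scenario_id}-{uuid.uuid4().hex}"
        async with semaphore:
            return await run_one(scenario_id, thread_id)

    # gather returns results in the same order as the input scenarios.
    return await asyncio.gather(*(guarded(s) for s in scenario_ids))


async def demo():
    active = 0
    peak = 0
    thread_ids = []

    async def fake_run(scenario_id, thread_id):
        # Stub for a simulation run; tracks how many runs overlap.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        thread_ids.append(thread_id)
        return scenario_id

    results = await run_bounded(["a", "b", "c", "d"], fake_run, max_concurrency=2)
    return results, peak, thread_ids


results, peak, thread_ids = asyncio.run(demo())
print(results)               # ['a', 'b', 'c', 'd']
print(peak <= 2)             # True
print(len(set(thread_ids)))  # 4
```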

Batch summary fields

| Field | Description |
| --- | --- |
| total_scenarios | Total number of scenarios run |
| completed | Scenarios where all goals were achieved |
| completion_rate | completed / total_scenarios |
| total_goals_achieved | Sum of goals achieved across all scenarios |
| average_turns | Mean turns per scenario |
| errors | Number of scenarios that errored |
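
The summary fields follow directly from the per-scenario results. The sketch below recomputes them from plain dictionaries; `summarize` is a hypothetical helper, not `BatchSimulator.summary`, and returning 0.0 for an empty batch is an assumption.

```python
def summarize(results: list[dict]) -> dict:
    """Recompute the batch summary fields from per-scenario results."""
    total = len(results)
    completed = sum(1 for r in results if r["completed"])
    return {
        "total_scenarios": total,
        "completed": completed,
        # Assumption: an empty batch reports 0.0 rather than raising.
        "completion_rate": completed / total if total else 0.0,
        "total_goals_achieved": sum(len(r["goals_achieved"]) for r in results),
        "average_turns": sum(r["turns"] for r in results) / total if total else 0.0,
        "errors": sum(1 for r in results if r.get("error")),
    }


# Plain dicts mirroring the SimulationResult attributes used above.
sample = [
    {"completed": True, "turns": 4, "goals_achieved": ["g1", "g2"], "error": None},
    {"completed": False, "turns": 8, "goals_achieved": ["g1"], "error": None},
    {"completed": False, "turns": 3, "goals_achieved": [], "error": "timeout"},
    {"completed": True, "turns": 5, "goals_achieved": ["g1", "g2", "g3"], "error": None},
]

summary = summarize(sample)
print(f"{summary['completion_rate']:.0%}")  # 50%
print(f"{summary['average_turns']:.1f}")    # 5.0
print(summary["total_goals_achieved"])      # 6
print(summary["errors"])                    # 1
```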

Complete programmatic example

import asyncio
from agentflow.qa.evaluation import (
    BatchSimulator,
    ConversationScenario,
    CriterionConfig,
    SimulationGoalsCriterion,
    UserSimulator,
)
from graph.agent import app  # your compiled graph


async def run_simulation():
    judge = SimulationGoalsCriterion(
        config=CriterionConfig(threshold=0.7, judge_model="gemini-2.5-flash")
    )
    simulator = UserSimulator(
        model="gemini/gemini-2.5-flash",
        temperature=0.6,
        criteria=[judge],
    )
    batch = BatchSimulator(simulator=simulator, max_concurrency=3)

    scenarios = [
        ConversationScenario(
            scenario_id="customer_refund",
            description="Customer wants to initiate a refund for a damaged item.",
            starting_prompt="I received a damaged item and want a refund.",
            conversation_plan=(
                "1. Explain the damage\n"
                "2. Provide order details when asked\n"
                "3. Confirm refund is processed"
            ),
            goals=[
                "Agent acknowledges the damage",
                "Agent initiates or confirms a refund",
                "Customer has a clear resolution",
            ],
            max_turns=8,
        ),
        ConversationScenario(
            scenario_id="product_recommendation",
            description="Customer wants a laptop recommendation for video editing.",
            starting_prompt="I need a laptop for professional video editing.",
            goals=[
                "Agent asks about budget or requirements",
                "Agent recommends at least one specific product",
                "Recommendation includes RAM or GPU specs",
            ],
            max_turns=6,
        ),
    ]

    results = await batch.run_batch(app, scenarios)

    for result, scenario in zip(results, scenarios):
        status = "PASS" if result.completed else "FAIL"
        n_goals = len(scenario.goals)
        n_achieved = len(result.goals_achieved)
        print(f"{result.scenario_id}: {status} | {result.turns} turns | {n_achieved}/{n_goals} goals")

        sim_score = result.criterion_scores.get("simulation_goals")
        if sim_score is not None:
            print(f" SimulationGoals score: {sim_score:.2f}")
            details = result.criterion_details.get("simulation_goals", {})
            print(f" Achieved: {details.get('achieved_goals', [])}")
            print(f" Missing: {details.get('unachieved_goals', [])}")

    summary = batch.summary(results)
    print(f"\nCompletion rate: {summary['completion_rate']:.0%}")

asyncio.run(run_simulation())

Simulation vs standard evaluation

| | Standard EvalSet | User simulation |
| --- | --- | --- |
| Input | Fixed query string | LLM-generated messages |
| Expected output | Defined in the test case | Inferred from stated goals |
| Turn count | Single turn (or explicit multi-turn) | Dynamic, up to max_turns |
| Best for | Regression testing known inputs | Open-ended dialogue and goal achievement |
| CLI protocol | get_eval_set() | get_scenarios() |
| Scoring | Per-criterion scores | Goal achievement rate (0.0–1.0) |

Run both in CI to get full coverage:

# All evals and simulations in one run
agentflow eval --parallel --max-concurrency 8

Next steps