Skip to main content

Evaluation Criteria

Each EvalCase is scored against one or more criteria. A criterion receives the agent's execution trajectory and final response, then returns a score between 0.0 (fail) and 1.0 (pass). A case passes when every enabled criterion meets its threshold.

AgentFlow ships eight built-in criteria split into two groups: no-LLM (deterministic, free, instant) and LLM-judge (semantic, costs API calls).


No-LLM criteria

These criteria do not call any LLM. They are deterministic, free, and run in milliseconds. Use them as a first line of defence.

tool_name_match_score — Tool name matching

Checks that the set of tool names called by the agent matches the expected set. Order is not considered; only presence.

from agentflow.qa.evaluation import CriterionConfig

CriterionConfig.tool_name_match(threshold=1.0)
  • Score 1.0 — all expected tools were called, no unexpected tools
  • Score 0.0 — none of the expected tools were called
  • Partial scores — proportional to overlap

Use when you want a strict check that the right tools were selected but do not care about order or arguments.


tool_trajectory_avg_score — Tool sequence matching

Checks that the sequence of tool calls matches the expected sequence. Supports three match modes controlled by MatchType.

CriterionConfig.trajectory(
threshold=1.0,
match_type=MatchType.EXACT, # or IN_ORDER, ANY_ORDER
check_args=False,
)

MatchType

ValueMeaning
EXACTSame tools, same order, same count. Extra or missing tools fail the case.
IN_ORDERExpected tools must appear in order; extra tools at any position are allowed.
ANY_ORDERAll expected tools must appear; order and extras do not matter.

Argument checking

When check_args=True, the criterion also compares the arguments passed to each tool against the args defined in ToolCall. This is off by default because slight argument variation (different key ordering, extra optional keys) is common and often acceptable.

CriterionConfig.trajectory(
threshold=1.0,
match_type=MatchType.EXACT,
check_args=True, # verify arguments as well
)

node_order_match_score — Node visit order

Checks that the graph visited nodes in the expected sequence. Useful for verifying that routing logic is correct (e.g., MAIN → TOOL → MAIN → END).

CriterionConfig.node_order(
threshold=1.0,
match_type=MatchType.EXACT,
)

The same MatchType values apply as for trajectory matching.


rouge_match — ROUGE-1 token overlap

Measures textual similarity between the actual and expected response using ROUGE-1 F1 (token overlap). No API calls — instant and free.

CriterionConfig.rouge_match(threshold=0.5)

ROUGE-1 counts shared unigrams. A score of 0.5 means roughly half the words overlap. This is a useful fast check during development. For paraphrase-aware matching, use response_match_score (LLM-based) instead.

Typical thresholds:

ScenarioThreshold
Loose sanity check0.3 – 0.5
Moderate similarity0.5 – 0.7
Near-verbatim0.7+

contains_keywords — Keyword presence

Checks that a list of required keywords appear in the actual response.

CriterionConfig.contains_keywords(
keywords=["refund", "7 days", "policy"],
threshold=1.0, # fraction of keywords that must be present
)
  • threshold=1.0 — all keywords must be present
  • threshold=0.8 — 80% of keywords must be present

This criterion is not included in any preset because keywords are domain-specific. Add it manually to EvalConfig.criteria:

from agentflow.qa.evaluation import EvalConfig, CriterionConfig

config = EvalConfig(
criteria={
"contains_keywords": CriterionConfig.contains_keywords(
keywords=["refund", "policy"],
threshold=1.0,
),
}
)

LLM-judge criteria

These criteria send the agent's response (and optionally the context) to a language model for scoring. They are slower and cost API credits, but catch semantic errors that token-overlap metrics miss.

The default judge model is gemini-2.5-flash. Override per criterion with judge_model="gpt-4o".


response_match_score — Semantic response matching

Uses an LLM to judge whether the actual response is semantically equivalent to the expected response. Handles paraphrasing and differently-worded but correct answers.

CriterionConfig.response_match(
threshold=0.8,
judge_model="gemini-2.5-flash",
num_samples=3,
)
  • num_samples — the judge is called this many times; the final score is the average. This reduces noise from non-deterministic LLM responses.

When to use: when you want to verify the agent answered correctly without requiring an exact match. For example, "The capital of France is Paris" and "Paris is France's capital" should both pass.


llm_judge — Overall quality scoring

Asks the LLM to assign a quality score (0–1) based on relevance, helpfulness, and coherence. Does not compare against an expected response.

CriterionConfig.llm_judge(
threshold=0.8,
judge_model="gemini-2.5-flash",
num_samples=3,
)

When to use: for open-ended tasks where a fixed expected response is not practical. Combine with other criteria — llm_judge alone is too permissive.


factual_accuracy_v1 — Factual correctness

Checks whether all stated facts in the response are correct — numbers, dates, names, and verifiable claims.

CriterionConfig.factual_accuracy(
threshold=0.8,
judge_model="gemini-2.5-flash",
num_samples=3,
)

The judge looks for factual errors in the response relative to the query context. A score of 1.0 means all facts are correct.

When to use: for agents that report real-world data (stock prices, weather, historical facts, medical information).


hallucinations_v1 — Groundedness detection

Checks whether the response is grounded in information the agent actually retrieved (tool results, provided context). A score of 1.0 means fully grounded; 0.0 means the agent invented information.

CriterionConfig.hallucination(
threshold=0.8,
judge_model="gemini-2.5-flash",
num_samples=3,
)

The evaluator provides both the tool results collected during execution and the final response to the judge. The judge then assesses whether claims in the response are supported by the evidence.

When to use: for RAG agents, research agents, or any agent that presents retrieved information as facts.


safety_v1 — Safety scoring

Scores the response across five safety dimensions:

DimensionWhat it checks
harmful_contentInstructions or encouragement for harm
hate_speechDiscriminatory or dehumanising language
privacyExposure of personal information
misinformationDeliberately false or misleading claims
manipulationPsychological manipulation or deception

The overall score is the minimum across all five dimensions. A single unsafe category fails the criterion.

CriterionConfig.safety(
threshold=0.8,
judge_model="gemini-2.5-flash",
num_samples=3,
)

When to use: for any customer-facing agent, especially in regulated industries (healthcare, finance, legal).


rubric_based — Custom rubric scoring

Define your own evaluation criteria as free-text rubrics. The LLM scores the response against each rubric and returns a weighted average.

from agentflow.qa.evaluation import CriterionConfig, Rubric

CriterionConfig.rubric_based(
rubrics=[
Rubric(
rubric_id="concise",
content="The response must be under 100 words and get to the point immediately.",
weight=1.0,
),
Rubric(
rubric_id="empathetic",
content="The response must acknowledge the user's frustration before offering a solution.",
weight=0.5,
),
],
threshold=0.8,
judge_model="gemini-2.5-flash",
)

The final score is the weighted average of all rubric scores. weight controls relative importance — higher weight means the rubric has more influence on the outcome.

When to use: for product-specific quality requirements that generic criteria cannot express (tone, brand voice, compliance language, format requirements).


CriterionConfig reference

All criteria are configured with CriterionConfig. You can also construct it directly instead of using the factory class methods:

from agentflow.qa.evaluation import CriterionConfig, MatchType

config = CriterionConfig(
threshold=0.8,
match_type=MatchType.IN_ORDER,
judge_model="gpt-4o",
num_samples=3,
rubrics=[],
keywords=[],
check_args=False,
enabled=True,
)
FieldDefaultDescription
threshold0.8Minimum score to pass (0.0 – 1.0)
match_typeEXACTMatch type for trajectory criteria
judge_modelgemini-2.5-flashLLM used for judge-based criteria
num_samples3Number of judge calls (average reduces noise)
rubrics[]Custom rubrics for rubric_based
keywords[]Required keywords for contains_keywords
check_argsFalseVerify tool arguments in trajectory matching
enabledTrueSet to False to disable without removing

EvalConfig — combining criteria

Wrap criteria in an EvalConfig to run them together:

from agentflow.qa.evaluation import EvalConfig, CriterionConfig, MatchType

config = EvalConfig(
criteria={
"tool_trajectory_avg_score": CriterionConfig.trajectory(
threshold=1.0,
match_type=MatchType.EXACT,
),
"rouge_match": CriterionConfig.rouge_match(threshold=0.5),
"hallucinations_v1": CriterionConfig.hallucination(threshold=0.8),
},
parallel=True,
max_concurrency=4,
timeout=120.0,
)
EvalConfig fieldDefaultDescription
criteria{}Dict of criterion key → CriterionConfig
parallelFalseRun cases concurrently
max_concurrency4Max parallel cases when parallel=True
timeout300.0Seconds per case before timeout
verboseFalseExtra logging during evaluation
mock_modeFalseSkip actual execution (for config testing)

Disabling criteria

config.disable_criterion("rouge_match")   # sets enabled=False
config.enable_criterion("rouge_match") # re-enables with existing config
config.enable_criterion("rouge_match", CriterionConfig.rouge_match(0.4)) # re-enable with new config

Next steps

  • Presets — ready-made configs for common scenarios
  • Reports — how scores are reported
  • Eval sets — building test cases