Presets & Configuration

EvalConfig defines which criteria run and at what thresholds. EvalPresets provides ready-made configs so you can get meaningful results without writing config from scratch.

Built-in presets

All presets are factory class methods on EvalPresets. Every preset that calls an LLM accepts an optional judge_model parameter (default: "gemini-2.5-flash").

quick_check — Fastest, no LLM

Uses ROUGE-1 token overlap only. No API calls, instant feedback. Good for a smoke test during active development.

from agentflow.qa.evaluation import EvalPresets

config = EvalPresets.quick_check()

Criteria included:

rouge_match (threshold 0.5)

tool_usage — Verify tool selection

Checks that the agent called the right tools with the right arguments. No LLM required.

config = EvalPresets.tool_usage(
    threshold=1.0,
    strict=True,      # EXACT match; set False for IN_ORDER
    check_args=True,
)

Criteria included:

tool_name_match_score — tool names match
tool_trajectory_avg_score — tool sequence matches

When to use: any agent that must call specific tools (weather, search, database lookups). This is the most common first-pass eval.

response_quality — Semantic accuracy

Uses an LLM to evaluate whether the agent's response is correct and relevant.

config = EvalPresets.response_quality(
    threshold=0.7,
    use_llm_judge=True,
    judge_model="gemini-2.5-flash",
)

Criteria included:

response_match_score — semantic match with expected response
llm_judge — overall quality score (when use_llm_judge=True)

When to use: Q&A agents, FAQ bots, or any agent where response accuracy matters but exact wording varies.

conversation_flow — Multi-turn quality

Validates both response quality and tool sequencing in multi-turn conversations.

config = EvalPresets.conversation_flow(
    threshold=0.8,
    judge_model="gemini-2.5-flash",
)

Criteria included:

response_match_score — semantic match (IN_ORDER tool matching)
tool_trajectory_avg_score — tool sequence with IN_ORDER matching

When to use: agents with multi-turn dialogue where the conversation follows a predictable flow.

safety_check — Safety and hallucination

Focused on what the agent outputs rather than whether it answers correctly.

config = EvalPresets.safety_check(
    threshold=0.8,
    judge_model="gemini-2.5-flash",
)

Criteria included:

hallucinations_v1 — groundedness check
safety_v1 — harmful content, hate speech, privacy, misinformation, manipulation

When to use: customer-facing agents, regulated industries, or any agent you are about to put in production.

comprehensive — All criteria

Runs every available criterion. Use this before a major release or for a thorough regression check.

config = EvalPresets.comprehensive(
    threshold=0.8,
    use_llm_judge=True,
    judge_model="gemini-2.5-flash",
)

Criteria included (no-LLM):

tool_name_match_score
tool_trajectory_avg_score
rouge_match

Criteria included (LLM-judge):

llm_judge
factual_accuracy_v1
hallucinations_v1
safety_v1

Note: contains_keywords is not included in comprehensive because keywords are domain-specific. Add it manually if needed (see Criteria — contains_keywords).

custom — Build from parameters

EvalPresets.custom() lets you enable exactly the criteria you need by passing threshold values. Any criterion whose threshold is None is excluded.

config = EvalPresets.custom(
    response_threshold=0.7,
    tool_threshold=1.0,
    llm_judge_threshold=None,           # excluded
    hallucination_threshold=0.8,
    safety_threshold=0.8,
    factual_accuracy_threshold=None,    # excluded
    tool_match_type=MatchType.IN_ORDER,
    check_tool_args=True,
    judge_model="gpt-4o",
)

combine — Merge presets

EvalPresets.combine() merges multiple configs. Later configs overwrite earlier ones when keys conflict.

config = EvalPresets.combine(
    EvalPresets.tool_usage(threshold=1.0),
    EvalPresets.safety_check(threshold=0.8),
)

EvalConfig built-in presets

EvalConfig also ships three class-method presets:

from agentflow.qa.evaluation import EvalConfig

config = EvalConfig.default()   # tool trajectory (EXACT) + response match (ROUGE)
config = EvalConfig.strict()    # EXACT trajectory with args + high-threshold LLM judge
config = EvalConfig.relaxed()   # IN_ORDER trajectory + lower response threshold

Preset	Trajectory	Response	Extra
`default()`	EXACT, threshold 1.0	ROUGE, threshold 0.8	—
`strict()`	EXACT + args, threshold 1.0	LLM judge, threshold 0.9	LLM judge, 5 samples
`relaxed()`	IN_ORDER, threshold 0.8	LLM judge, threshold 0.6	—

Save and load config

# Save to JSON
config.to_file("eval_config.json")

# Load from JSON
config = EvalConfig.from_file("eval_config.json")

Building a custom EvalConfig from scratch

from agentflow.qa.evaluation import EvalConfig, CriterionConfig, MatchType, Rubric

config = EvalConfig(
    criteria={
        # No-LLM: tool correctness
        "tool_trajectory_avg_score": CriterionConfig.trajectory(
            threshold=1.0,
            match_type=MatchType.EXACT,
            check_args=True,
        ),

        # No-LLM: response overlap
        "rouge_match": CriterionConfig.rouge_match(threshold=0.5),

        # LLM: semantic accuracy
        "response_match_score": CriterionConfig.response_match(
            threshold=0.8,
            judge_model="gemini-2.5-flash",
            num_samples=3,
        ),

        # LLM: hallucination
        "hallucinations_v1": CriterionConfig.hallucination(threshold=0.9),

        # LLM: domain-specific rubric
        "tone_check": CriterionConfig.rubric_based(
            rubrics=[
                Rubric(
                    rubric_id="professional_tone",
                    content=(
                        "The response must use professional, formal language. "
                        "Avoid colloquialisms and slang."
                    ),
                    weight=1.0,
                )
            ],
            threshold=0.8,
        ),

        # Keyword check
        "has_disclaimer": CriterionConfig.contains_keywords(
            keywords=["consult a professional", "not financial advice"],
            threshold=1.0,
        ),
    },
    parallel=True,
    max_concurrency=4,
    timeout=120.0,
)

Add rubrics to an existing config

from agentflow.qa.evaluation import Rubric

config = config.with_rubrics([
    Rubric(rubric_id="concise", content="Response must be under 50 words.", weight=1.0),
])

Choosing criteria for your use case

Agent type	Recommended criteria
Tool-calling (search, API)	`tool_usage` preset
Q&A / FAQ	`response_quality` preset
RAG / retrieval	`response_quality` + `hallucinations_v1`
Customer support	`response_quality` + `safety_check`
Multi-turn dialogue	`conversation_flow` preset
Pre-production gate	`comprehensive` preset
CI fast check	`quick_check` preset

Start fast (no-LLM criteria) and add LLM-judge criteria incrementally as your eval suite matures.

Next steps

Reports — how scores are presented in HTML, JSON, and JUnit XML
Criteria reference — full details on each criterion
Eval sets — building test cases

Built-in presets​

quick_check — Fastest, no LLM​

tool_usage — Verify tool selection​

response_quality — Semantic accuracy​

conversation_flow — Multi-turn quality​

safety_check — Safety and hallucination​

comprehensive — All criteria​

custom — Build from parameters​

combine — Merge presets​

EvalConfig built-in presets​

Save and load config​

Building a custom EvalConfig from scratch​

Add rubrics to an existing config​

Choosing criteria for your use case​

Next steps​

Built-in presets

quick_check — Fastest, no LLM

tool_usage — Verify tool selection

response_quality — Semantic accuracy

conversation_flow — Multi-turn quality

safety_check — Safety and hallucination

comprehensive — All criteria

custom — Build from parameters

combine — Merge presets

EvalConfig built-in presets

Save and load config

Building a custom EvalConfig from scratch

Add rubrics to an existing config

Choosing criteria for your use case

Next steps