OpenAI

Run GPT-class models (gpt-4o, gpt-4o-mini) and reasoning models (o1, o3, o4-mini) through the OpenAI API.

Setup

Get an API key from platform.openai.com and export it:

export OPENAI_API_KEY="sk-..."

Or add it to a .env file:

OPENAI_API_KEY=sk-...

Basic usage

from agentflow.core.graph import Agent

agent = Agent(
    model="gpt-4o",
    provider="openai",
    system_prompt=[{"role": "system", "content": "You are a helpful assistant."}],
)

Full example with tools

from dotenv import load_dotenv

from agentflow.core.state import AgentState, Message
from agentflow.prebuilt.agent import ReactAgent

load_dotenv()


def get_weather(
    location: str,
    tool_call_id: str | None = None,
    state: AgentState | None = None,
) -> str:
    return f"The weather in {location} is sunny."


react_agent = ReactAgent(
    model="gpt-4o",
    provider="openai",
    system_prompt=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use tools when they help answer the user.",
        }
    ],
    tools=[get_weather],
    trim_context=True,
)

if __name__ == "__main__":
    app = react_agent.compile()

    result = app.invoke(
        {"messages": [Message.text_message("What is the weather in New York City?")]},
        config={"thread_id": "openai-demo", "recursion_limit": 10},
    )

    for message in result["messages"]:
        print(message.role, message)

API Style

OpenAI exposes two distinct APIs for text generation. AgentFlow supports both.

`api_style`	Underlying call	When to use
`"chat"` (default)	`client.chat.completions.create`	All GPT and O-series models. Default for the Agent class.
`"responses"`	`client.responses.create`	Newer Responses API. Default for `SummaryContextManager` and the evaluation judge.

Agent

from agentflow.core.graph import Agent

# Default — Chat Completions
agent = Agent(model="gpt-4o", system_prompt=[...])

# Opt into the Responses API
agent = Agent(model="gpt-4o", api_style="responses", system_prompt=[...])

SummaryContextManager

from agentflow.core.state import SummaryContextManager

# Default is "responses" for the context manager
manager = SummaryContextManager(model="gpt-4o-mini", token_budget=8000)

# Older or third-party-hosted models that only support Chat Completions
manager = SummaryContextManager(
    model="gpt-4o-mini",
    api_style="chat",
    token_budget=8000,
)

Evaluation judge

The evaluation judge reads api_style from CriterionConfig.api_style (defaults to "responses"). Set it per-criterion when using a model that only supports Chat Completions:

from agentflow.qa.evaluation import CriterionConfig, EvalConfig, CriteriaConfig

config = EvalConfig(
    criteria=CriteriaConfig(
        llm_judge=CriterionConfig.llm_judge(
            judge_model="gpt-4o",
            api_style="chat",   # override if needed
        )
    )
)

Prompt Caching

OpenAI caches the prompt prefix automatically — no code changes required. Cache hits are reported back in usage.input_tokens_details.cached_tokens and logged at DEBUG level by AgentFlow. You only need to act if you want to improve hit rates.

How it works: OpenAI hashes the first N tokens of your request (system prompt + conversation history + tool definitions). Requests sharing an identical prefix are routed to a server that already has the KV cache in GPU memory. You always send the full prompt; OpenAI serves the cached computation.

Minimum size: 1,024 tokens. Requests below this threshold report zero cached tokens.

TTL: 5-10 minutes of inactivity (in-memory, volatile). Extended to 24 hours for gpt-5.5 and select gpt-5.x models with prompt_cache_retention="24h".

`prompt_cache_key`

When multiple Agent instances (or multiple processes) share the same long system prompt, pass a stable key to colocate them on the same cached server and raise the hit rate.

agent = Agent(
    model="gpt-4o",
    system_prompt=[{"role": "system", "content": very_long_prompt}],
    prompt_cache_key="legal-analyst-v2",  # passed through llm_kwargs
)

This is an OpenAI request-level parameter forwarded directly through llm_kwargs. It is not in CALL_EXCLUDED_KWARGS so it reaches the API unchanged.

`prompt_cache_retention`

Extends the cache TTL to 24 hours. Only effective on gpt-5.5 and select gpt-5.x models.

agent = Agent(
    model="gpt-5.5",
    system_prompt=[...],
    prompt_cache_key="assistant-v1",
    prompt_cache_retention="24h",
)

SummaryContextManager with caching

SummaryContextManager uses call_llm internally. Pass prompt_cache_key via **llm_kwargs — it is threaded through to the underlying OpenAI call.

SummaryContextManager does not currently accept **llm_kwargs directly. If you need cache keys on the summariser, subclass it or open an issue.

Evaluation judge with caching

The evaluation judge also calls call_llm. Extra kwargs are not yet forwarded from CriterionConfig to call_llm. The implicit cache still fires automatically when the judge prompt prefix is stable (same rubric, same model).

Reasoning Models

o1, o3, and o4-mini support extended thinking. Controlled via reasoning_config.

# Enable with defaults (effort="medium")
agent = Agent(model="o4-mini", reasoning_config=True)

# Set effort level
agent = Agent(
    model="o4-mini",
    reasoning_config={"effort": "high"},  # "low" | "medium" | "high"
)

# Disable (useful when falling back to a non-reasoning model)
agent = Agent(model="o4-mini", reasoning_config=False)

reasoning_config is not applicable to gpt-* models. The Agent ignores it for models that do not support extended thinking.

Structured Output

Force the model to return a Pydantic model by passing output_schema. The Agent routes this through beta.chat.completions.parse regardless of api_style.

from pydantic import BaseModel

class MyOutput(BaseModel):
    answer: str
    confidence: float

agent = Agent(
    model="gpt-4o",
    output_schema=MyOutput,
    system_prompt=[...],
)

Caching still applies to the beta.chat.completions.parse path — cache hits are logged the same way.

OpenAI-Compatible Endpoints

Any OpenAI-compatible server (ollama, vllm, LM Studio, etc.) can be used by setting base_url:

agent = Agent(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_style="chat",  # most local servers only support Chat Completions
)

When base_url is set and api_style="responses", the Agent tries the Responses API first and falls back to Chat Completions automatically if the server does not support it.

Prompt caching params (prompt_cache_key, prompt_cache_retention) have no effect on local servers unless the server explicitly implements them.

`llm_kwargs` Reference

All unrecognised keyword arguments passed to Agent(...) land in self.llm_kwargs and are forwarded to the underlying API call.

kwarg	Type	Applies to	Notes
`prompt_cache_key`	`str`	Chat + Responses	Improves cross-request cache hit rate
`prompt_cache_retention`	`"in_memory"` / `"24h"`	Chat + Responses	24h only on gpt-5.5+
`temperature`	`float`	Chat + Responses	Sampling temperature (0.0–2.0)
`max_tokens`	`int`	Chat	Max output tokens
`max_output_tokens`	`int`	Responses	Max output tokens (Responses API name)
`reasoning_effort`	`str`	Reasoning models	`"low"` / `"medium"` / `"high"`
`top_p`	`float`	Chat + Responses	Nucleus sampling
`frequency_penalty`	`float`	Chat	Penalise repeated tokens
`presence_penalty`	`float`	Chat	Penalise already-seen tokens

Keys in CALL_EXCLUDED_KWARGS (organization, project, timeout, max_retries, default_headers, default_query, http_client, api_key, base_url) are stripped before the request is sent and must be passed to the client constructor instead.

Environment Variables

Variable	Required	Description
`OPENAI_API_KEY`	yes	API key from platform.openai.com

Common Errors

Error	Fix
`AuthenticationError`	`OPENAI_API_KEY` missing or invalid
`RateLimitError`	You hit a rate limit — enable retries via `retry_config=True`
`Model not found`	Check the model name; some models require tier-gated access

Setup​

Basic usage​

Full example with tools​

API Style​

Agent​

SummaryContextManager​

Evaluation judge​

Prompt Caching​

prompt_cache_key​

prompt_cache_retention​

SummaryContextManager with caching​

Evaluation judge with caching​

Reasoning Models​

Structured Output​

OpenAI-Compatible Endpoints​

llm_kwargs Reference​

Environment Variables​

Common Errors​