AI Agent Memory in Python: Short-Term, Long-Term, and Checkpointing
"Memory" is doing a lot of work in agent literature. It can mean: chat history within a session, long-term semantic recall across sessions, persistent state for resumable runs, or all three. They are different problems with different solutions.
Here is the practical taxonomy and how to wire each one in Python.
Three kinds of memory, three tools
| Kind | Lifespan | Mechanism |
|---|---|---|
| Short-term (chat history) | Within one thread | Checkpointer + thread_id |
| Long-term (semantic recall) | Across users / time | Vector store accessed as a tool |
| Resumable state (interrupts, replays) | Until cleanup | Checkpointer snapshot per node |
Most teams need all three. Most "memory" libraries solve one and leave you to assemble the rest.
Short-term memory: threads and checkpointers
A thread is a single conversation. AgentFlow snapshots the entire graph state after every node, keyed by thread_id:
```python
from agentflow.core.graph import Agent, StateGraph
from agentflow.core.state import AgentState, Message
from agentflow.storage.checkpointer import InMemoryCheckpointer
from agentflow.utils import END

agent = Agent(
    model="google/gemini-2.5-flash",
    system_prompt=[{"role": "system", "content": "Helpful assistant."}],
)

graph = StateGraph(AgentState)
graph.add_node("assistant", agent)
graph.set_entry_point("assistant")
graph.add_edge("assistant", END)

app = graph.compile(checkpointer=InMemoryCheckpointer())

# Turn 1
app.invoke(
    {"messages": [Message.text_message("My name is Alex.")]},
    config={"thread_id": "user-42"},
)

# Turn 2 — same thread_id pulls the prior history
app.invoke(
    {"messages": [Message.text_message("What's my name?")]},
    config={"thread_id": "user-42"},
)
```
Key properties:
- No code changes when you swap the storage backend
- Same `thread_id` = same conversation context
- Automatic. You do not assemble the prompt yourself
Production storage
InMemoryCheckpointer is for development. Switch to Postgres + Redis with one import:
```python
from agentflow.storage.checkpointer import PgCheckpointer

checkpointer = PgCheckpointer(
    db_url="postgresql+asyncpg://user:password@localhost/agentflow",
    redis_url="redis://localhost:6379/0",
)

app = graph.compile(checkpointer=checkpointer)
```
Postgres holds the durable history; Redis is used for fast hot-path access. See checkpointing concept and the production checkpointing guide for tuning.
Trimming long threads
Threads grow. At ~100 turns, you start paying for tokens you do not need. Two common patterns:
- Sliding window. Keep last N turns, drop the rest
- Summary head. Every 20 turns, replace the oldest 15 with a summary message
```python
agent = Agent(
    model="google/gemini-2.5-flash",
    system_prompt=[{"role": "system", "content": "Helpful assistant."}],
    trim_context=True,  # AgentFlow's built-in trimming
)
```
For custom strategies, see State and messages.
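As a rough sketch of the sliding-window pattern, a custom trimmer can be a plain function over the message list. The function below is illustrative, not AgentFlow API; it assumes each message exposes a `role` attribute:

```python
# Illustrative sliding-window trimmer; not part of AgentFlow's API.
# Assumes each message has a `role` attribute ("system", "user", "assistant", ...).
def sliding_window(messages: list, keep_last: int = 20) -> list:
    """Keep the system prompt plus the most recent `keep_last` messages."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    return system + rest[-keep_last:]
```

A summary head works the same way, except the dropped prefix is replaced by a single summary message produced with a cheap model call.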
Long-term memory: semantic recall
Threads cover "what did this user say earlier?" They do not cover "what did any user say about pricing in the last 90 days?" or "what did this user prefer 6 months ago?"
For that, expose a vector store as a tool:
```python
from agentflow.core.graph import Agent, StateGraph, ToolNode

def remember(query: str) -> str:
    """Search long-term memory for relevant past interactions."""
    # qdrant / pgvector / pinecone / your choice
    hits = vector_client.search(query, top_k=3)
    return "\n".join(f"[{h.id}] {h.text}" for h in hits)

def write_to_memory(text: str, tags: str) -> str:
    """Store a notable fact for future recall."""
    vector_client.upsert(text=text, metadata={"tags": tags})
    return "Saved."

tool_node = ToolNode([remember, write_to_memory])

agent = Agent(
    model="google/gemini-2.5-flash",
    system_prompt=[{"role": "system", "content": (
        "Use `remember` to recall past info and `write_to_memory` "
        "to save important new facts. Cite memory IDs."
    )}],
    tool_node="TOOL",
)
```
The agent decides when to recall and when to write. This is materially different from naively dumping the entire vector index into context. It scales and stays auditable.
For a runnable example with Qdrant, see the qdrant-memory tutorial.
When to write vs when to query
A useful default:
- Always query at the start of a conversation if the user is identified
- Write when the agent encounters a stable fact (preference, identity, decision). Not raw chat
- Tag aggressively so retrieval can filter by user, tenant, or topic
This keeps the long-term store as a curated knowledge base, not a chat dump.
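A minimal sketch of that default, reusing the placeholder `vector_client` from the tool example above. The `filter` argument and the `expires_at` field are illustrative assumptions, not a specific client's API; adapt them to whatever your vector store supports:

```python
import time

# Illustrative only: `vector_client`, the `filter` argument, and `expires_at`
# are placeholders, not a specific vector store's API.
def write_fact(text: str, user_id: str, tenant: str, topic: str) -> None:
    """Store a curated fact with enough metadata to filter on later."""
    vector_client.upsert(
        text=text,
        metadata={
            "user_id": user_id,
            "tenant": tenant,
            "topic": topic,
            "expires_at": time.time() + 90 * 24 * 3600,  # 90-day TTL
        },
    )

def recall_for_user(query: str, user_id: str, tenant: str) -> str:
    """Query one shared index, scoped to this user and tenant."""
    hits = vector_client.search(
        query,
        top_k=3,
        filter={"user_id": user_id, "tenant": tenant},
    )
    return "\n".join(f"[{h.id}] {h.text}" for h in hits)
```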
Resumable runs: checkpointing for interrupts and replays
The same checkpointer that powers thread history also powers two other features:
Interrupt and resume
Pause the graph mid-flow, get human approval, then resume:
```python
result = app.invoke(
    {"messages": [Message.text_message("Refund order #123 for $400.")]},
    config={"thread_id": "approval-1"},
)
# Graph paused at APPROVE node; surface state to a human

# Later — same thread_id resumes from the checkpoint
app.invoke(
    {"approval": True},
    config={"thread_id": "approval-1"},
)
```
Time travel / replays
Because every node-boundary state is snapshotted, you can replay a thread from any prior state. Useful for debugging "why did the agent choose tool X here?" without rerunning the whole conversation.
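The exact replay entry point depends on your checkpointer. The sketch below uses hypothetical names (`list_checkpoints`, `checkpoint_id`) to show the shape of the workflow, not confirmed AgentFlow API:

```python
# Hypothetical names throughout: list_checkpoints() and the checkpoint_id
# config key are illustrative, not confirmed AgentFlow API.
snapshots = checkpointer.list_checkpoints(thread_id="user-42")

# Inspect the state just before the tool call you are debugging
suspect = snapshots[3]
print(suspect.state["messages"][-1])

# Re-run from that snapshot instead of replaying the whole conversation
app.invoke(
    {"messages": []},
    config={"thread_id": "user-42", "checkpoint_id": suspect.id},
)
```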
See the interrupts-and-resume tutorial for the details.
Putting it together: a memory architecture
A production agent typically wires:
```
┌────────────────────────────────────────────┐
│ Agent (LLM)                                │
│ ├── short-term: PgCheckpointer (thread)    │
│ ├── long-term: vector tool (Qdrant)        │
│ └── resumable: same PgCheckpointer         │
└────────────────────────────────────────────┘
```
Each layer has a different SLA:
| Layer | Latency budget | Storage |
|---|---|---|
| Short-term | < 50 ms | Redis (cached) + Postgres |
| Long-term | < 200 ms | Vector DB (Qdrant / pgvector) |
| Resumable | < 1 s | Postgres |
These are budgets, not guarantees. The checkpointer is on the hot path; the vector store is opt-in per query. Treat them differently when you scale.
Common mistakes
- Storing everything in the vector store. It becomes noise. Curate.
- Skipping `thread_id` in dev. Then surprised it doesn't remember in prod. Always pass `thread_id` from day one.
- One vector index per user. Hard to maintain. Use one index with `user_id` metadata and filter at query time.
- No expiry policy. Memory is a feature; stale memory is a bug. Add TTL on long-term entries.
- Treating chat history as long-term memory. They are not the same. The full chat goes in the checkpointer; facts extracted from chat go in the vector store.
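One way to keep that last split honest is a small post-turn hook: the full exchange stays in the checkpointer automatically, and only extracted facts are written to the vector store. The sketch below is illustrative; `extract_facts` stands in for a cheap model call that returns a list of stable facts, `write_to_memory` is the tool from the long-term memory example, and messages are assumed to expose `role` and `content`:

```python
# Illustrative post-turn hook: extract_facts() is a stand-in for a cheap model
# call that returns a list of stable facts (possibly empty). write_to_memory
# is the tool defined in the long-term memory example above.
def after_turn(messages) -> None:
    """Persist facts, not transcripts: chat stays in the checkpointer,
    extracted facts go to the vector store."""
    recent = messages[-2:]  # the latest user/assistant exchange
    transcript = "\n".join(f"{m.role}: {m.content}" for m in recent)
    for fact in extract_facts(transcript):  # e.g. "prefers email over phone"
        write_to_memory(fact, tags="auto-extracted")
```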
Further reading
- Checkpointing and threads. Concept guide
- Memory and store. Full memory architecture
- Add memory tutorial. Beginner walkthrough
- Qdrant memory example. Long-term recall
- Production checkpointing. Postgres + Redis tuning
When you are ready, head to Get started; you will have working short-term memory in under five minutes.