Build a RAG AI agent in Python: done right
"Chat with my docs" is the most-prototyped, often-shipped, frequently-broken AI use case. Most teams ship a single retrieve-and-generate call, hit quality limits, and stall.
This page shows the production-shaped pattern: agentic RAG with hybrid retrieval, reranking, and forced citations.
Architecture at a glance
[ User question ]
│
▼
[ Query understanding ] ── decompose into subqueries
│
▼
[ Retrieval loop ]
├── hybrid_search (dense + keyword)
├── rerank (top-20 → top-5)
└── enough?
│
▼
[ Synthesize with citations ]
│
▼
[ Validator: citations present, low hallucination ]
This is agentic RAG. The agent decides what to query, when to dig deeper, and when it has enough. It outperforms naive RAG on multi-hop questions and ambiguous queries.
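Stripped of the LLM calls, the loop above is just: decompose, retrieve, check sufficiency, repeat. A minimal sketch (the `search` and `synthesize` callables stand in for real retrieval and generation, and a real agent would propose follow-up subqueries instead of stopping when nothing new turns up):

```python
def answer(question, search, synthesize, max_rounds=3):
    """Minimal retrieval loop: gather evidence until a round adds
    nothing new or the round budget runs out, then synthesize."""
    subqueries = [question]  # query understanding would decompose here
    evidence = []
    for _ in range(max_rounds):
        new_hits = [h for q in subqueries for h in search(q) if h not in evidence]
        if not new_hits:
            break  # the "enough?" check: no new evidence this round
        evidence.extend(new_hits)
        # a real agent would generate follow-up subqueries here
    return synthesize(question, evidence)
```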
Why this shape, not naive RAG
Naive RAG looks like:
context = vector_store.search(user_question)
response = llm.complete(f"Answer using context:\n{context}\n\nQ: {user_question}")
It works for "What does our pricing page say about enterprise?" It fails on:
- "Compare our pricing to AcmeCorp's". Needs two retrievals
- "What changed in our refund policy last quarter?". Needs filtering
- "Do our docs explain how to set up SSO with Okta specifically?". Needs follow-up retrieval
Agentic RAG handles all three.
The retrieval tools
import json

def hybrid_search(query: str, filters: str = "") -> str:
    """Search the knowledge base using both semantic and keyword retrieval.

    Args:
        query: Natural-language search query.
        filters: Optional JSON object with metadata filters, e.g. '{"product": "billing"}'.
    """
    f = json.loads(filters) if filters else {}
    dense = vector_client.search(query, top_k=10, filters=f)
    sparse = bm25_client.search(query, top_k=10, filters=f)
    # Reciprocal rank fusion, then rerank down to the 5 best hits
    fused = fuse_results(dense, sparse, k=20)
    reranked = rerank_client.rerank(query, fused, top_k=5)
    return "\n\n".join(
        f"[doc-{h.id}] {h.title}\n{h.text[:400]}" for h in reranked
    ) or "No matching documents."

def fetch_doc(doc_id: str) -> str:
    """Fetch the full content of a document by ID. Use after hybrid_search to read deeply."""
    doc = doc_store.get(doc_id)
    return doc.full_text[:15_000] if doc else f"Doc {doc_id} not found."
Notes:
- Hybrid > pure dense. Dense retrieval misses exact-match queries; BM25 misses semantic queries. Combine.
- Reranking is cheap and worth it. Cohere Rerank or a small cross-encoder. Top-5 reranked beats top-20 raw.
- Filters in the tool signature. Lets the agent narrow by product, date, or audience.
- Two-step retrieval. Search first, fetch full content only for the most promising hits.
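The `fuse_results` helper used in `hybrid_search` is reciprocal rank fusion. A minimal version, assuming results are represented by stable document IDs (the tool code above passes hit objects; adapt the keying accordingly):

```python
def fuse_results(dense, sparse, k=20, c=60):
    """Reciprocal rank fusion: score each doc by the sum of 1/(c + rank)
    across both result lists, then keep the top-k by fused score.
    c=60 is the conventional smoothing constant from the RRF paper."""
    scores = {}
    for results in (dense, sparse):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

A doc ranked mid-list by both retrievers outscores a doc ranked high by only one, which is exactly the behavior you want from hybrid search.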
The agent
from agentflow.core.graph import Agent, StateGraph, ToolNode

tool_node = ToolNode([hybrid_search, fetch_doc])

rag_agent = Agent(
    model="anthropic/claude-3-5-sonnet",
    system_prompt=[{"role": "system", "content": (
        "You answer questions using the company knowledge base. "
        "ALWAYS use hybrid_search first. Use fetch_doc to read full content for key sources. "
        "If the question has multiple parts, search for each separately. "
        "Cite every fact with [doc-N] from the search results. "
        "If you cannot find the answer, say so. Do not invent details."
    )}],
    tool_node=tool_node,
)
Indexing pipeline
The agent only works if the index is good. A quick checklist:
- Chunk by structure, not fixed size. Markdown headers, HTML sections, paragraph breaks. See chunking and retrieval primitives.
- Add metadata. Product, audience, last-modified date. Filters need them.
- Both dense and sparse. Index for vector + BM25 from day one.
- Reindex on doc changes. A daily diff is usually enough.
For an end-to-end indexing example with Qdrant, see the qdrant-memory tutorial.
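Structure-aware chunking is mostly regex work. A sketch for markdown sources (splits at `#`–`###` headings, keeping each heading with the body that follows it; heavier formats like HTML need a real parser):

```python
import re

def chunk_markdown(text, max_heading_level=3):
    """Split markdown into chunks at headings, one chunk per section."""
    pattern = re.compile(rf"^(#{{1,{max_heading_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    chunks = []
    if text[:starts[0]].strip():
        chunks.append(text[:starts[0]].strip())  # preamble before first heading
    bounds = starts + [len(text)]
    for a, b in zip(bounds, bounds[1:]):
        chunks.append(text[a:b].strip())
    return chunks
```

Oversized sections can then be sub-split at paragraph breaks to stay inside the 200–500 token target.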
When to call a human
If the agent searches and finds nothing relevant, do not let it freelance an answer. Two patterns:
- Surface the gap to the user. "I couldn't find this in our docs. You can ask in #support."
- Escalate to a human agent. Especially for support flows. See the customer support use case.
Operational notes
- Cache by canonical query. Multiple users asking the same thing should hit cache.
- Per-tenant indexes for multi-tenant SaaS. Filtering by `tenant_id` is fine for small tenants; separate indexes for large ones.
- Freshness. Stale docs are worse than no docs. Track last-indexed-at and warn when a doc is > 30 days old.
- Cost. Reranking adds latency and cost; budget about 100ms and $0.001 per query.
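Caching by canonical query means normalizing before hashing, so trivially different phrasings hit the same cache entry. A sketch (`canonical_key` is a hypothetical helper name; the tenant prefix keeps multi-tenant caches isolated):

```python
import hashlib
import re

def canonical_key(question, tenant_id=""):
    """Build a cache key: lowercase, strip punctuation, collapse
    whitespace, then hash, prefixed by tenant for isolation."""
    q = re.sub(r"[^\w\s]", "", question.lower())
    q = " ".join(q.split())
    return f"{tenant_id}:" + hashlib.sha256(q.encode()).hexdigest()
```

This only catches surface-level rephrasings; semantic cache hits ("How do I reset my password" vs "password reset steps") need an embedding-similarity lookup instead.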
Variants
- Internal knowledge base. Confluence, Notion, Slack archives
- Customer-facing docs. Your public documentation site
- Support deflection. Old tickets + KB articles
- Research over a corpus. Papers, contracts, reports
Same shape. Different documents and different filters.
Metrics that matter
| Metric | Target |
|---|---|
| Recall@5 (manual eval) | > 80% |
| Citation coverage | > 95% of claims |
| Hallucination rate | < 3% |
| p95 latency | < 6 s |
| Cost per query | < $0.05 |
Common mistakes
- One giant chunk. Chunks > 1500 tokens lose specificity. Aim for 200–500 tokens with overlap.
- No reranking. Top-5 from hybrid search is usually OK; top-5 from dense alone misses too much.
- No metadata filters. Without `product`, `audience`, and `tenant`, the index is a soup.
- Skipping the validator. Citations are the firewall against hallucination. Enforce them.
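The validator from the architecture diagram can start as simple string checks: every sentence should carry a `[doc-N]` marker, and every cited ID should correspond to a document retrieval actually returned. A rough sketch (sentence splitting on punctuation is approximate; an LLM-as-judge pass can layer on top):

```python
import re

def validate_citations(answer, known_doc_ids):
    """Report citation coverage, citations of unknown docs, and
    sentences that carry no citation at all."""
    cited = set(re.findall(r"\[doc-([\w-]+)\]", answer))
    unknown = cited - set(known_doc_ids)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[doc-[\w-]+\]", s)]
    coverage = 1 - len(uncited) / len(sentences) if sentences else 0.0
    return {"coverage": coverage, "unknown_citations": sorted(unknown), "uncited": uncited}
```

Wire the coverage number straight into the "Citation coverage > 95%" metric above, and reject or retry answers that cite documents retrieval never returned.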