Build a RAG AI agent in Python: done right
"Chat with my docs" is the most-prototyped, often-shipped, frequently-broken AI use case. Most teams ship a single retrieve-and-generate call, hit quality limits, and stall.
This page shows the production-shaped pattern: agentic RAG with hybrid retrieval, reranking, and forced citations.
Architecture at a glance
[ User question ]
│
▼
[ Query understanding ] ── decompose into subqueries
│
▼
[ Retrieval loop ]
├── hybrid_search (dense + keyword)
├── rerank (top-20 → top-5)
└── enough?
│
▼
[ Synthesize with citations ]
│
▼
[ Validator: citations present, low hallucination ]
This is agentic RAG. The agent decides what to query, when to dig deeper, and when it has enough. It outperforms naive RAG on multi-hop questions and ambiguous queries.
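Stripped of the LLM calls, the loop above is just: decompose, retrieve, check sufficiency, repeat. A minimal sketch (the `search` and `synthesize` callables stand in for real retrieval and generation, and a real agent would propose follow-up subqueries instead of stopping when nothing new turns up):

```python
def answer(question, search, synthesize, max_rounds=3):
    """Minimal retrieval loop: gather evidence until a round adds
    nothing new or the round budget runs out, then synthesize."""
    subqueries = [question]  # query understanding would decompose here
    evidence = []
    for _ in range(max_rounds):
        new_hits = [h for q in subqueries for h in search(q) if h not in evidence]
        if not new_hits:
            break  # the "enough?" check: no new evidence this round
        evidence.extend(new_hits)
        # a real agent would generate follow-up subqueries here
    return synthesize(question, evidence)
```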
Why this shape, not naive RAG
Naive RAG looks like:
context = vector_store.search(user_question)
response = llm.complete(f"Answer using context:\n{context}\n\nQ: {user_question}")
It works for "What does our pricing page say about enterprise?" It fails on:
- "Compare our pricing to AcmeCorp's". Needs two retrievals
- "What changed in our refund policy last quarter?". Needs filtering
- "Do our docs explain how to set up SSO with Okta specifically?". Needs follow-up retrieval
Agentic RAG handles all three.
The retrieval tools
import json

def hybrid_search(query: str, filters: str = "") -> str:
    """Search the knowledge base using both semantic and keyword retrieval.

    Args:
        query: Natural-language search query.
        filters: Optional JSON object with metadata filters, e.g. '{"product": "billing"}'.
    """
    f = json.loads(filters) if filters else {}
    dense = vector_client.search(query, top_k=10, filters=f)
    sparse = bm25_client.search(query, top_k=10, filters=f)
    # Reciprocal rank fusion, then rerank down to the 5 best hits
    fused = fuse_results(dense, sparse, k=20)
    reranked = rerank_client.rerank(query, fused, top_k=5)
    return "\n\n".join(
        f"[doc-{h.id}] {h.title}\n{h.text[:400]}" for h in reranked
    ) or "No matching documents."

def fetch_doc(doc_id: str) -> str:
    """Fetch the full content of a document by ID. Use after hybrid_search to read deeply."""
    doc = doc_store.get(doc_id)
    return doc.full_text[:15_000] if doc else f"Doc {doc_id} not found."
Notes:
- Hybrid > pure dense. Dense retrieval misses exact-match queries; BM25 misses semantic queries. Combine.
- Reranking is cheap and worth it. Cohere Rerank or a small cross-encoder. Top-5 reranked beats top-20 raw.
- Filters in the tool signature. Lets the agent narrow by product, date, or audience.
- Two-step retrieval. Search first, fetch full content only for the most promising hits.
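The `fuse_results` helper used in `hybrid_search` is reciprocal rank fusion. A minimal version, assuming results are represented by stable document IDs (the tool code above passes hit objects; adapt the keying accordingly):

```python
def fuse_results(dense, sparse, k=20, c=60):
    """Reciprocal rank fusion: score each doc by the sum of 1/(c + rank)
    across both result lists, then keep the top-k by fused score.
    c=60 is the conventional smoothing constant from the RRF paper."""
    scores = {}
    for results in (dense, sparse):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

A doc ranked mid-list by both retrievers outscores a doc ranked high by only one, which is exactly the behavior you want from hybrid search.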
The agent
from agentflow.core.graph import Agent, StateGraph, ToolNode

tool_node = ToolNode([hybrid_search, fetch_doc])

rag_agent = Agent(
    model="anthropic/claude-3-5-sonnet",
    system_prompt=[{"role": "system", "content": (
        "You answer questions using the company knowledge base. "
        "ALWAYS use hybrid_search first. Use fetch_doc to read full content for key sources. "
        "If the question has multiple parts, search for each separately. "
        "Cite every fact with [doc-N] from the search results. "
        "If you cannot find the answer, say so. Do not invent details."
    )}],
    tool_node=tool_node,
)
Indexing pipeline
The agent only works if the index is good. A quick checklist:
- Chunk by structure, not fixed size. Markdown headers, HTML sections, paragraph breaks. See chunking and retrieval primitives.
- Add metadata. Product, audience, last-modified date. Filters need them.
- Both dense and sparse. Index for vector + BM25 from day one.
- Reindex on doc changes. A daily diff is usually enough.
For an end-to-end indexing example with Qdrant, see the qdrant-memory tutorial.
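Structure-aware chunking is mostly regex work. A sketch for markdown sources (splits at `#`–`###` headings, keeping each heading with the body that follows it; heavier formats like HTML need a real parser):

```python
import re

def chunk_markdown(text, max_heading_level=3):
    """Split markdown into chunks at headings, one chunk per section."""
    pattern = re.compile(rf"^(#{{1,{max_heading_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    chunks = []
    if text[:starts[0]].strip():
        chunks.append(text[:starts[0]].strip())  # preamble before first heading
    bounds = starts + [len(text)]
    for a, b in zip(bounds, bounds[1:]):
        chunks.append(text[a:b].strip())
    return chunks
```

Oversized sections can then be sub-split at paragraph breaks to stay inside the 200–500 token target.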
When to call a human
If the agent searches and finds nothing relevant, do not let it freelance an answer. Two patterns:
- Surface the gap to the user. "I couldn't find this in our docs. You can ask in #support."
- Escalate to a human agent. Especially for support flows. See the customer support use case.
Operational notes
- Cache by canonical query. Multiple users asking the same thing should hit cache.
- Per-tenant indexes for multi-tenant SaaS. Filtering by `tenant_id` is fine for small tenants; separate indexes for large ones.
- Freshness. Stale docs are worse than no docs. Track last-indexed-at and warn when a doc is > 30 days old.
- Cost. Reranking adds latency and cost; budget about 100ms and $0.001 per query.
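Caching by canonical query means normalizing before hashing, so trivially different phrasings hit the same cache entry. A sketch (`canonical_key` is a hypothetical helper name; the tenant prefix keeps multi-tenant caches isolated):

```python
import hashlib
import re

def canonical_key(question, tenant_id=""):
    """Build a cache key: lowercase, strip punctuation, collapse
    whitespace, then hash, prefixed by tenant for isolation."""
    q = re.sub(r"[^\w\s]", "", question.lower())
    q = " ".join(q.split())
    return f"{tenant_id}:" + hashlib.sha256(q.encode()).hexdigest()
```

This only catches surface-level rephrasings; semantic cache hits ("How do I reset my password" vs "password reset steps") need an embedding-similarity lookup instead.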
Variants
- Internal knowledge base. Confluence, Notion, Slack archives
- Customer-facing docs. Your public documentation site
- Support deflection. Old tickets + KB articles
- Research over a corpus. Papers, contracts, reports
Same shape. Different documents and different filters.
Metrics that matter
| Metric | Target |
|---|---|
| Recall@5 (manual eval) | > 80% |
| Citation coverage | > 95% of claims |
| Hallucination rate | < 3% |
| p95 latency | < 6 s |
| Cost per query | < $0.05 |
Common mistakes
- One giant chunk. Chunks > 1500 tokens lose specificity. Aim for 200–500 tokens with overlap.
- No reranking. Top-5 from hybrid search is usually OK; top-5 from dense alone misses too much.
- No metadata filters. Without `product`, `audience`, and `tenant`, the index is a soup.
- Skipping the validator. Citations are the firewall against hallucination. Enforce them.
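The validator from the architecture diagram can start as simple string checks: every sentence should carry a `[doc-N]` marker, and every cited ID should correspond to a document retrieval actually returned. A rough sketch (sentence splitting on punctuation is approximate; an LLM-as-judge pass can layer on top):

```python
import re

def validate_citations(answer, known_doc_ids):
    """Report citation coverage, citations of unknown docs, and
    sentences that carry no citation at all."""
    cited = set(re.findall(r"\[doc-([\w-]+)\]", answer))
    unknown = cited - set(known_doc_ids)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[doc-[\w-]+\]", s)]
    coverage = 1 - len(uncited) / len(sentences) if sentences else 0.0
    return {"coverage": coverage, "unknown_citations": sorted(unknown), "uncited": uncited}
```

Wire the coverage number straight into the "Citation coverage > 95%" metric above, and reject or retry answers that cite documents retrieval never returned.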