
Build a data extraction AI agent in Python

Pulling structured data out of unstructured text (invoices, contracts, support tickets, scraped pages) is one of the highest-ROI uses of LLMs. The trick is making the output reliable enough to write to a database.

Here is the production architecture.

Architecture at a glance

[ Document text ]
        │
        ▼
[ Schema-aware LLM call ] ── enforces JSON shape
        │
        ▼
[ Validator ] ── pydantic / jsonschema
        │
        ├─→ valid?   → write to DB
        └─→ invalid? → retry with error feedback (max 3)

This is mostly a workflow with a single agent loop for retries. See agents vs workflows.
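The retry loop in the diagram can be sketched framework-free. Here `call_llm` and `validate` are hypothetical stand-ins injected as parameters, not real APIs; the toy stubs below exist only to make the sketch runnable:

```python
MAX_RETRIES = 3

def extract_with_retries(document: str, call_llm, validate) -> dict:
    """Framework-free sketch of the diagram above."""
    feedback = ""
    for _ in range(MAX_RETRIES):
        raw = call_llm(document, feedback)
        errors = validate(raw)           # [] when the payload is valid
        if not errors:
            return raw                   # "write to DB" would happen here
        feedback = "; ".join(errors)     # retry with error feedback
    raise ValueError("extraction failed after retries")

# Toy stand-ins: the fake LLM only returns a total once it sees feedback.
def fake_llm(doc, feedback):
    return {"total": "19.99"} if feedback else {}

def fake_validate(payload):
    return [] if "total" in payload else ["missing field: total"]

result = extract_with_retries("Invoice ...", fake_llm, fake_validate)
```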

Why this shape

  • Structured output is non-negotiable. A field that should be a date but came back as "soon" breaks downstream code.
  • Validation feedback loops produce 95%+ first-try success. Way better than parsing free text.
  • No tool calls needed in most cases. The LLM is just a structured-output engine.
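The "soon" failure mode from the first bullet is easy to reproduce with a tiny model (the `Payment` type here is illustrative, not part of the invoice schema):

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class Payment(BaseModel):
    due: date  # must parse as a real date

# A well-formed value validates cleanly.
ok = Payment.model_validate({"due": "2025-03-01"})

# "soon" is exactly the kind of value a free-text parse lets through;
# a typed schema rejects it before it reaches the database.
try:
    Payment.model_validate({"due": "soon"})
    caught = False
except ValidationError:
    caught = True
```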

The schema

from pydantic import BaseModel, Field
from datetime import date
from decimal import Decimal

class InvoiceLineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal

class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice ID printed on the document")
    issue_date: date
    due_date: date
    vendor_name: str
    total_amount: Decimal
    line_items: list[InvoiceLineItem]

Pydantic models double as the schema for the model and the validator for the result.
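Both roles are visible on a trimmed-down model (`MiniInvoice` is a toy stand-in for the full `Invoice` class above):

```python
from decimal import Decimal
from pydantic import BaseModel, ValidationError

class MiniInvoice(BaseModel):
    invoice_number: str
    total_amount: Decimal

# 1. Schema side: a JSON Schema dict you can interpolate into the prompt.
schema = MiniInvoice.model_json_schema()

# 2. Validator side: the same class checks (and coerces) the model's output.
parsed = MiniInvoice.model_validate(
    {"invoice_number": "INV-42", "total_amount": "19.99"}
)

try:
    MiniInvoice.model_validate({"invoice_number": "INV-43"})  # missing field
    missing_caught = False
except ValidationError:
    missing_caught = True
```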

The extraction agent

from agentflow.core.graph import Agent, StateGraph, ToolNode
from agentflow.core.state import AgentState, Message
from agentflow.utils import END

def submit_invoice(invoice: dict) -> str:
    """Submit the extracted invoice. Call this once with the full structured data."""
    try:
        validated = Invoice.model_validate(invoice)
        invoices_db.write(validated.model_dump())  # invoices_db: your persistence layer
        return f"Saved invoice {validated.invoice_number}."
    except Exception as e:
        return f"Validation error: {e}. Fix the issues and resubmit."

extractor = Agent(
    model="anthropic/claude-3-5-sonnet",  # extraction benefits from larger models
    system_prompt=[{"role": "system", "content": (
        "Extract invoice data from the document. "
        f"Use the submit_invoice tool with this schema: {Invoice.model_json_schema()}. "
        "If submit_invoice returns a validation error, fix the issue and call submit_invoice again. "
        "Do not summarize. Do not chat. Just extract and submit."
    )}],
    tool_node="TOOL",
)

tool_node = ToolNode([submit_invoice])

graph = StateGraph(AgentState)
graph.add_node("MAIN", extractor)
graph.add_node("TOOL", tool_node)

def route(state):
    last = state.context[-1] if state.context else None
    if last and getattr(last, "tool_calls", None) and last.role == "assistant":
        return "TOOL"
    if last and last.role == "tool":
        # If submission succeeded, end. If it failed, loop back to MAIN.
        if "Validation error" in last.text():
            return "MAIN"
        return END
    return END

graph.add_conditional_edges("MAIN", route, {"TOOL": "TOOL", "MAIN": "MAIN", END: END})
graph.add_edge("TOOL", "MAIN")
graph.set_entry_point("MAIN")
app = graph.compile()

The agent submits, gets validated, fixes errors, and resubmits, usually within one or two iterations. Cap the loop with recursion_limit=5.

Running it

result = app.invoke(
    {"messages": [Message.text_message(invoice_text)]},
    config={"thread_id": f"extract-{invoice_id}", "recursion_limit": 5},
)
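If you derive thread_id from the document content rather than a counter, resubmitting the same file reuses the same thread. A hashing sketch (the helper name and prefix are illustrative; the invoke call itself stays framework-specific):

```python
import hashlib

def thread_id_for(invoice_text: str) -> str:
    # Same document text -> same thread_id -> no duplicate rows.
    digest = hashlib.sha256(invoice_text.encode("utf-8")).hexdigest()[:16]
    return f"extract-{digest}"

tid_a = thread_id_for("Invoice #123 ...")
tid_b = thread_id_for("Invoice #123 ...")
```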

Operational notes

  • Use a model that supports structured output. Claude 3.5 Sonnet, GPT-4o, Gemini 1.5+ all do well. Smaller models miss nested structure.
  • Pre-process the document. Strip headers, footers, and page numbers before extraction. The model wastes attention on them otherwise.
  • Validate strictly. Pydantic + custom validators catch the long tail.
  • Log failures with the input. When extraction fails, you need the input to debug. Redact PII first.
  • Idempotency. Use thread_id = invoice_hash so duplicate documents do not create duplicate rows.
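For the "validate strictly" note, cross-field checks are where custom validators earn their keep. A sketch (the line-total rule is an assumption; not every invoice format satisfies it):

```python
from decimal import Decimal
from pydantic import BaseModel, ValidationError, model_validator

class LineItem(BaseModel):
    quantity: int
    unit_price: Decimal
    total: Decimal

    @model_validator(mode="after")
    def total_matches(self):
        # Catch arithmetic the LLM got wrong, not just type errors.
        if self.total != self.quantity * self.unit_price:
            raise ValueError("total != quantity * unit_price")
        return self

good = LineItem(quantity=3, unit_price=Decimal("2.50"), total=Decimal("7.50"))

try:
    LineItem(quantity=3, unit_price=Decimal("2.50"), total=Decimal("7.00"))
    caught = False
except ValidationError:
    caught = True
```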

Variants of this pattern

  • Form filling. Extract user data from chat into a typed form
  • Email triage. Extract sender, intent, urgency, action items
  • Contract review. Extract terms, dates, parties from legal docs
  • Web scraping cleanup. Turn scraped HTML into typed records

The graph shape is the same. Only the schema changes.
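As one example of swapping the schema, an email-triage variant might look like this (field names and the urgency levels are illustrative):

```python
from enum import Enum
from pydantic import BaseModel

class Urgency(str, Enum):
    low = "low"
    normal = "normal"
    high = "high"

class EmailTriage(BaseModel):
    sender: str
    intent: str
    urgency: Urgency
    action_items: list[str]

# The same validate-and-retry graph consumes this instead of Invoice.
triaged = EmailTriage.model_validate({
    "sender": "pat@example.com",
    "intent": "refund request",
    "urgency": "high",
    "action_items": ["check order status", "issue refund"],
})
```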

Metrics that matter

Metric                          Target
First-try validation success    > 90%
Total success after retries     > 99%
Cost per document               depends on length, but plan < $0.05 for short docs
p95 latency                     < 8 s for typical documents

Further reading