Build a data extraction AI agent in Python

Pulling structured data out of unstructured text (invoices, contracts, support tickets, scraped pages) is one of the highest-ROI uses of LLMs. The trick is making the output reliable enough to ship to a database.

Here is the production architecture.

Architecture at a glance

[ Document text ]
       │
       ▼
[ Schema-aware LLM call ] ── enforces JSON shape
       │
       ▼
[ Validator ] ── pydantic / jsonschema
       │
       ├─→ valid? → write to DB
       └─→ invalid? → retry with error feedback (max 3)

This is mostly a workflow with a single agent loop for retries. See agents vs workflows.

Why this shape

Structured output is non-negotiable. A field that should be a date but came back as "soon" breaks downstream code.
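Validation feedback loops make it reliable. Schema-constrained output can reach 95%+ first-try success, and feeding validation errors back to the model recovers most of the remaining failures. That is far better than parsing free text.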
No tool calls needed in most cases. The LLM is just a structured-output engine.
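To make the first point concrete, here is a minimal sketch of what a typed field catches. The `DueDate` model and the `check` helper are hypothetical names used only for this illustration.

```python
from datetime import date

from pydantic import BaseModel, ValidationError


class DueDate(BaseModel):
    # Hypothetical one-field model, used only for this illustration.
    due_date: date


def check(payload: dict) -> str:
    """Return "ok" if the payload validates, otherwise the name of the failing field."""
    try:
        DueDate.model_validate(payload)
        return "ok"
    except ValidationError as exc:
        # errors() returns one dict per failure; "loc" is the path to the bad field.
        return str(exc.errors()[0]["loc"][0])


print(check({"due_date": "2024-03-01"}))  # an ISO date string is accepted
print(check({"due_date": "soon"}))        # a vague string is rejected at the boundary
```

The failure surfaces at validation time with the field name attached, instead of later inside downstream code.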

The schema

from pydantic import BaseModel, Field
from datetime import date
from decimal import Decimal

class InvoiceLineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal

class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice ID printed on the document")
    issue_date: date
    due_date: date
    vendor_name: str
    total_amount: Decimal
    line_items: list[InvoiceLineItem]

Pydantic models double as the schema for the model and the validator for the result.
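A short sketch of those two roles, using a trimmed-down copy of the models above so it runs on its own. `MiniInvoice` and `LineItem` are illustrative names, and the sample payload is made up.

```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal


class MiniInvoice(BaseModel):
    invoice_number: str = Field(description="The invoice ID printed on the document")
    issue_date: date
    total_amount: Decimal
    line_items: list[LineItem]


# Role 1: a JSON Schema dict that can be embedded in a prompt or tool definition.
schema = MiniInvoice.model_json_schema()
required_fields = sorted(schema["required"])

# Role 2: a validator that turns JSON-style values into typed Python objects.
parsed = MiniInvoice.model_validate(
    {
        "invoice_number": "INV-001",
        "issue_date": "2024-01-15",
        "total_amount": "40.00",
        "line_items": [
            {"description": "Widget", "quantity": 2, "unit_price": "20.00", "total": "40.00"}
        ],
    }
)

print(required_fields)
print(type(parsed.issue_date).__name__, type(parsed.total_amount).__name__)
```

The `Field(description=...)` text ends up in the generated schema, so it doubles as a per-field instruction to the model.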

The extraction agent

from agentflow.core.graph import Agent, StateGraph, ToolNode
from agentflow.core.state import AgentState, Message
from agentflow.utils import END

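from pydantic import ValidationError

def submit_invoice(invoice: dict) -> str:
    """Submit the extracted invoice. Call this once with the full structured data."""
    try:
        validated = Invoice.model_validate(invoice)
    except ValidationError as e:
        # Return the error text so the model can fix the fields and resubmit.
        return f"Validation error: {e}. Fix the issues and resubmit."
    # invoices_db is your storage client, defined elsewhere. A storage failure is
    # not something the model can fix, so it is not reported as a validation error.
    invoices_db.write(validated.model_dump())
    return f"Saved invoice {validated.invoice_number}."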

extractor = Agent(
    model="anthropic/claude-3-5-sonnet",  # extraction benefits from larger models
    system_prompt=[{"role": "system", "content": (
        "Extract invoice data from the document. "
        f"Use the submit_invoice tool with this schema: {Invoice.model_json_schema()}. "
        "If submit_invoice returns a validation error, fix the issue and call submit_invoice again. "
        "Do not summarize. Do not chat. Just extract and submit."
    )}],
    tool_node="TOOL",
)

tool_node = ToolNode([submit_invoice])

graph = StateGraph(AgentState)
graph.add_node("MAIN", extractor)
graph.add_node("TOOL", tool_node)

def route(state):
    last = state.context[-1] if state.context else None
    if last and getattr(last, "tools_calls", None) and last.role == "assistant":
        return "TOOL"
    if last and last.role == "tool":
        # If submission succeeded, end. If it failed, loop back to MAIN.
        if "Validation error" in last.text():
            return "MAIN"
        return END
    return END

graph.add_conditional_edges("MAIN", route, {"TOOL": "TOOL", "MAIN": "MAIN", END: END})
graph.add_edge("TOOL", "MAIN")
graph.set_entry_point("MAIN")
app = graph.compile()

The agent submits, gets validated, fixes errors, and resubmits, usually within 1 or 2 iterations. Cap iterations with recursion_limit=5.
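Stripped of the graph framework, the submit, validate, and resubmit cycle is a small loop. The sketch below is framework-free and is not the agentflow implementation. `call_model` is a hypothetical callable standing in for any LLM client, and `Receipt` is a made-up schema.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Receipt(BaseModel):
    # Hypothetical small schema for the illustration.
    vendor_name: str
    total_cents: int


def extract_with_retries(
    call_model: Callable[[str], str],
    document: str,
    max_attempts: int = 3,
) -> Receipt:
    """Ask for JSON, validate it, and feed validation errors back on failure."""
    base_prompt = f"Extract a receipt as JSON from:\n{document}"
    prompt = base_prompt
    last_error = ""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            # Raises ValidationError for malformed JSON as well as wrong field types.
            return Receipt.model_validate_json(raw)
        except ValidationError as exc:
            last_error = str(exc)
            prompt = (
                f"{base_prompt}\n\nYour previous answer failed validation:\n"
                f"{last_error}\nReturn corrected JSON only."
            )
    raise ValueError(f"Extraction failed after {max_attempts} attempts: {last_error}")


# A scripted stand-in for a model: the first reply is invalid, the second is valid.
replies = iter(
    [
        '{"vendor_name": "Acme", "total_cents": "twelve"}',
        '{"vendor_name": "Acme", "total_cents": 1200}',
    ]
)
result = extract_with_retries(lambda prompt: next(replies), "Acme receipt, total $12.00")
print(result)
```

The hard cap matters: without it, a document the model cannot parse would loop until it hit a rate limit or a budget.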

Running it

result = app.invoke(
    {"messages": [Message.text_message(invoice_text)]},
    config={"thread_id": f"extract-{invoice_id}", "recursion_limit": 5},
)
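When no stable invoice_id is available, one option is to derive the thread ID from the document content. This is a minimal sketch; `thread_id_for` is a hypothetical helper, and the whitespace normalization is an assumption about what counts as "the same document".

```python
import hashlib


def thread_id_for(document_text: str) -> str:
    """Build a deterministic thread ID from the document content."""
    # Collapse whitespace so trivially different copies hash to the same value.
    normalized = " ".join(document_text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # 16 hex characters keeps the ID short while remaining collision-resistant
    # enough for typical batch sizes.
    return f"extract-{digest[:16]}"


print(thread_id_for("Invoice 42\nTotal: 100.00"))
```

A deterministic ID only helps if the storage layer also rejects repeats, so a unique constraint on the hash or the invoice number is still worth having.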

Operational notes

Use a model that supports structured output. Claude 3.5 Sonnet, GPT-4o, Gemini 1.5+ all do well. Smaller models miss nested structure.
Pre-process the document. Strip headers, footers, and page numbers before extraction. The model wastes attention on them otherwise.
Validate strictly. Pydantic + custom validators catch the long tail.
Log failures with the input. When extraction fails, you need the input to debug. Redact PII first.
Idempotency. Use thread_id = invoice_hash so duplicate documents do not create duplicate rows.
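As an illustration of the strict validation note, here is a sketch of a cross-field check using Pydantic's `model_validator`. The model names are hypothetical, and the rule that line items sum exactly to `total_amount` is an assumption; real invoices with tax or discounts need a looser rule.

```python
from decimal import Decimal

from pydantic import BaseModel, ValidationError, model_validator


class Item(BaseModel):
    quantity: int
    unit_price: Decimal
    total: Decimal


class CheckedInvoice(BaseModel):
    total_amount: Decimal
    line_items: list[Item]

    @model_validator(mode="after")
    def totals_add_up(self) -> "CheckedInvoice":
        # Each line must be internally consistent.
        for item in self.line_items:
            if item.quantity * item.unit_price != item.total:
                raise ValueError(f"line total {item.total} != quantity * unit_price")
        # Assumption: no tax or discount, so lines sum exactly to the invoice total.
        lines_sum = sum((item.total for item in self.line_items), Decimal("0"))
        if lines_sum != self.total_amount:
            raise ValueError(f"line items sum to {lines_sum}, not {self.total_amount}")
        return self


def is_consistent(payload: dict) -> bool:
    """Return True when the payload passes type and cross-field validation."""
    try:
        CheckedInvoice.model_validate(payload)
        return True
    except ValidationError:
        return False


good = {
    "total_amount": "30.00",
    "line_items": [{"quantity": 3, "unit_price": "10.00", "total": "30.00"}],
}
bad = {**good, "total_amount": "35.00"}
print(is_consistent(good), is_consistent(bad))
```

Because the check raises inside validation, its message flows back to the model through the same error feedback path as a type error.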

Variants of this pattern

Form filling. Extract user data from chat into a typed form
Email triage. Extract sender, intent, urgency, action items
Contract review. Extract terms, dates, parties from legal docs
Web scraping cleanup. Turn scraped HTML into typed records

The graph shape is the same. Only the schema changes.
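For example, the email triage variant might swap in a schema like the one below. This is a sketch: the field names follow the list above, while the urgency levels and the description text are assumptions.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class EmailTriage(BaseModel):
    sender: str
    intent: str = Field(description="One-sentence summary of what the sender wants")
    # Literal restricts the model to a closed set of labels.
    urgency: Literal["low", "medium", "high"]
    action_items: list[str] = Field(default_factory=list)


triaged = EmailTriage.model_validate(
    {
        "sender": "pat@example.com",
        "intent": "Requests a refund for a duplicate charge",
        "urgency": "high",
        "action_items": ["Check billing records", "Reply with refund status"],
    }
)
print(triaged.urgency, len(triaged.action_items))

try:
    EmailTriage.model_validate({"sender": "x", "intent": "y", "urgency": "ASAP"})
except ValidationError as exc:
    # An out-of-set label fails validation and would trigger a retry.
    print(exc.error_count(), "validation error")
```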

Metrics that matter

Metric	Target
First-try validation success	> 90%
Total success after retries	> 99%
Cost per document	depends on length, but plan < $0.05 for short docs
p95 latency	< 8 s for typical documents
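One way to compute these numbers from run logs is sketched below. The `RunRecord` shape is hypothetical (adapt it to whatever you log), and p95 uses the simple nearest-rank method. Cost is omitted because it depends on provider pricing.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical log record; the field names are assumptions.
    attempts: int      # validation attempts used (1 means first-try success)
    succeeded: bool    # whether a valid record was eventually written
    latency_s: float   # wall-clock seconds for the whole run


def summarize(runs: list[RunRecord]) -> dict:
    """Compute first-try rate, total success rate, and nearest-rank p95 latency."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    first_try = sum(1 for r in runs if r.succeeded and r.attempts == 1)
    total_ok = sum(1 for r in runs if r.succeeded)
    latencies = sorted(r.latency_s for r in runs)
    p95 = latencies[math.ceil(0.95 * n) - 1]
    return {
        "first_try_rate": first_try / n,
        "total_success_rate": total_ok / n,
        "p95_latency_s": p95,
    }


sample = [
    RunRecord(attempts=1, succeeded=True, latency_s=1.0),
    RunRecord(attempts=2, succeeded=True, latency_s=3.0),
    RunRecord(attempts=1, succeeded=True, latency_s=2.0),
    RunRecord(attempts=3, succeeded=False, latency_s=9.0),
]
print(summarize(sample))
```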
