Production troubleshooting
This page focuses on issues that usually appear after an agent leaves local development: deployment failures, state drift, auth mismatches, and cross-service connectivity problems.
If you need a narrower troubleshooting guide, use the dedicated pages:
Production troubleshooting workflow
Issue: deployment starts but requests fail immediately
Symptoms
- /ping works but graph routes fail
- first invoke request returns 500
- logs mention import or dependency errors
Likely causes
- graph import path is wrong in agentflow.json
- environment variables required by the graph are missing
- production image does not include all dependencies
Fix
- verify python -c "from graph.react import app; print(app)"
- verify deploy-time secrets are present
- verify the image or runtime installed all required Python packages
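The import and environment checks above can be combined into one preflight script that runs inside the production image. This is a minimal sketch, not part of agentflow: the module path graph.react and attribute app come from the command above, while REQUIRED_ENV and the variable names in it are hypothetical placeholders for whatever your graph actually reads.

```python
import importlib
import os
from typing import List, Mapping, Optional

# Hypothetical names: replace with the variables your graph actually reads.
REQUIRED_ENV = ["OPENAI_API_KEY", "DATABASE_URL"]


def check_import(module_path: str, attr: str) -> Optional[str]:
    """Return an error message if module_path.attr cannot be loaded, else None."""
    try:
        module = importlib.import_module(module_path)
    except Exception as exc:  # ImportError, or any error raised at import time
        return f"cannot import {module_path}: {exc!r}"
    if not hasattr(module, attr):
        return f"{module_path} has no attribute {attr!r}"
    return None


def missing_env(names: List[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


def preflight() -> List[str]:
    """Collect every problem found; an empty list means the checks passed."""
    problems = []
    error = check_import("graph.react", "app")
    if error:
        problems.append(error)
    problems += [f"missing env var: {name}" for name in missing_env(REQUIRED_ENV)]
    return problems
```

Calling preflight() in the container entrypoint and exiting non-zero when it returns anything turns a 500 on the first request into a failed deploy, which is usually easier to diagnose.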
Issue: threads vanish after restart
Symptoms
- conversation history works until the process restarts
- /v1/threads becomes empty after deployment recycle
Likely cause
- InMemoryCheckpointer is still being used
Fix
- switch to a durable shared checkpointer such as PgCheckpointer
- verify restart behavior before re-releasing
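A restart check can be automated by comparing the thread IDs visible before and after a restart. The sketch below is deliberately generic: fetch_thread_ids and restart are hypothetical callables you supply (for example, a call to /v1/threads and a container restart command), because the response shape and restart mechanism depend on your deployment.

```python
from typing import Callable, Iterable, List


def threads_lost_on_restart(
    fetch_thread_ids: Callable[[], Iterable[str]],
    restart: Callable[[], None],
) -> List[str]:
    """Return thread IDs visible before the restart but missing after it."""
    before = set(fetch_thread_ids())
    restart()
    after = set(fetch_thread_ids())
    return sorted(before - after)


# Demonstration with a stand-in: a dict plays the role of an in-memory store
# that is wiped when the process restarts.
memory_store = {"thread-1": ["hi"], "thread-2": ["hello"]}
lost = threads_lost_on_restart(lambda: list(memory_store), memory_store.clear)
print(lost)  # ['thread-1', 'thread-2']
```

With a durable checkpointer the returned list should be empty; any ID in it is a conversation that did not survive the restart.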
Issue: one replica sees thread history and another does not
Symptoms
- state appears inconsistent across instances
- one request remembers context, the next does not
Likely cause
- instances are not sharing the same persistence backend
Fix
- point all replicas to the same Postgres/Redis-backed checkpointer
- confirm the same thread_id is being used by the caller
Issue: auth works in curl but fails in browser clients
Symptoms
- curl with bearer token succeeds
- frontend requests fail or never send credentials
Likely causes
- browser client is not attaching the auth header
- proxy strips Authorization
- CORS configuration blocks browser requests
Fix
- inspect the browser network tab
- verify frontend client config
- verify proxy forwards Authorization
- verify ORIGINS includes the real frontend origin
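curl succeeds where a browser fails because curl never sends a CORS preflight. The sketch below reproduces the preflight by hand and checks whether the response would let a browser attach Authorization. It relies only on standard CORS headers; the function names are illustrative, and the URL and origin you pass in are your own.

```python
import urllib.request
from typing import Dict, Mapping


def fetch_preflight_headers(url: str, origin: str) -> Dict[str, str]:
    """Send the OPTIONS preflight a browser would send before an authenticated POST."""
    request = urllib.request.Request(
        url,
        method="OPTIONS",
        headers={
            "Origin": origin,
            "Access-Control-Request-Method": "POST",
            "Access-Control-Request-Headers": "authorization",
        },
    )
    # urlopen raises HTTPError on 4xx/5xx, which is itself a useful signal here.
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


def preflight_allows_authorization(response_headers: Mapping[str, str], origin: str) -> bool:
    """Return True if the preflight response lets `origin` send an Authorization header."""
    headers = {key.lower(): value for key, value in response_headers.items()}
    if headers.get("access-control-allow-origin", "") not in ("*", origin):
        return False
    allowed = {
        name.strip().lower()
        for name in headers.get("access-control-allow-headers", "").split(",")
    }
    # Authorization must be listed explicitly; a "*" wildcard does not cover it.
    return "authorization" in allowed
```

If this returns False for the real frontend origin, the problem is in the server's CORS settings or in a proxy that rewrites the preflight response, not in the token.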
Issue: production deployment exposes too much
Symptoms
- /docs and /redoc are publicly reachable
- cross-origin browser access is broader than intended
Likely causes
- DOCS_PATH / REDOCS_PATH still enabled
- ORIGINS=* still set
Fix
- disable docs endpoints or restrict exposure intentionally
- replace wildcard origins with explicit domains
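A startup check can catch a wildcard or malformed origin list before it reaches production. This sketch assumes ORIGINS holds a comma-separated list of origins; that format is an assumption, so adjust the parsing to match how your deployment actually sets the value. audit_origins is a hypothetical helper, not an agentflow function.

```python
from typing import List
from urllib.parse import urlsplit


def audit_origins(raw: str) -> List[str]:
    """Return warnings for an ORIGINS value; an empty list means no findings."""
    warnings = []
    # Assumption: origins are comma-separated.
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    for origin in origins:
        if origin == "*":
            warnings.append("wildcard origin '*' allows any site")
            continue
        parts = urlsplit(origin)
        if parts.scheme != "https":
            warnings.append(f"{origin}: not an https origin")
        if parts.path or parts.query or parts.fragment:
            # Browsers send Origin as scheme://host[:port], with no trailing slash.
            warnings.append(f"{origin}: an origin has no path, query, or fragment")
    return warnings


print(audit_origins("https://app.example.com"))  # []
```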
Issue: requests time out only in production
Symptoms
- local requests are fine
- deployed requests are slow or timing out
Likely causes
- external tools or providers are slower in the deployed environment
- reverse proxy timeouts are too aggressive
- graph is making too many sequential calls
Fix
- inspect server logs for slow nodes or tools
- tune proxy timeout settings
- prefer streaming where appropriate
- reduce expensive tool-call chains if possible
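Finding slow nodes or tools in the logs is easier when each call reports its own duration. The decorator below is a generic sketch for synchronous functions, not an agentflow feature; the logger name, threshold, and slow_tool example are all hypothetical.

```python
import functools
import logging
import time

logger = logging.getLogger("slow_calls")  # hypothetical logger name


def log_if_slow(threshold_s: float = 1.0):
    """Decorator: log a warning when the wrapped call takes at least threshold_s."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed = time.perf_counter() - start
                if elapsed >= threshold_s:
                    logger.warning("%s took %.2fs", func.__name__, elapsed)
        return wrapper
    return decorator


@log_if_slow(threshold_s=0.05)
def slow_tool(query: str) -> str:
    time.sleep(0.1)  # stand-in for a slow external call
    return query.upper()
```

Comparing the logged durations between local and deployed runs shows whether the time is going to a specific provider call or to the network path in front of the service. Async tools would need an equivalent wrapper that awaits the call.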
Issue: agentflow play works locally but deployed users cannot connect
Symptoms
- local playground sessions are fine
- deployed frontend or shared users fail to connect reliably
Likely cause
- agentflow play was used as a testing tool, but the deployed system needs a proper hosted API endpoint and browser-safe networking setup
Fix
- deploy with agentflow api behind HTTPS and correct CORS/auth settings
- treat agentflow play as an interactive test path, not the deployment architecture
Secret redaction in logs
Debug logging may surface API keys, bearer tokens, signed URLs, or other credentials that appear in LLM request parameters and error messages. The agentflow.utils module provides helpers to redact common credential formats before they reach your log handlers.
Patterns redacted
mask_secrets redacts the following credential formats:
- OpenAI keys (sk-..., sk-proj-...)
- Google API keys (AIza...)
- GitHub tokens (ghp_, gho_, ghu_, ghs_, ghr_)
- Slack tokens (xox...)
- AWS access key IDs (AKIA...)
- Bearer <token> values in Authorization headers
- key=value pairs where the key is api_key, access_token, secret, or password
- Signed-URL credential query parameters (?token=, &sig=, &X-Amz-Signature=, etc.)
Quick setup
from agentflow.utils import install_secret_redaction
# Call once at application startup, after configuring your logging handlers.
install_secret_redaction() # covers the "agentflow" logger and its handlers
install_secret_redaction("root") # or cover the root logger
Attaching to a specific handler
For finer control, add SecretRedactionFilter directly to a handler. Handler-level filters apply to all loggers that propagate to that handler, including children:
import logging
from agentflow.utils import SecretRedactionFilter
handler = logging.StreamHandler()
handler.addFilter(SecretRedactionFilter())
logging.getLogger("agentflow").addHandler(handler)
Redacting arbitrary strings
from agentflow.utils import mask_secrets
safe_text = mask_secrets(some_string_that_may_contain_keys)
This is a defence-in-depth measure. Prefer not logging secrets in the first place. mask_secrets is a heuristic and may miss novel credential formats.
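To see why pattern-based redaction is a heuristic, it helps to look at how such a filter works in general. The sketch below is written from scratch with the standard library and is not agentflow's implementation: the three patterns, the redact function, and the RedactingFilter class are illustrative only.

```python
import logging
import re

# Illustrative patterns only; these are NOT agentflow's actual rules.
_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),                  # OpenAI-style keys
    re.compile(r"\bAKIA[0-9A-Z]{16}"),                      # AWS access key IDs
    re.compile(r"(?i)(?<=bearer )[A-Za-z0-9._~+/-]+=*"),    # Bearer tokens
]


def redact(text: str) -> str:
    """Replace every substring that matches a known credential pattern."""
    for pattern in _PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrite each record's message with credentials masked."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style args are covered too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it


print(redact("Authorization: Bearer abc.def-ghi"))  # Authorization: Bearer [REDACTED]
print(redact("token tok_live_12345"))               # token tok_live_12345
```

The second call shows the limitation: a credential in a format no pattern anticipates passes through unchanged, which is why keeping secrets out of log calls remains the primary control.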
Quick production checklist
- confirm exact runtime command
- confirm active agentflow.json
- confirm environment variables in the live process
- confirm auth and CORS behavior from a real client
- confirm persistence with restart testing
- confirm proxy and network path
Related docs
What you learned
- How to troubleshoot production failures by separating runtime, config, network, auth, and persistence layers.
- Which failures are usually caused by development defaults leaking into production.