Failure Modes
What happens when Postgres, Redis, SSE, or background workers fail. Component-by-component impact and recovery.
Right page if: you need to understand what happens when a console component goes down -- database, cache, SSE connections, or background workers. Wrong page if: you need the core library's fail-closed behavior (contract evaluation errors) -- see https://docs.edictum.ai/docs/security/fail-closed. Gotcha: agents continue enforcing contracts locally when the console is unreachable. The console is for coordination and visibility, not enforcement. The most impactful failure is Postgres down -- audit events are lost until it recovers.
The console depends on Postgres, Redis, and background workers. Here is what happens when each fails.
Component Failure Matrix
Postgres Down
| Feature | Impact | Recovery |
|---|---|---|
| Health endpoint | Returns {"status": "degraded"} | Automatic when Postgres recovers |
| Server startup | Hangs on Alembic migration (no timeout) | Restart after Postgres is available |
| Audit event ingestion | Events from agents fail (500); events not persisted | Events during outage are lost |
| Approval workflows | Creating/deciding approvals fails | Pending approvals resume after recovery |
| Bundle deployment | Cannot store or deploy bundles | Resume after recovery |
| Dashboard auth (cookie) | Works (sessions in Redis, not Postgres) | N/A |
| Agent SSE connections | Stay open; no new bundles pushed | Connections maintained; push resumes |
| Notification channels | Existing channel config works; changes fail | Resume after recovery |
Data loss risk: Audit events received during a Postgres outage are not persisted. Agents buffer up to 10,000 events locally; events beyond the buffer are dropped.
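The agent-side buffer can be sketched as a bounded list that drops new events once the 10,000-event cap is reached. This is a minimal sketch of the behavior described above; `EventBuffer` and its drop-newest policy are illustrative assumptions, not the library's actual API.

```python
class EventBuffer:
    """Bounded local buffer for audit events while ingestion is down.

    Hypothetical sketch: once the cap is reached, new events are dropped
    and counted, matching "events beyond the buffer are dropped".
    """

    def __init__(self, max_events: int = 10_000) -> None:
        self._events: list = []
        self._max = max_events
        self.dropped = 0

    def add(self, event: dict) -> bool:
        """Buffer an event; return False if it was dropped instead."""
        if len(self._events) >= self._max:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self) -> list:
        """Hand back buffered events for re-delivery once Postgres recovers."""
        out, self._events = self._events, []
        return out
```

Sizing the cap is a trade-off: a larger buffer survives longer outages but costs agent memory and delays the discovery that events are being lost.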
Redis Down
| Feature | Impact | Recovery |
|---|---|---|
| Health endpoint | Returns {"status": "degraded", "redis_connected": false} | Automatic when Redis recovers |
| Dashboard sessions | All active sessions lost; users must re-login | Sessions recreated on next login |
| Rate limiting | Bypassed until Redis recovers | Rate limits reset on recovery |
| Agent SSE connections | Existing connections stay open; new subscriptions may fail | Reconnect after recovery |
| Approval workflows | Unaffected (DB-backed) | N/A |
| Audit event ingestion | Unaffected (DB-backed) | N/A |
| Notification channels | Unaffected (DB-backed) | N/A |
Security risk: Rate limiting is bypassed during Redis outage. An attacker could brute-force login credentials during this window.
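The fail-open behavior can be sketched as a counter check that allows the request whenever the backend is unreachable. The `incr` callable stands in for a Redis INCR; all names here are hypothetical, not the console's actual code.

```python
def allow_request(incr, key: str, limit: int = 5) -> bool:
    """Fail-open rate limit check.

    `incr` is any callable that increments and returns the request count
    for `key` (a stand-in for a Redis INCR within a fixed window). If the
    counter backend is down, the request is allowed rather than blocked.
    """
    try:
        return incr(key) <= limit
    except ConnectionError:
        # Redis unreachable: rate limiting is bypassed -- the security
        # risk noted above, accepted so logins still work during outages.
        return True
```

The alternative, failing closed, would lock every user out whenever Redis restarts, which is why fail-open is the common choice for login rate limits despite the brute-force window.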
SSE Connection Drops
| Scenario | Behavior | Recovery |
|---|---|---|
| Agent loses network | SSE stream breaks; agent keeps last-known contracts | Agent auto-reconnects with exponential backoff (1s → 60s max) |
| Console restarts | All SSE connections drop | Agents reconnect automatically |
| Queue overflow (>1000 events) | Connection marked closed; warning logged | Cleanup task removes stale connections every 5 minutes; agent reconnects |
| Agent receives stale contracts | Drift detected in fleet monitoring | Trigger redeployment from dashboard or wait for auto-reconnect |
No data loss: SSE carries contract updates, not audit data. Agents send audit events via HTTP POST (separate from SSE).
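The agent's reconnect schedule (1s doubling to a 60s cap, per the table above) can be sketched as a generator of delays. This is an illustrative sketch, not the agent's actual implementation; real clients typically also add jitter, omitted here for clarity.

```python
def backoff_delays(base: float = 1.0, cap: float = 60.0, factor: float = 2.0):
    """Yield reconnect delays for a dropped SSE stream.

    Exponential growth from `base` seconds, clamped at `cap`
    (1s -> 60s max, matching the table above).
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)
```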
Background Worker Crash
| Worker | Check interval | Auto-restart | Side effect during outage |
|---|---|---|---|
| Approval timeout | 60s monitor | Yes | Pending approvals not expired for up to 60s; agents wait longer |
| Partition manager | 60s monitor | Yes | Event table may lack future partitions; writes may slow |
| Worker monitor | Not monitored | Manual restart required | Neither worker auto-restarts if the monitor itself crashes |
The `_worker_monitor` task checks every 60 seconds whether each worker task has finished (`task.done()` returns `True`) and restarts it if so. If the monitor itself crashes, a manual restart is required (redeploy the pod).
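The monitor loop can be sketched with asyncio. This is an illustrative sketch of the pattern described above, not the console's actual `_worker_monitor` code; `factories` and `cycles` are hypothetical (`cycles` bounds the loop here, whereas the real monitor runs forever).

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

async def worker_monitor(
    factories: Dict[str, Callable[[], Awaitable[None]]],
    interval: float = 60.0,
    cycles: Optional[int] = None,
) -> Dict[str, asyncio.Task]:
    """Restart any background worker whose task has finished.

    `factories` maps worker names to coroutine functions. Note the single
    point of failure: nothing restarts this loop if it crashes.
    """
    tasks = {name: asyncio.create_task(fn()) for name, fn in factories.items()}
    n = 0
    while cycles is None or n < cycles:
        await asyncio.sleep(interval)
        for name, task in tasks.items():
            if task.done():
                task.exception()  # retrieve the crash so asyncio does not warn
                tasks[name] = asyncio.create_task(factories[name]())
        n += 1
    return tasks
```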
Notification Channel Failures
| Channel | Failure mode | Retry? | Impact |
|---|---|---|---|
| Email (SMTP) | Timeout or server down | No | Approval notification silently dropped |
| Telegram | API down or invalid token | No | Approval notification silently dropped |
| Slack | Invalid webhook or rate-limited | No | Approval notification silently dropped |
| Discord | Signature validation fails | No | Webhook disabled by Discord |
| Webhook | SSRF detected or timeout | No | Event not delivered |
Channels fail independently. If one channel fails, others still fire. Approval workflows continue regardless of notification delivery — the approval sits in the queue until someone checks the dashboard or another channel delivers.
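The independent fan-out can be sketched as a loop that catches each channel's failure separately. Names here (`notify_all`, the callable-per-channel shape) are hypothetical, not the console's actual API.

```python
import logging

log = logging.getLogger("notifications")

def notify_all(channels, message: str) -> dict:
    """Fan out a notification to every configured channel.

    One channel raising does not stop the others; failures are logged
    and not retried, matching the table above. `channels` maps channel
    names to send callables.
    """
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = "sent"
        except Exception as exc:
            log.warning("channel %s failed: %s", name, exc)
            results[name] = "failed"
    return results
```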
Graceful Degradation
The console is not required for enforcement. When the console is completely unreachable:
- Agents continue evaluating contracts locally using their last-known bundle
- Audit events buffer locally (up to 10,000 events per agent)
- Approval workflows time out according to their configured `timeout_effect` (deny or allow)
- When the console comes back, agents reconnect and resume normal operation
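The fallback order agents use for contract bundles can be sketched as a first-available resolution. `resolve_bundle` and its argument names are hypothetical; the real logic lives in the core library.

```python
def resolve_bundle(live_bundle, cached_bundle, embedded_yaml):
    """Pick the contract source in degradation order.

    Live SSE update first, then the in-memory cached bundle, then the
    YAML embedded at build time; if none is available, the agent denies
    everything (fail closed).
    """
    sources = (
        ("sse", live_bundle),
        ("cache", cached_bundle),
        ("embedded", embedded_yaml),
    )
    for name, bundle in sources:
        if bundle is not None:
            return name, bundle
    return "deny-all", None
```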
The degradation chain for agents:
SSE (live updates) → In-memory (cached bundle) → Embedded YAML (fallback) → Deny all

Health Endpoints
| Endpoint | Returns | Use for |
|---|---|---|
| GET /api/v1/health | Full status: database, Redis, workers, connected agents | Dashboard monitoring, alerting |
| GET /api/v1/health/ready | 200 or 503 based on Postgres + Redis + workers | Kubernetes readiness probe |
| GET /api/v1/health/live | Always 200 if process is alive | Kubernetes liveness probe |
Alert on: `status: "degraded"`, any worker `"crashed"`, `redis_connected: false`, `database.connected: false`.
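A monitoring check over the full health payload can be sketched as below. The field names follow the alerting list above; the exact payload shape (nesting of `database`, the `workers` map) is an assumption, so verify against a live `GET /api/v1/health` response before relying on it.

```python
def health_alerts(payload: dict) -> list:
    """Derive alert conditions from a /api/v1/health JSON response.

    Returns a list of human-readable alert strings; an empty list
    means no alerting condition was found.
    """
    alerts = []
    if payload.get("status") == "degraded":
        alerts.append("console degraded")
    if not payload.get("redis_connected", True):
        alerts.append("redis disconnected")
    if not payload.get("database", {}).get("connected", True):
        alerts.append("database disconnected")
    for name, state in payload.get("workers", {}).items():
        if state == "crashed":
            alerts.append(f"worker {name} crashed")
    return alerts
```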
Next Steps
- Production checklist — pre-flight verification
- Architecture — internal design and component details
- Fail-closed guarantees — core library failure behavior