Failure Modes
What happens when Postgres, Redis, SSE, or background workers fail. Component-by-component impact and recovery.
Right page if: you need to understand what happens when a console component goes down -- database, cache, SSE connections, or background workers. Wrong page if: you need the core library's fail-closed behavior (contract evaluation errors) -- see https://docs.edictum.ai/docs/security/fail-closed. Gotcha: agents continue enforcing contracts locally when the console is unreachable. The console is for coordination and visibility, not enforcement. The most impactful failure is Postgres down -- audit events are lost until it recovers.
The console depends on Postgres, Redis, and background workers. Here is what happens when each fails.
Component Failure Matrix
Postgres Down
| Feature | Impact | Recovery |
|---|---|---|
| Health endpoint | Returns {"status": "degraded"} | Automatic when Postgres recovers |
| Server startup | Hangs on Alembic migration (no timeout) | Restart after Postgres is available |
| Audit event ingestion | Events from agents fail (500); events not persisted | Events during outage are lost |
| Approval workflows | Creating/deciding approvals fails | Pending approvals resume after recovery |
| Bundle deployment | Cannot store or deploy bundles | Resume after recovery |
| Dashboard auth (cookie) | Works (sessions in Redis, not Postgres) | N/A |
| Agent SSE connections | Stay open; no new bundles pushed | Connections maintained; push resumes |
| Notification channels | Existing channel config works; changes fail | Resume after recovery |
Data loss risk: Audit events received during a Postgres outage are not persisted. Agents buffer up to 10,000 events locally; events beyond the buffer are dropped.
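The agent-side buffer can be sketched as a bounded list that drops new events once the 10,000-event cap is reached. This is a minimal sketch of the behavior described above; `EventBuffer` and its drop-newest policy are illustrative assumptions, not the library's actual API.

```python
class EventBuffer:
    """Bounded local buffer for audit events while ingestion is down.

    Hypothetical sketch: once the cap is reached, new events are dropped
    and counted, matching "events beyond the buffer are dropped".
    """

    def __init__(self, max_events: int = 10_000) -> None:
        self._events: list = []
        self._max = max_events
        self.dropped = 0

    def add(self, event: dict) -> bool:
        """Buffer an event; return False if it was dropped instead."""
        if len(self._events) >= self._max:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self) -> list:
        """Hand back buffered events for re-delivery once Postgres recovers."""
        out, self._events = self._events, []
        return out
```

Sizing the cap is a trade-off: a larger buffer survives longer outages but costs agent memory and delays the discovery that events are being lost.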
Redis Down
| Feature | Impact | Recovery |
|---|---|---|
| Health endpoint | Returns {"status": "degraded", "redis_connected": false} | Automatic when Redis recovers |
| Dashboard sessions | All active sessions lost; users must re-login | Sessions recreated on next login |
| Rate limiting | Bypassed until Redis recovers | Rate limits reset on recovery |
| Agent SSE connections | Existing connections stay open; new subscriptions may fail | Reconnect after recovery |
| Approval workflows | Unaffected (DB-backed) | N/A |
| Audit event ingestion | Unaffected (DB-backed) | N/A |
| Notification channels | Unaffected (DB-backed) | N/A |
Security risk: Rate limiting is bypassed during Redis outage. An attacker could brute-force login credentials during this window.
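The fail-open behavior can be sketched as a counter check that allows the request whenever the backend is unreachable. The `incr` callable stands in for a Redis INCR; all names here are hypothetical, not the console's actual code.

```python
def allow_request(incr, key: str, limit: int = 5) -> bool:
    """Fail-open rate limit check.

    `incr` is any callable that increments and returns the request count
    for `key` (a stand-in for a Redis INCR within a fixed window). If the
    counter backend is down, the request is allowed rather than blocked.
    """
    try:
        return incr(key) <= limit
    except ConnectionError:
        # Redis unreachable: rate limiting is bypassed -- the security
        # risk noted above, accepted so logins still work during outages.
        return True
```

The alternative, failing closed, would lock every user out whenever Redis restarts, which is why fail-open is the common choice for login rate limits despite the brute-force window.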
SSE Connection Drops
| Scenario | Behavior | Recovery |
|---|---|---|
| Agent loses network | SSE stream breaks; agent keeps last-known contracts | Agent auto-reconnects with exponential backoff (1s → 60s max) |
| Console restarts | All SSE connections drop | Agents reconnect automatically |
| Queue overflow (>1000 events) | Connection marked closed; warning logged | Cleanup task removes stale connections every 5 minutes; agent reconnects |
| Agent receives stale contracts | Drift detected in fleet monitoring | Trigger redeployment from dashboard or wait for auto-reconnect |
No data loss: SSE carries contract updates, not audit data. Agents send audit events via HTTP POST (separate from SSE).
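The agent's reconnect schedule (1s doubling to a 60s cap, per the table above) can be sketched as a generator of delays. This is an illustrative sketch, not the agent's actual implementation; real clients typically also add jitter, omitted here for clarity.

```python
def backoff_delays(base: float = 1.0, cap: float = 60.0, factor: float = 2.0):
    """Yield reconnect delays for a dropped SSE stream.

    Exponential growth from `base` seconds, clamped at `cap`
    (1s -> 60s max, matching the table above).
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)
```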
Background Worker Crash
| Worker | Check interval | Auto-restart | Side effect during outage |
|---|---|---|---|
| Approval timeout | 60s monitor | Yes | Pending approvals not expired for up to 60s; agents wait longer |
| Partition manager | 60s monitor | Yes | Event table may lack future partitions; writes may slow |
| Worker monitor | Not monitored | Manual restart required | Neither worker auto-restarts if the monitor itself crashes |
The `_worker_monitor` task checks every 60 seconds whether each worker task has finished (`task.done()` returns `True`) and restarts it if so. If the monitor itself crashes, a manual restart is required (redeploy the pod).
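The monitor loop can be sketched with asyncio. This is an illustrative sketch of the pattern described above, not the console's actual `_worker_monitor` code; `factories` and `cycles` are hypothetical (`cycles` bounds the loop here, whereas the real monitor runs forever).

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

async def worker_monitor(
    factories: Dict[str, Callable[[], Awaitable[None]]],
    interval: float = 60.0,
    cycles: Optional[int] = None,
) -> Dict[str, asyncio.Task]:
    """Restart any background worker whose task has finished.

    `factories` maps worker names to coroutine functions. Note the single
    point of failure: nothing restarts this loop if it crashes.
    """
    tasks = {name: asyncio.create_task(fn()) for name, fn in factories.items()}
    n = 0
    while cycles is None or n < cycles:
        await asyncio.sleep(interval)
        for name, task in tasks.items():
            if task.done():
                task.exception()  # retrieve the crash so asyncio does not warn
                tasks[name] = asyncio.create_task(factories[name]())
        n += 1
    return tasks
```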
Notification Channel Failures
| Channel | Failure mode | Retry? | Impact |
|---|---|---|---|
| Email (SMTP) | Timeout or server down | No | Approval notification silently dropped |
| Telegram | API down or invalid token | No | Approval notification silently dropped |
| Slack | Invalid webhook or rate-limited | No | Approval notification silently dropped |
| Discord | Signature validation fails | No | Webhook disabled by Discord |
| Webhook | SSRF detected or timeout | No | Event not delivered |
Channels fail independently. If one channel fails, others still fire. Approval workflows continue regardless of notification delivery — the approval sits in the queue until someone checks the dashboard or another channel delivers.
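The independent fan-out can be sketched as a loop that catches each channel's failure separately. Names here (`notify_all`, the callable-per-channel shape) are hypothetical, not the console's actual API.

```python
import logging

log = logging.getLogger("notifications")

def notify_all(channels, message: str) -> dict:
    """Fan out a notification to every configured channel.

    One channel raising does not stop the others; failures are logged
    and not retried, matching the table above. `channels` maps channel
    names to send callables.
    """
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = "sent"
        except Exception as exc:
            log.warning("channel %s failed: %s", name, exc)
            results[name] = "failed"
    return results
```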
Graceful Degradation
The console is not required for enforcement. When the console is completely unreachable:
- Agents continue evaluating contracts locally using their last-known bundle
- Audit events buffer locally (up to 10,000 events per agent)
- Approval workflows time out according to their configured `timeout_effect` (deny or allow)
- When the console comes back, agents reconnect and resume normal operation
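The fallback order agents use for contract bundles can be sketched as a first-available resolution. `resolve_bundle` and its argument names are hypothetical; the real logic lives in the core library.

```python
def resolve_bundle(live_bundle, cached_bundle, embedded_yaml):
    """Pick the contract source in degradation order.

    Live SSE update first, then the in-memory cached bundle, then the
    YAML embedded at build time; if none is available, the agent denies
    everything (fail closed).
    """
    sources = (
        ("sse", live_bundle),
        ("cache", cached_bundle),
        ("embedded", embedded_yaml),
    )
    for name, bundle in sources:
        if bundle is not None:
            return name, bundle
    return "deny-all", None
```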
The degradation chain for agents:
SSE (live updates) → In-memory (cached bundle) → Embedded YAML (fallback) → Deny all

Health Endpoints
| Endpoint | Returns | Use for |
|---|---|---|
| GET /api/v1/health | Full status: database, Redis, workers, connected agents | Dashboard monitoring, alerting |
| GET /api/v1/health/ready | 200 or 503 based on Postgres + Redis + workers | Kubernetes readiness probe |
| GET /api/v1/health/live | Always 200 if process is alive | Kubernetes liveness probe |
Alert on: `status: "degraded"`, any worker `"crashed"`, `redis_connected: false`, `database.connected: false`.
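A monitoring check over the full health payload can be sketched as below. The field names follow the alerting list above; the exact payload shape (nesting of `database`, the `workers` map) is an assumption, so verify against a live `GET /api/v1/health` response before relying on it.

```python
def health_alerts(payload: dict) -> list:
    """Derive alert conditions from a /api/v1/health JSON response.

    Returns a list of human-readable alert strings; an empty list
    means no alerting condition was found.
    """
    alerts = []
    if payload.get("status") == "degraded":
        alerts.append("console degraded")
    if not payload.get("redis_connected", True):
        alerts.append("redis disconnected")
    if not payload.get("database", {}).get("connected", True):
        alerts.append("database disconnected")
    for name, state in payload.get("workers", {}).items():
        if state == "crashed":
            alerts.append(f"worker {name} crashed")
    return alerts
```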
Next Steps
- Production checklist — pre-flight verification
- Architecture — internal design and component details
- Fail-closed guarantees — core library failure behavior