
Failure Modes

What happens when Postgres, Redis, SSE, or background workers fail. Component-by-component impact and recovery.

AI Assistance

Right page if: you need to understand what happens when a console component goes down -- database, cache, SSE connections, or background workers.
Wrong page if: you need the core library's fail-closed behavior (contract evaluation errors) -- see https://docs.edictum.ai/docs/security/fail-closed.
Gotcha: agents continue enforcing contracts locally when the console is unreachable. The console is for coordination and visibility, not enforcement. The most impactful failure is Postgres down -- audit events are lost until it recovers.

The console depends on Postgres, Redis, and background workers. Here is what happens when each fails.

Component Failure Matrix

Postgres Down

| Feature | Impact | Recovery |
| --- | --- | --- |
| Health endpoint | Returns `{"status": "degraded"}` | Automatic when Postgres recovers |
| Server startup | Hangs on Alembic migration (no timeout) | Restart after Postgres is available |
| Audit event ingestion | Events from agents fail (500); events not persisted | Events during outage are lost |
| Approval workflows | Creating/deciding approvals fails | Pending approvals resume after recovery |
| Bundle deployment | Cannot store or deploy bundles | Resumes after recovery |
| Dashboard auth (cookie) | Works (sessions in Redis, not Postgres) | N/A |
| Agent SSE connections | Stay open; no new bundles pushed | Connections maintained; push resumes |
| Notification channels | Existing channel config works; changes fail | Resumes after recovery |

Data loss risk: Audit events received during a Postgres outage are not persisted. Agents buffer up to 10,000 events locally; events beyond the buffer are dropped.
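The agent-side buffering described above can be sketched as a bounded queue. This is an illustrative model, not Edictum's actual implementation: the 10,000-event cap comes from this page, but the class name, method names, and the choice to evict the oldest events first are assumptions.

```python
from collections import deque

class AuditBuffer:
    """Bounded local buffer for audit events while the console is unreachable.

    Sketch only: the 10,000-event capacity matches the docs; evicting the
    oldest events first is an assumption of this example.
    """

    def __init__(self, capacity: int = 10_000):
        self._events: deque = deque(maxlen=capacity)
        self.dropped = 0  # count of events lost to the capacity limit

    def append(self, event: dict) -> None:
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry
        self._events.append(event)

    def drain(self) -> list[dict]:
        """Hand all buffered events to the HTTP sender once the console is back."""
        out = list(self._events)
        self._events.clear()
        return out
```

Once the console recovers, `drain()` would feed the surviving events into the normal HTTP ingestion path; anything counted in `dropped` is gone for good.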

Redis Down

| Feature | Impact | Recovery |
| --- | --- | --- |
| Health endpoint | Returns `{"status": "degraded", "redis_connected": false}` | Automatic when Redis recovers |
| Dashboard sessions | All active sessions lost; users must re-login | Sessions recreated on next login |
| Rate limiting | Bypassed until Redis recovers | Rate limits reset on recovery |
| Agent SSE connections | Existing connections stay open; new subscriptions may fail | Reconnect after recovery |
| Approval workflows | Unaffected (DB-backed) | N/A |
| Audit event ingestion | Unaffected (DB-backed) | N/A |
| Notification channels | Unaffected (DB-backed) | N/A |

Security risk: Rate limiting is bypassed during Redis outage. An attacker could brute-force login credentials during this window.
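The fail-open behavior can be illustrated with a small sketch: the limiter consults a Redis counter, and when Redis is unreachable it allows the request rather than locking everyone out. The function and exception names here are hypothetical stand-ins, not Edictum's code.

```python
class RedisDown(Exception):
    """Stand-in for a Redis connection error."""

def check_rate_limit(redis_incr, key: str, limit: int) -> bool:
    """Return True if the request is allowed.

    `redis_incr` models INCR on a counter key (with a TTL in a real setup).
    When Redis is down, the limiter fails open: no counter, no enforcement.
    """
    try:
        count = redis_incr(key)
    except RedisDown:
        return True  # fail open -- this is the brute-force window noted above
    return count <= limit
```

A fail-closed variant would return `False` in the except branch, but that would make every login attempt fail whenever Redis is down, which is why fail-open is the common trade-off here.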

SSE Connection Drops

| Scenario | Behavior | Recovery |
| --- | --- | --- |
| Agent loses network | SSE stream breaks; agent keeps last-known contracts | Agent auto-reconnects with exponential backoff (1s → 60s max) |
| Console restarts | All SSE connections drop | Agents reconnect automatically |
| Queue overflow (>1000 events) | Connection marked closed; warning logged | Cleanup task removes stale connections every 5 minutes; agent reconnects |
| Agent receives stale contracts | Drift detected in fleet monitoring | Trigger redeployment from the dashboard or wait for auto-reconnect |

No data loss: SSE carries contract updates, not audit data. Agents send audit events via HTTP POST (separate from SSE).
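The reconnect schedule from the table above can be sketched as a delay generator. Only the 1s starting point and 60s cap come from this page; the doubling factor is an assumption of this example.

```python
def backoff_delays(base: float = 1.0, cap: float = 60.0, factor: float = 2.0):
    """Yield successive reconnect delays: 1, 2, 4, ... capped at 60 seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor
```

A reconnect loop would sleep for each yielded delay between SSE connection attempts, resetting the generator once a connection succeeds.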

Background Worker Crash

| Worker | Check interval | Auto-restart | Side effect during outage |
| --- | --- | --- | --- |
| Approval timeout | 60s monitor | Yes | Pending approvals not expired for up to 60s; agents wait longer |
| Partition manager | 60s monitor | Yes | Event table may lack future partitions; writes may slow |
| Worker monitor | Not monitored | Manual restart required | Neither worker auto-restarts if the monitor itself crashes |

The `_worker_monitor` task checks every 60 seconds whether workers have crashed (`task.done()` returns `True`) and restarts them. If the monitor itself crashes, a manual restart is required (redeploy the pod).
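The monitor pattern described above can be sketched with asyncio. The interval is shortened for the example (the console checks every 60 seconds), and all names here are illustrative, not Edictum's internals.

```python
import asyncio

async def main() -> int:
    restarts = 0

    async def flaky_worker():
        raise RuntimeError("worker crashed")  # dies as soon as it runs

    workers = {"approval_timeout": asyncio.create_task(flaky_worker())}

    async def worker_monitor(cycles: int, interval: float = 0.01):
        """Periodically restart any worker task that has finished."""
        nonlocal restarts
        for _ in range(cycles):
            await asyncio.sleep(interval)
            for name, task in list(workers.items()):
                if task.done():          # crashed or exited
                    task.exception()     # retrieve the error (would be logged)
                    workers[name] = asyncio.create_task(flaky_worker())
                    restarts += 1

    await worker_monitor(cycles=3)
    # Drain the last restart so its exception is retrieved before shutdown.
    await asyncio.gather(*workers.values(), return_exceptions=True)
    return restarts

restart_count = asyncio.run(main())
```

The single point of failure is visible in the structure: `worker_monitor` restarts workers, but nothing restarts `worker_monitor`, which matches the "manual restart required" row in the table.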

Notification Channel Failures

| Channel | Failure mode | Retry? | Impact |
| --- | --- | --- | --- |
| Email (SMTP) | Timeout or server down | No | Approval notification silently not sent |
| Telegram | API down or invalid token | No | Approval notification silently not sent |
| Slack | Invalid webhook or rate-limited | No | Approval notification silently not sent |
| Discord | Signature validation fails | No | Webhook disabled by Discord |
| Webhook | SSRF detected or timeout | No | Event not delivered |

Channels fail independently. If one channel fails, others still fire. Approval workflows continue regardless of notification delivery — the approval sits in the queue until someone checks the dashboard or another channel delivers.
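The independence property can be sketched as a dispatch loop in which each channel's send is isolated in its own try/except, so one failure cannot stop the others. The function names are stand-ins, not Edictum's API.

```python
def notify_all(channels: dict, message: str) -> dict:
    """Fire every channel; return per-channel delivery status.

    `channels` maps a channel name to a send callable. A failing channel is
    recorded, never fatal -- the remaining channels still fire.
    """
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = "delivered"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results
```

Since there are no retries, the `results` dict is the only record of a failed delivery, which is why the approval itself remains in the queue rather than depending on notification success.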

Graceful Degradation

The console is not required for enforcement. When the console is completely unreachable:

  1. Agents continue evaluating contracts locally using their last-known bundle
  2. Audit events buffer locally (up to 10,000 events per agent)
  3. Approval workflows time out according to timeout_effect (deny or allow)
  4. When the console comes back, agents reconnect and resume normal operation

The degradation chain for agents:

SSE (live updates) → In-memory (cached bundle) → Embedded YAML (fallback) → Deny all
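The chain above amounts to trying each contract source in order and refusing everything if none is available. This sketch uses hypothetical loader callables; the deny-all sentinel and source names are assumptions made for illustration.

```python
def load_bundle(sources: list) -> dict:
    """Return the first bundle any source can provide, else deny all.

    `sources` is ordered: SSE (live), in-memory cache, embedded YAML.
    A source signals unavailability by raising or returning None.
    """
    for source in sources:
        try:
            bundle = source()
            if bundle is not None:
                return bundle
        except Exception:
            continue  # source unavailable; fall through to the next one
    return {"policy": "deny_all"}  # final fallback: refuse everything
```

The deny-all terminus is what makes the chain fail closed: an agent that has lost every contract source blocks actions rather than running unconstrained.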

Health Endpoints

| Endpoint | Returns | Use for |
| --- | --- | --- |
| `GET /api/v1/health` | Full status: database, Redis, workers, connected agents | Dashboard monitoring, alerting |
| `GET /api/v1/health/ready` | 200 or 503 based on Postgres + Redis + workers | Kubernetes readiness probe |
| `GET /api/v1/health/live` | Always 200 if the process is alive | Kubernetes liveness probe |

Alert on: `status: "degraded"`, any worker `"crashed"`, `redis_connected: false`, `database.connected: false`.
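Those conditions can be checked against a parsed `/api/v1/health` response. The payload shape here is inferred from the field names on this page (`status`, `redis_connected`, `database.connected`, `workers`); verify it against your console version before wiring it into alerting.

```python
def alert_reasons(health: dict) -> list[str]:
    """Return the list of alert conditions present in a health payload."""
    reasons = []
    if health.get("status") == "degraded":
        reasons.append("status degraded")
    if health.get("redis_connected") is False:
        reasons.append("redis disconnected")
    if health.get("database", {}).get("connected") is False:
        reasons.append("database disconnected")
    for name, state in health.get("workers", {}).items():
        if state == "crashed":
            reasons.append(f"worker {name} crashed")
    return reasons
```

A monitoring job would fetch the health endpoint on an interval and page whenever `alert_reasons` returns a non-empty list.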
