Incident Response Runbook Model

This is the public version of the FP Validated incident loop. It describes the operating model without exposing private endpoints or command-level recovery details.

Incident loop

Signal detected
→ Alertmanager / Grafana webhook
→ FP Validated Guard ingestion
→ deduplication and state update
→ incident/report pipeline
→ operator notification
→ runbook execution
→ evidence and postmortem notes
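
The deduplication and state update step can be sketched roughly as follows. This is an illustration only, not the Guard implementation. It assumes alerts arrive in the Alertmanager webhook format, where each alert carries a `fingerprint`, a `status`, and `labels`. The `IncidentStore` and `IncidentState` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentState:
    fingerprint: str
    status: str          # "firing" or "resolved"
    labels: dict
    seen_count: int = 1  # how many times this alert has been delivered


@dataclass
class IncidentStore:
    """Hypothetical in-memory store; a real one would need to be persistent."""
    incidents: dict = field(default_factory=dict)

    def ingest(self, payload: dict) -> list:
        """Apply one webhook payload and return the alerts worth notifying on.

        Only new alerts and status changes are returned, so a re-sent
        firing alert updates state without notifying the operator again.
        """
        to_notify = []
        for alert in payload.get("alerts", []):
            key = alert["fingerprint"]
            known = self.incidents.get(key)
            if known is None:
                # First time this fingerprint is seen: open a new record.
                state = IncidentState(key, alert["status"], alert.get("labels", {}))
                self.incidents[key] = state
                to_notify.append(state)
            elif known.status != alert["status"]:
                # Status changed (for example firing -> resolved): update and notify.
                known.status = alert["status"]
                known.seen_count += 1
                to_notify.append(known)
            else:
                # Duplicate delivery: count it, but stay quiet.
                known.seen_count += 1
        return to_notify
```

Keying on the fingerprint means repeated deliveries of the same alert collapse into one incident record, while a change in status still reaches the incident/report pipeline.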

Triage phases

| Phase | Goal | Examples | External control surface |
|---|---|---|---|
| Detect | Confirm the signal is real | exporter down, chain stall, peer loss, API degradation | dashboard, alert, Guard event |
| Classify | Determine blast radius | one node, one chain, one cluster, management plane, external dependency | labels, chain matrix, deployment history |
| Stabilize | Reduce risk | stop unsafe signing, isolate bad endpoint, pause rollout, preserve evidence | ArgoCD, Kubernetes API, Guard action |
| Recover | Restore service safely | restart non-signing service, resync workload, fail back to known-good deployment, restore from approved backup | runbook-controlled Kubernetes/GitOps operation |
| Verify | Prove recovery | dashboard green, alert resolved, chain height progressing, peers healthy, no double-sign risk | Prometheus, Grafana, chain RPC, Guard report |
| Record | Preserve learning | incident report, timeline, action items, rule/runbook updates | evidence store and docs update |
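
One way to keep an incident from jumping straight to recovery is to track the phase explicitly. The sketch below assumes the six phases are strictly sequential, which the table implies but does not state. The `Phase` and `Incident` names are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    CLASSIFY = 2
    STABILIZE = 3
    RECOVER = 4
    VERIFY = 5
    RECORD = 6


class Incident:
    """Hypothetical incident record that tracks its current triage phase."""

    def __init__(self, name: str):
        self.name = name
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self, target: Phase) -> None:
        """Move to the next phase, refusing to skip any phase in between."""
        expected = self.phase + 1
        if int(target) != expected:
            raise ValueError(
                f"cannot move from {self.phase.name} to {target.name}"
            )
        self.phase = Phase(target)
        self.history.append(self.phase)
```

With this shape, an attempt to go from Detect directly to Recover raises an error instead of silently skipping Classify and Stabilize.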

Validator-specific guardrails

  • Signing safety beats uptime.
  • Ambiguous active validator state means stop and investigate.
  • Do not auto-failover signing roles without an explicit runbook.
  • Secret or signer mismatch is a critical incident.
  • Recovery evidence must include both Kubernetes state and chain-level behavior.
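
The guardrails above could be encoded as a pre-flight check that runs before any automated action. This is a sketch under stated assumptions: the `ActionRequest` fields, the `evaluate` function, and the outcome strings are all hypothetical, and the check conservatively applies the runbook requirement to every action that touches a signing role, not only failover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionRequest:
    action: str                       # e.g. "restart" or "failover"
    touches_signing: bool             # does the target hold a signing role?
    active_validator_known: bool      # is the active validator unambiguous?
    signer_matches_expected: bool     # does the signer/secret match expectations?
    runbook_id: Optional[str] = None  # explicit runbook authorizing the action


def evaluate(req: ActionRequest) -> str:
    """Return "allow", "needs_runbook", "stop_and_investigate",
    or "critical_incident". Checks run from most to least severe."""
    if not req.signer_matches_expected:
        # Secret or signer mismatch is a critical incident.
        return "critical_incident"
    if not req.active_validator_known:
        # Ambiguous active validator state means stop and investigate.
        return "stop_and_investigate"
    if req.touches_signing and req.runbook_id is None:
        # No automated action on signing roles without an explicit runbook.
        return "needs_runbook"
    return "allow"
```

Ordering the checks from most to least severe means an uptime-motivated request can never pass while a signing-safety condition is unresolved.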

Serverless-for-operators principle

Operators should not need routine SSH access to production nodes. Normal incident handling should happen through separated, auditable surfaces:

| Need | Preferred surface |
|---|---|
| Check health | Grafana, Prometheus, Guard health endpoint. |
| Inspect deployment state | ArgoCD and Kubernetes API with scoped RBAC. |
| Roll back a web/service workload | ArgoCD image or manifest rollback. |
| Restart a non-signing workload | Guard/Kubernetes action with audit trail. |
| Investigate chain status | Chain-specific dashboard and RPC/API checks. |
| Handle signer ambiguity | Stop and escalate through validator runbook; avoid automated recovery. |

Direct server access is a break-glass path, not the default operating model.
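
An auditable surface means that every action leaves a record, whether or not it succeeds. The following sketch shows one possible shape for such a wrapper. The `audited` function, its parameters, and the entry fields are hypothetical, and the list-based log stands in for whatever evidence store is actually used.

```python
import json
from datetime import datetime, timezone


def audited(action: str, surface: str, operator: str, run, log: list):
    """Run `run()` and append one JSON audit entry to `log`.

    The entry is written in a `finally` block, so a failed action
    is recorded as well as a successful one.
    """
    entry = {
        "action": action,
        "surface": surface,
        "operator": operator,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        result = run()
        entry["outcome"] = "ok"
        return result
    except Exception as exc:
        entry["outcome"] = f"error: {exc}"
        raise
    finally:
        entry["finished_at"] = datetime.now(timezone.utc).isoformat()
        log.append(json.dumps(entry, sort_keys=True))
```

A call such as `audited("restart", "guard", "alice", do_restart, log)` would run a hypothetical `do_restart` callable and leave one entry in `log` regardless of the outcome.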

Public alert categories

| Category | Typical response |
|---|---|
| Availability | Check pod/node/exporter/service health. |
| Chain liveness | Verify height, peers, consensus status, and upstream RPC. |
| Signing safety | Freeze automation and validate signer topology. |
| Performance | Inspect resource pressure, disk, network, and backlog. |
| Observability quality | Repair missing labels, scrape targets, or pipeline drop events. |
| Security/access | Rotate affected credentials and review access logs through private systems. |
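
The category table can be read as a simple routing map. The sketch below assumes each alert carries a `category` label (a hypothetical label name) and treats an alert with a missing or unknown category as an observability quality problem. It also places signing-safety alerts first, following the rule that signing safety beats uptime; the remaining alerts keep their arrival order.

```python
RESPONSES = {
    "availability": "Check pod/node/exporter/service health.",
    "chain_liveness": "Verify height, peers, consensus status, and upstream RPC.",
    "signing_safety": "Freeze automation and validate signer topology.",
    "performance": "Inspect resource pressure, disk, network, and backlog.",
    "observability_quality": "Repair missing labels, scrape targets, or pipeline drop events.",
    "security_access": "Rotate affected credentials and review access logs through private systems.",
}


def plan(alerts: list) -> list:
    """Return (category, response) pairs, signing-safety alerts first."""
    planned = []
    for alert in alerts:
        category = alert.get("labels", {}).get("category")
        if category not in RESPONSES:
            # Assumption: an unlabeled alert is itself a labeling defect.
            category = "observability_quality"
        planned.append((category, RESPONSES[category]))
    # Stable sort: False sorts before True, so signing_safety moves to the
    # front and every other alert keeps its relative order.
    planned.sort(key=lambda item: item[0] != "signing_safety")
    return planned
```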