Incident Response Runbook Model

This is the public version of the FP Validated incident loop. It describes the operating model without exposing private endpoints or command-level recovery details.

Incident loop

Signal detected
  → Alertmanager / Grafana webhook
  → FP Validated Guard ingestion
  → deduplication and state update
  → incident/report pipeline
  → operator notification
  → runbook execution
  → evidence and postmortem notes
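
The deduplication and state update step can be illustrated with a minimal sketch. It assumes each incoming alert carries a `fingerprint` and a `status` field, in the style of an Alertmanager webhook payload. `Incident` and `IncidentStore` are hypothetical names for illustration, not the Guard's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Incident:
    fingerprint: str
    status: str      # "firing" or "resolved"
    count: int = 1   # number of times this alert has been received


@dataclass
class IncidentStore:
    """Hypothetical in-memory store; a real system would persist state."""
    incidents: Dict[str, Incident] = field(default_factory=dict)

    def ingest(self, alert: dict) -> bool:
        """Record an alert and return True only when an operator should be notified."""
        fp = alert["fingerprint"]
        status = alert["status"]
        existing = self.incidents.get(fp)
        if existing is None:
            if status == "resolved":
                # Resolution for an incident we never saw: nothing to report.
                return False
            self.incidents[fp] = Incident(fp, status)
            return True
        # Known incident: count the repeat, notify only on a status change.
        existing.count += 1
        changed = existing.status != status
        existing.status = status
        return changed
```

In this sketch a repeated firing alert updates the counter without producing a second notification, while a transition from firing to resolved does notify.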

Triage phases

Phase	Goal	Examples	External control surface
Detect	Confirm the signal is real	exporter down, chain stall, peer loss, API degradation	dashboard, alert, Guard event
Classify	Determine blast radius	one node, one chain, one cluster, management plane, external dependency	labels, chain matrix, deployment history
Stabilize	Reduce risk	stop unsafe signing, isolate bad endpoint, pause rollout, preserve evidence	ArgoCD, Kubernetes API, Guard action
Recover	Restore service safely	restart non-signing service, resync workload, fail back to known-good deployment, restore from approved backup	runbook-controlled Kubernetes/GitOps operation
Verify	Prove recovery	dashboard green, alert resolved, chain height progressing, peers healthy, no double-sign risk	Prometheus, Grafana, chain RPC, Guard report
Record	Preserve learning	incident report, timeline, action items, rule/runbook updates	evidence store and docs update
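
One way to read the table above is as an ordered sequence in which a phase is left only once its goal is met. The sketch below models that reading; the `Phase` enum and `next_phase` helper are illustrative assumptions, and a real incident may need to move backwards (for example from Verify back to Stabilize), which this sketch does not cover.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    DETECT = 1
    CLASSIFY = 2
    STABILIZE = 3
    RECOVER = 4
    VERIFY = 5
    RECORD = 6


def next_phase(current: Phase, goal_met: bool) -> Optional[Phase]:
    """Stay in the current phase until its goal is met.

    Returns None once the Record phase is complete, meaning the incident is closed.
    """
    if not goal_met:
        return current
    if current is Phase.RECORD:
        return None
    return Phase(current + 1)
```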

Validator-specific guardrails

Signing safety beats uptime.
Ambiguous active validator state means stop and investigate.
Do not auto-failover signing roles without an explicit runbook.
Secret or signer mismatch is a critical incident.
Recovery evidence must include both Kubernetes state and chain-level behavior.
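
The guardrails above can be expressed as a decision order in which signing checks always run before any recovery check. The following is a minimal sketch under stated assumptions: the `RecoveryCheck` fields and the returned strings are hypothetical, and treating more than one active signer as ambiguous is an assumption of this example rather than a rule stated in the runbook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryCheck:
    active_signer_count: Optional[int]  # None when the state cannot be determined
    signer_matches_config: bool         # signer identity and secret match expectations
    kubernetes_healthy: bool            # Kubernetes-level evidence
    chain_height_advancing: bool        # chain-level evidence


def decide(check: RecoveryCheck) -> str:
    # A secret or signer mismatch is a critical incident, checked first.
    if not check.signer_matches_config:
        return "critical: signer mismatch"
    # Ambiguous active validator state: stop and investigate, never auto-recover.
    if check.active_signer_count is None or check.active_signer_count > 1:
        return "stop: investigate signer state"
    # Recovery counts only with both Kubernetes and chain-level evidence.
    if check.kubernetes_healthy and check.chain_height_advancing:
        return "recovered"
    return "not verified"
```

Placing the signing checks first means a healthy-looking cluster can never mask a signer problem, which matches the rule that signing safety beats uptime.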

Serverless-for-operators principle

Operators should not need routine SSH access to production nodes. Normal incident handling should happen through separated, auditable surfaces:

Need	Preferred surface
Check health	Grafana, Prometheus, Guard health endpoint.
Inspect deployment state	ArgoCD and Kubernetes API with scoped RBAC.
Roll back a web/service workload	ArgoCD image or manifest rollback.
Restart a non-signing workload	Guard/Kubernetes action with audit trail.
Investigate chain status	Chain-specific dashboard and RPC/API checks.
Handle signer ambiguity	Stop and escalate through validator runbook; avoid automated recovery.

Direct server access is a break-glass path, not the default operating model.

Public alert categories

Category	Typical response
Availability	Check pod/node/exporter/service health.
Chain liveness	Verify height, peers, consensus status, and upstream RPC.
Signing safety	Freeze automation and validate signer topology.
Performance	Inspect resource pressure, disk, network, and backlog.
Observability quality	Repair missing labels, scrape targets, or pipeline drop events.
Security/access	Rotate affected credentials and review access logs through private systems.
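
The category table above can be treated as a lookup from alert category to first response, with signing safety as the one category that also freezes automation. The sketch below is illustrative only: `ResponsePlan`, `RESPONSES`, and `plan_for` are hypothetical names, and raising an error on an unknown category is an assumption of this example.

```python
from typing import NamedTuple


class ResponsePlan(NamedTuple):
    response: str
    freeze_automation: bool


# Responses copied from the table above.
RESPONSES = {
    "Availability": "Check pod/node/exporter/service health.",
    "Chain liveness": "Verify height, peers, consensus status, and upstream RPC.",
    "Signing safety": "Freeze automation and validate signer topology.",
    "Performance": "Inspect resource pressure, disk, network, and backlog.",
    "Observability quality": "Repair missing labels, scrape targets, or pipeline drop events.",
    "Security/access": "Rotate affected credentials and review access logs through private systems.",
}


def plan_for(category: str) -> ResponsePlan:
    """Return the typical response; unknown categories raise KeyError (an assumption)."""
    return ResponsePlan(RESPONSES[category], category == "Signing safety")
```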
