Incident Response

Incident response turns alerts into coordinated action. The goal is to restore safe service, protect users and keys, and leave an audit trail that improves the next response.

Severity levels

| Severity | Definition | Examples | Response |
| --- | --- | --- | --- |
| SEV1 | Critical production impact or security risk requiring immediate coordination | Widespread RPC outage, validator key exposure, corrupt production state, public admin endpoint exposure | Page incident commander, technical lead, security owner, and communications lead. Start timeline immediately. |
| SEV2 | Material degradation with workaround or limited blast radius | One region down, elevated RPC errors, sync lag on part of fleet, failed upgrade affecting redundant nodes | Page on-call and chain owner. Escalate if user impact grows or no recovery path is clear. |
| SEV3 | Minor production issue or urgent operational risk | Backup job failure, low peer count on non-critical node, dashboard regression | Ticket with owner and due date; page only if it threatens higher severity. |
| SEV4 | Informational follow-up | Documentation correction, non-production issue, completed upstream advisory review | Track in normal work queue. |
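The severity table can be encoded as data so that paging tools and humans agree on the response. The following is a minimal sketch, not an existing tool: the `Severity` enum, `ResponsePolicy` class, role identifiers, and tracking labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    page: tuple[str, ...]  # roles paged at declaration (hypothetical identifiers)
    tracking: str          # where the work is tracked (hypothetical labels)


# Illustrative encoding of the severity table; adjust names to your paging system.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        page=("incident_commander", "technical_lead",
              "security_owner", "communications_lead"),
        tracking="incident",
    ),
    Severity.SEV2: ResponsePolicy(page=("on_call", "chain_owner"), tracking="incident"),
    Severity.SEV3: ResponsePolicy(page=(), tracking="ticket"),
    Severity.SEV4: ResponsePolicy(page=(), tracking="work_queue"),
}


def should_page(severity: Severity, threatens_higher_severity: bool = False) -> bool:
    """SEV1 and SEV2 always page; SEV3 pages only if it threatens a higher severity."""
    if severity in (Severity.SEV1, Severity.SEV2):
        return True
    if severity is Severity.SEV3:
        return threatens_higher_severity
    return False
```

Keeping the policy in one mapping means a change to the table is a one-line change in code, and the paging decision stays a pure function that is easy to test.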

:::danger Security incidents
If secrets, signing keys, admin credentials, or privileged endpoints may be exposed, treat the event as security-impacting until disproven. Preserve evidence, restrict access, and avoid posting secrets or exploit details in broad channels.
:::

Incident roles

| Role | Responsibility |
| --- | --- |
| Incident commander | Owns severity, coordination, decisions, and timeline hygiene. |
| Technical lead | Directs diagnosis and remediation. Keeps operators focused on one recovery path at a time. |
| Communications lead | Sends internal and external updates, status page notes, and stakeholder summaries. |
| Scribe | Records timeline, commands, observations, decisions, and links to dashboards or PRs. |
| Security owner | Leads containment when credentials, keys, data exposure, or abuse are possible. |

One person can hold multiple roles for small incidents, but SEV1 incidents should split command, technical work, and communications.
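The role-splitting rule can be checked mechanically when an incident is declared. This is a rough sketch under assumptions: the function name `role_conflicts`, the role identifiers, and the choice to flag an unassigned role on a SEV1 are illustrative, not part of any existing tooling.

```python
def role_conflicts(severity: str, assignments: dict[str, str]) -> list[str]:
    """Return problems with a role assignment (role name -> person).

    Only SEV1 is checked: command, technical work, and communications
    should be held by three different people.
    """
    if severity != "SEV1":
        return []

    split_roles = ("incident_commander", "technical_lead", "communications_lead")
    problems = []
    holders: dict[str, list[str]] = {}

    for role in split_roles:
        person = assignments.get(role)
        if person is None:
            # Assumption: an unassigned split role is also a problem on SEV1.
            problems.append(f"{role} is unassigned")
        else:
            holders.setdefault(person, []).append(role)

    for person, roles in holders.items():
        if len(roles) > 1:
            problems.append(f"{person} holds {' and '.join(roles)}")
    return problems
```

For example, `role_conflicts("SEV1", {"incident_commander": "ana", "technical_lead": "ana", "communications_lead": "cy"})` reports that `ana` holds two of the roles that should be split.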

Runbook structure during an incident

  1. State the impact in user terms.
  2. Declare severity and roles.
  3. Freeze unrelated deploys for affected systems.
  4. Capture baseline evidence: dashboards, logs, recent changes, alerts, and upstream status.
  5. Choose the safest immediate mitigation: drain, rollback, rate-limit, disable endpoint, restore, or fail over.
  6. Validate recovery with monitoring and real smoke tests.
  7. Communicate status and next update timing.
  8. Close only after impact has ended and follow-up owners are assigned.

# Example timeline entry format for incident notes.
2026-05-28T14:03Z SEV2 declared: Ethereum public JSON-RPC p95 latency above SLO in us-east.
2026-05-28T14:06Z Drained rpc-geth-03 from gateway; error ratio dropped from 8% to 1.2%.
2026-05-28T14:12Z Smoke test passed against remaining pool; monitoring continues.
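A scribe can generate entries in this format instead of typing timestamps by hand. The helper below is a small sketch; `timeline_entry` is a hypothetical name, and the requirement that timestamps be timezone-aware is an assumption made to avoid silently recording local time as UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timeline_entry(message: str, at: Optional[datetime] = None) -> str:
    """Format one timeline line as '<UTC timestamp to the minute>Z <message>'."""
    moment = at if at is not None else datetime.now(timezone.utc)
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Convert any offset to UTC so all entries sort and compare consistently.
    utc_moment = moment.astimezone(timezone.utc)
    return f"{utc_moment.strftime('%Y-%m-%dT%H:%MZ')} {message}"


if __name__ == "__main__":
    # An operator's local clock at UTC-4 is normalized to UTC in the note.
    local = datetime(2026, 5, 28, 10, 6, tzinfo=timezone(timedelta(hours=-4)))
    print(timeline_entry("Drained rpc-geth-03 from gateway.", local))
```

Normalizing to UTC at write time keeps entries from operators in different regions in a single comparable sequence.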

Communications cadence

| Audience | Content | Cadence |
| --- | --- | --- |
| Operators | Current hypothesis, assigned actions, blockers, next decision | As work changes; avoid noisy speculation. |
| Internal stakeholders | Impact, severity, mitigation, next update | At declaration and material changes. |
| External users | User-visible impact, affected endpoints, workaround, recovery status | For SEV1/SEV2 user impact or contractual obligations. |
| Post-incident readers | Timeline, root cause, contributing factors, action items | After resolution. |
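One way to apply the cadence table is a small function that lists who needs an update for a given event. This sketch rests on several assumptions that the table does not state: the event names are hypothetical, operators are assumed to receive updates on every event before resolution, and external users are assumed to be updated at declaration, material changes, and resolution rather than on every work change.

```python
EVENTS = ("declared", "work_change", "material_change", "resolved")


def audiences_to_update(
    severity: str,
    event: str,
    user_impact: bool = False,
    contractual_obligation: bool = False,
) -> list[str]:
    """Return the audiences that should receive an update for this event."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")

    audiences = []
    # Operators: as work changes (assumed to cover every pre-resolution event).
    if event != "resolved":
        audiences.append("operators")
    # Internal stakeholders: at declaration and material changes.
    if event in ("declared", "material_change"):
        audiences.append("internal_stakeholders")
    # External users: SEV1/SEV2 with user impact, or contractual obligations.
    external_needed = (severity in ("SEV1", "SEV2") and user_impact) or contractual_obligation
    if external_needed and event != "work_change":
        audiences.append("external_users")
    # Post-incident readers: after resolution.
    if event == "resolved":
        audiences.append("post_incident_readers")
    return audiences
```

For instance, declaring a SEV2 with user impact yields operators, internal stakeholders, and external users, while a routine work change on a SEV3 yields operators only.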

:::tip Write for the next responder
A good incident note lets another operator understand what happened without reconstructing Slack, dashboards, and terminal history. Include links, exact endpoint names, and validation evidence.
:::

Post-incident review

Every SEV1 and SEV2 should produce a short review:

  • What happened and how users were affected.
  • Detection source and why it did or did not work.
  • Root cause and contributing factors.
  • What mitigated the incident.
  • What made response slower or riskier.
  • Action items with owners and verification criteria.

Action items should fix systems, runbooks, alerts, tests, or ownership gaps. Avoid action items that only ask people to be more careful.
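A review template can enforce these expectations with a simple lint. The sketch below is illustrative only: the `ActionItem` fields, the `review_problems` function, and the list of vague phrases are assumptions, and the phrase match is a crude heuristic rather than a reliable classifier.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    verification: str = ""  # how completion will be verified


# Hypothetical phrases that suggest the item asks for care rather than a fix.
VAGUE_PHRASES = ("be more careful", "pay more attention", "remember to")


def review_problems(item: ActionItem) -> list[str]:
    """Return reasons an action item does not meet the review expectations."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if not item.verification.strip():
        problems.append("missing verification criteria")
    lowered = item.description.lower()
    if any(phrase in lowered for phrase in VAGUE_PHRASES):
        problems.append("asks for care instead of a system change")
    return problems
```

An item such as `ActionItem("Add alert on backup job failure", "dana", "Alert fires in staging drill")` passes with no problems, while an item that only asks operators to be more careful is flagged on all three checks.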