Incident Response

Incident response turns alerts into coordinated action. The goal is to restore safe service, protect users and keys, and leave an audit trail that improves the next response.

Severity levels

Severity	Definition	Examples	Response
SEV1	Critical production impact or security risk requiring immediate coordination	Widespread RPC outage, validator key exposure, corrupt production state, public admin endpoint exposure	Page incident commander, technical lead, security owner, and communications lead. Start timeline immediately.
SEV2	Material degradation with workaround or limited blast radius	One region down, elevated RPC errors, sync lag on part of fleet, failed upgrade affecting redundant nodes	Page on-call and chain owner. Escalate if user impact grows or no recovery path is clear.
SEV3	Minor production issue or urgent operational risk	Backup job failure, low peer count on non-critical node, dashboard regression	Ticket with owner and due date; page only if it threatens higher severity.
SEV4	Informational follow-up	Documentation correction, non-production issue, completed upstream advisory review	Track in normal work queue.
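The table can also be kept as data so that paging tools and humans read the same policy. The following is a minimal sketch, not an existing tool: the role identifiers, the `track_as` labels, and the `roles_to_page` helper are hypothetical names, and paging `on_call` for an escalating SEV3 is an assumption, since the table does not say who is paged in that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"
    SEV4 = "SEV4"


@dataclass(frozen=True)
class ResponsePolicy:
    page: Tuple[str, ...]  # roles paged at declaration
    track_as: str          # hypothetical label for where the work is tracked


# Mirrors the "Response" column of the severity table.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        page=("incident_commander", "technical_lead",
              "security_owner", "communications_lead"),
        track_as="incident",
    ),
    Severity.SEV2: ResponsePolicy(page=("on_call", "chain_owner"), track_as="incident"),
    Severity.SEV3: ResponsePolicy(page=(), track_as="ticket"),
    Severity.SEV4: ResponsePolicy(page=(), track_as="work_queue"),
}


def roles_to_page(severity: Severity, threatens_higher_severity: bool = False) -> Tuple[str, ...]:
    """Return the roles to page for a newly declared incident."""
    if severity is Severity.SEV3 and threatens_higher_severity:
        # Assumption: the table says to page but does not name a role.
        return ("on_call",)
    return POLICIES[severity].page
```

A SEV3 backup failure would return an empty tuple and stay a ticket, while the same failure flagged as threatening a higher severity would page the assumed on-call role.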

:::danger Security incidents
If secrets, signing keys, admin credentials, or privileged endpoints may be exposed, treat the event as security-impacting until disproven. Preserve evidence, restrict access, and avoid posting secrets or exploit details in broad channels.
:::

Incident roles

Role	Responsibility
Incident commander	Owns severity, coordination, decisions, and timeline hygiene.
Technical lead	Directs diagnosis and remediation. Keeps operators focused on one recovery path at a time.
Communications lead	Sends internal and external updates, status page notes, and stakeholder summaries.
Scribe	Records timeline, commands, observations, decisions, and links to dashboards or PRs.
Security owner	Leads containment when credentials, keys, data exposure, or abuse are possible.

One person can hold multiple roles during a small incident, but a SEV1 incident should assign command, technical work, and communications to three different people.
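The role-split rule is easy to check mechanically at declaration time. This is a sketch under stated assumptions: the role keys and the `role_split_problems` function are hypothetical, and treating a missing role as a problem for SEV1 is an interpretation of the rule rather than something the table states.

```python
from typing import Dict, List

# The three roles that should be held by different people during a SEV1.
SPLIT_ROLES = ("incident_commander", "technical_lead", "communications_lead")


def role_split_problems(severity: str, assignments: Dict[str, str]) -> List[str]:
    """Return problems with a role assignment; an empty list means it is acceptable."""
    if severity != "SEV1":
        # Smaller incidents may combine roles.
        return []

    problems: List[str] = []
    holders: Dict[str, str] = {}  # person -> first split role they hold
    for role in SPLIT_ROLES:
        person = assignments.get(role)
        if person is None:
            problems.append(f"{role} is unassigned")
        elif person in holders:
            problems.append(f"{person} holds both {holders[person]} and {role}")
        else:
            holders[person] = role
    return problems
```

For example, assigning the same person as incident commander and communications lead on a SEV1 returns one problem naming both roles, while the same assignment on a SEV2 returns none.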

Runbook structure during an incident

1. State the impact in user terms.
2. Declare severity and roles.
3. Freeze unrelated deploys for affected systems.
4. Capture baseline evidence: dashboards, logs, recent changes, alerts, and upstream status.
5. Choose the safest immediate mitigation: drain, rollback, rate-limit, disable endpoint, restore, or fail over.
6. Validate recovery with monitoring and real smoke tests.
7. Communicate status and next update timing.
8. Close only after impact has ended and follow-up owners are assigned.

# Example timeline entry format for incident notes.
2026-05-28T14:03Z SEV2 declared: Ethereum public JSON-RPC p95 latency above SLO in us-east.
2026-05-28T14:06Z Drained rpc-geth-03 from gateway; error ratio dropped from 8% to 1.2%.
2026-05-28T14:12Z Smoke test passed against remaining pool; monitoring continues.
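A scribe can produce entries in this shape with a small helper, which keeps timestamps in UTC and at minute precision regardless of the operator's local clock. The sketch below is illustrative only; `timeline_entry` is a hypothetical name, and rejecting naive timestamps is a design choice made here, not a rule from this runbook.

```python
from datetime import datetime, timezone
from typing import Optional


def timeline_entry(message: str, when: Optional[datetime] = None) -> str:
    """Format one incident note as '<UTC timestamp to the minute>Z <message>'."""
    if when is None:
        when = datetime.now(timezone.utc)
    if when.tzinfo is None:
        # A naive timestamp is ambiguous across operators in different zones.
        raise ValueError("timestamp must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    return f"{stamp} {message.strip()}"


# Reproduces the first example entry above.
print(timeline_entry(
    "SEV2 declared: Ethereum public JSON-RPC p95 latency above SLO in us-east.",
    datetime(2026, 5, 28, 14, 3, tzinfo=timezone.utc),
))
```

Calling `timeline_entry("...")` without a timestamp stamps the note with the current UTC time.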

Communications cadence

Audience	Content	Cadence
Operators	Current hypothesis, assigned actions, blockers, next decision	As work changes; avoid noisy speculation.
Internal stakeholders	Impact, severity, mitigation, next update	At declaration and material changes.
External users	User-visible impact, affected endpoints, workaround, recovery status	For SEV1/SEV2 user impact or contractual obligations.
Post-incident readers	Timeline, root cause, contributing factors, action items	After resolution.
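The cadence table can be read as a rule that maps an incident event to the audiences that need an update. The sketch below is one interpretation, not a definitive policy: the event names and the `audiences_to_update` function are hypothetical, and it assumes that external updates, when they apply, go out at declaration, at material changes, and at resolution, and that a declaration or material change also counts as a change in operator work.

```python
from typing import List

EVENTS = {"declared", "work_changed", "material_change", "resolved"}


def audiences_to_update(
    event: str,
    severity: str,
    user_impact: bool = False,
    contractual_obligation: bool = False,
) -> List[str]:
    """Return the audiences that should receive an update for this event."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")

    audiences: List[str] = []
    # Operators: as work changes.
    if event in {"declared", "work_changed", "material_change"}:
        audiences.append("operators")
    # Internal stakeholders: at declaration and material changes.
    if event in {"declared", "material_change"}:
        audiences.append("internal_stakeholders")
    # External users: SEV1/SEV2 with user impact, or a contractual obligation.
    external_applies = (severity in {"SEV1", "SEV2"} and user_impact) or contractual_obligation
    if external_applies and event in {"declared", "material_change", "resolved"}:
        audiences.append("external_users")
    # Post-incident readers: after resolution.
    if event == "resolved":
        audiences.append("post_incident_readers")
    return audiences
```

Under these assumptions, a routine change in operator work during a SEV1 updates operators only, which matches the table's guidance to avoid noisy speculation in wider channels.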

:::tip Write for the next responder
A good incident note lets another operator understand what happened without reconstructing Slack, dashboards, and terminal history. Include links, exact endpoint names, and validation evidence.
:::

Post-incident review

Every SEV1 and SEV2 incident should produce a short review that covers:

- What happened and how users were affected.
- Detection source and why it did or did not work.
- Root cause and contributing factors.
- What mitigated the incident.
- What made response slower or riskier.
- Action items with owners and verification criteria.

Action items should fix systems, runbooks, alerts, tests, or ownership gaps. Avoid action items that only ask people to be more careful.
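A review template can reject weak action items before the review is published. The following is a rough sketch: the `ActionItem` fields, the `action_item_problems` function, and the phrase list are all hypothetical, and substring matching is only a crude heuristic for spotting items that ask people to be more careful.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    verification: str  # how completion will be verified


# Hypothetical phrases that suggest the item changes behavior, not a system.
VAGUE_PHRASES = ("be more careful", "pay more attention", "remember to")


def action_item_problems(item: ActionItem) -> List[str]:
    """Return reasons an action item is not ready; an empty list means it passes."""
    problems: List[str] = []
    if not item.owner.strip():
        problems.append("missing owner")
    if not item.verification.strip():
        problems.append("missing verification criteria")
    lowered = item.description.lower()
    if any(phrase in lowered for phrase in VAGUE_PHRASES):
        problems.append("asks for care instead of changing a system, runbook, alert, test, or owner")
    return problems
```

An item such as "Operators should be more careful during upgrades" with no owner fails all three checks, while an item that adds an alert, names an owner, and states how the alert will be exercised passes.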
