Monitoring Standards

Monitoring must show whether a node is alive, in consensus with the network, useful to clients, and safe to operate. Every production deployment should export Prometheus metrics and have Grafana dashboards for chain health, RPC health, and host resources.

Golden signals

| Signal | What to measure | Why it matters |
| --- | --- | --- |
| Availability | Process up, health endpoint, RPC smoke success | A running process is not enough; the service must answer real requests. |
| Latency | RPC p50/p95/p99, consensus API response time, disk latency | Latency rises before outright failures on overloaded nodes. |
| Traffic | RPC request rate, WebSocket/subscription count, peer traffic | Baseline traffic distinguishes user demand from abuse or replay storms. |
| Errors | RPC error rate, HTTP 5xx, rejected subscriptions, client panic count | Error spikes reveal bad deploys, upstream bugs, and client misuse. |
| Saturation | CPU, memory, file descriptors, disk usage, disk I/O, network I/O | Saturation predicts missed slots, sync stalls, and slow RPC responses. |
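
To illustrate why the latency signal is tracked as percentiles rather than an average, here is a minimal sketch using the nearest-rank method. The `percentile` helper and the sample latencies are made up for illustration; in a real deployment Prometheus histograms and `histogram_quantile` would normally do this work.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical RPC latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [20] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

With these samples the mean is 218 ms and the median is 20 ms, while p95 and p99 are both 2000 ms. One request in ten is a hundred times slower than the median, and only the tail percentiles show it.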

Chain node signals

| Metric | Alert direction | Notes |
| --- | --- | --- |
| Sync lag | Alert when local height/checkpoint/slot lags network reference beyond the chain SLO | Compare to at least one trusted external reference or a healthy peer in the fleet. |
| Peer count | Alert when below the minimum for the node role | Low peers can isolate a node even when the process is healthy. |
| RPC latency | Alert on p95 and p99, not just averages | Track by method where labels are available. |
| RPC error rate | Alert on 5xx and chain-specific execution errors separately | Separate user validation errors from node failures. |
| Disk growth | Alert before volume exhaustion | Archive and indexer nodes need separate capacity curves. |
| Reorg/fork indicators | Alert when exposed by the client | Treat repeated reorgs or equivocation indicators as incident inputs. |
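
The sync lag and peer count checks can be sketched as a small evaluation function. This is only an illustration: `NodeSample`, `chain_alerts`, and the thresholds (`max_lag=5`, `min_peers=8`) are hypothetical, and real limits should come from the chain SLO and the node role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    local_height: int      # height or slot reported by the node itself
    reference_height: int  # height or slot from a trusted reference or healthy peer
    peer_count: int


def chain_alerts(sample, max_lag, min_peers):
    """Return the names of the chain-level alerts this sample would raise."""
    alerts = []
    # A node that is ahead of the reference is not lagging, so clamp at zero.
    lag = max(0, sample.reference_height - sample.local_height)
    if lag > max_lag:
        alerts.append("sync_lag")
    if sample.peer_count < min_peers:
        alerts.append("low_peers")
    return alerts


healthy = NodeSample(local_height=1_000_000, reference_height=1_000_002, peer_count=40)
stalled = NodeSample(local_height=999_900, reference_height=1_000_002, peer_count=3)

print(chain_alerts(healthy, max_lag=5, min_peers=8))  # []
print(chain_alerts(stalled, max_lag=5, min_peers=8))  # ['sync_lag', 'low_peers']
```

Keeping the reference height as an explicit input makes the dependency on an external source visible: if the reference itself is stale, the lag figure is meaningless.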

:::tip Dashboard design
Build dashboards from the operator question outward: "Is the fleet healthy?", "Is this node synced?", "Are users seeing RPC failures?", and "Which resource is saturated?" Avoid dashboards that only list raw exporter metrics.
:::

Prometheus and Grafana baseline

Every node should provide:

  • A private metrics endpoint reachable only by Prometheus.
  • Labels for chain, network, node role, runtime, region, and instance.
  • Recording rules for sync lag, RPC latency buckets, RPC error ratio, and peer count.
  • Grafana panels for node state, RPC health, host resources, and recent deploy markers.

groups:
  - name: node-rpc
    rules:
      - record: fp:rpc_error_ratio:5m
        expr: sum(rate(rpc_requests_total{status=~"5.."}[5m])) by (chain, instance) / sum(rate(rpc_requests_total[5m])) by (chain, instance)
      - alert: RpcErrorRateHigh
        expr: fp:rpc_error_ratio:5m > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: RPC error rate is above 5% for {{ $labels.chain }} {{ $labels.instance }}
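
To make the arithmetic behind the `fp:rpc_error_ratio:5m` rule concrete, the sketch below reproduces the grouping and division in plain Python. It is an approximation for illustration only: the `error_ratio` helper, the chain name, and the request counts are invented, and it works on per-window counter increases rather than true `rate()` values (the ratio is the same because both sides cover the same window).

```python
import re
from collections import defaultdict

FIVE_XX = re.compile(r"5..")  # same pattern as status=~"5.." in the rule


def error_ratio(increases):
    """Group per-series request increases by (chain, instance) and return 5xx / total."""
    errors = defaultdict(float)
    totals = defaultdict(float)
    for (chain, instance, status), count in increases.items():
        key = (chain, instance)
        totals[key] += count
        # PromQL regex matchers are fully anchored, hence fullmatch.
        if FIVE_XX.fullmatch(status):
            errors[key] += count
    # Skip groups with no traffic to avoid dividing by zero.
    return {key: errors[key] / total for key, total in totals.items() if total > 0}


# Hypothetical request increases over one 5-minute window.
window_increases = {
    ("examplechain", "node-a", "200"): 920,
    ("examplechain", "node-a", "400"): 20,
    ("examplechain", "node-a", "503"): 60,
    ("examplechain", "node-b", "200"): 500,
}

ratios = error_ratio(window_increases)
breaching = sorted(key for key, ratio in ratios.items() if ratio > 0.05)
print(breaching)  # [('examplechain', 'node-a')]
```

In this example the 400 responses count toward the denominator but not the numerator, so client validation errors dilute the ratio instead of triggering a page.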

Alert quality rules

| Rule | Requirement |
| --- | --- |
| Page only on user or chain risk | A page must require immediate human action. Everything else is a ticket or a dashboard panel. |
| Include the runbook link | Alerts should link to /operations/common-runbooks or a chain-specific operations page. |
| Use sustained windows | Avoid paging on a single scrape failure unless it represents an endpoint outage. |
| Test silence and routing | Maintenance windows must silence expected alerts without hiding unrelated failures. |
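
The "sustained windows" rule can be illustrated with a simplified model of how a hold period suppresses one-off blips. This is a sketch, not Prometheus's exact pending/firing state machine; `first_firing_index` and the evaluation sequences are hypothetical.

```python
def first_firing_index(breaches, required_consecutive):
    """Return the index of the evaluation at which the alert fires, or None.

    `breaches` holds one boolean per evaluation cycle (True = threshold exceeded).
    The alert fires once the threshold has been exceeded for
    `required_consecutive` evaluations in a row.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    streak = 0
    for index, breached in enumerate(breaches):
        streak = streak + 1 if breached else 0  # any recovery resets the streak
        if streak >= required_consecutive:
            return index
    return None


blips = [False, True, False, False, True, False]
sustained = [False, True, True, True, True]

print(first_firing_index(blips, required_consecutive=3))      # None
print(first_firing_index(sustained, required_consecutive=3))  # 3
```

Setting `required_consecutive=1` models paging on a single failed evaluation, which suits only checks where one failure already means an endpoint outage.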

Minimum smoke tests

Monitoring should include active probes in addition to passive metrics:

# JSON-RPC style probe example.
# -f fails on HTTP errors; --max-time makes a hung node count as a failure.
curl -fsS --max-time 5 "$RPC_URL" \
  -H 'content-type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"web3_clientVersion","params":[]}' >/dev/null

Use the chain-specific smoke methods from the developer interface pages, and run them from the Prometheus Blackbox Exporter or an equivalent synthetic monitor so that probe results reach the same alerting pipeline as passive metrics.
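
The curl probe above only checks the HTTP status, and a JSON-RPC server may return an error object with HTTP 200. A synthetic monitor can therefore also validate the response body. The following is a minimal sketch of such a check; `rpc_probe_ok` and the sample payloads are illustrative assumptions, not part of any particular client or exporter.

```python
import json


def rpc_probe_ok(body, expected_id=1):
    """Return True only for a well-formed JSON-RPC 2.0 success response."""
    try:
        reply = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON at all, e.g. an HTML error page from a proxy.
        return False
    if not isinstance(reply, dict):
        return False
    return (
        reply.get("jsonrpc") == "2.0"
        and reply.get("id") == expected_id
        and "result" in reply
        and "error" not in reply
    )


# Hypothetical response bodies a probe might receive.
success = '{"jsonrpc":"2.0","id":1,"result":"ExampleClient/v0.0.0"}'
rpc_error = '{"jsonrpc":"2.0","id":1,"error":{"code":-32601,"message":"Method not found"}}'
proxy_page = "<html>502 Bad Gateway</html>"

print(rpc_probe_ok(success))     # True
print(rpc_probe_ok(rpc_error))   # False
print(rpc_probe_ok(proxy_page))  # False
```

Matching on the response `id` also guards against a cache or proxy replaying a stale reply for a different request.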