Monitoring Architecture

FP Validated monitoring is built around a repeatable node-agent pattern and a chain-aware alert taxonomy. The goal is to observe infrastructure, Kubernetes, and chain health from outside the server, through controlled observability systems rather than direct shell access.

Node observability unit

Each monitored node can receive a standard observability bundle:

Component	Purpose	Public-safe signal examples
node-exporter	Host and node resource metrics.	CPU, memory, disk pressure, network saturation.
chain-exporter	Chain/runtime-specific metrics normalized for validator dashboards.	height progression, peers, consensus state, chain family labels.
log-guard	Log-based guardrail signals and pipeline health.	dropped records, error bursts, process-level health signals.

The monitoring repo uses per-node values and an ArgoCD ApplicationSet so onboarding a node is a declarative change, not manual dashboard wiring.
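
As a rough illustration of what "declarative onboarding" can look like, the sketch below builds an Argo CD ApplicationSet manifest with one list-generator element per node. The node names, family and network values, repository URL, chart path, namespace, and values-file layout are all hypothetical placeholders; the actual monitoring repo may use a different generator type or layout.

```python
import json

# Hypothetical node inventory; real entries would come from the
# monitoring repo's per-node values.
NODES = [
    {"node": "node-a", "family": "family-a", "network": "mainnet"},
    {"node": "node-b", "family": "family-b", "network": "testnet"},
]


def build_applicationset(nodes, repo_url, chart_path):
    """Return an ApplicationSet manifest (as a dict) with one element per node."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": "node-observability"},  # assumed name
        "spec": {
            # List generator: each element's keys become template parameters.
            "generators": [{"list": {"elements": list(nodes)}}],
            "template": {
                "metadata": {"name": "{{node}}-observability"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "targetRevision": "HEAD",
                        "path": chart_path,
                        # Assumed layout: one values file per node.
                        "helm": {"valueFiles": ["values/{{node}}.yaml"]},
                    },
                    "destination": {
                        "server": "https://kubernetes.default.svc",
                        "namespace": "monitoring",  # assumed namespace
                    },
                },
            },
        },
    }


manifest = build_applicationset(
    NODES,
    "https://example.invalid/monitoring.git",  # placeholder URL
    "charts/node-observability",  # placeholder path
)
print(json.dumps(manifest, indent=2))
```

Under these assumptions, onboarding a node amounts to appending one entry to the inventory and adding its values file; the generator then produces the corresponding Application without any manual dashboard or scrape wiring.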

Alert rule organization

Alert rules are organized by scope:

Scope	Examples	Action style
Common/platform	exporter down, targets missing, labels missing, dropped records.	Restore observability or node health first.
Chain-specific	block stall, peer degradation, consensus stress, mempool backlog, storage-node health.	Follow the chain family runbook.
Security/guardrail	unexpected access path, secret delivery issue, signer mismatch indicator.	Freeze risky automation and escalate.
Bundle	deployable PrometheusRule packages combining common and chain rules.	Version and promote through GitOps.

Alert labels carry dimensions such as severity, service, category, family, chain, and network. This enables routing and public status summaries without exposing internals.
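
A minimal sketch of how these labels could drive both routing and a public summary. The receiver names, the routing precedence, the allow-list of public-safe labels, and the sample alerts are assumptions made for illustration, not the project's actual routing configuration.

```python
from collections import Counter

# Labels assumed safe to publish; anything else (for example an instance
# address) is dropped from the public summary.
PUBLIC_LABELS = ("severity", "category", "family", "network")


def route(labels):
    """Pick a hypothetical receiver name from an alert's labels."""
    if labels.get("category") == "security":
        return "security-escalation"
    if labels.get("severity") == "critical":
        return "oncall-page"
    return "team-channel"


def public_summary(alerts):
    """Count alerts grouped by public-safe labels only."""
    counts = Counter(
        tuple(alert.get(key, "unknown") for key in PUBLIC_LABELS)
        for alert in alerts
    )
    return [
        dict(zip(PUBLIC_LABELS, key), count=n)
        for key, n in sorted(counts.items())
    ]


# Hypothetical alerts carrying the label dimensions described above.
alerts = [
    {"severity": "critical", "service": "chain-exporter", "category": "chain",
     "family": "family-a", "chain": "chain-1", "network": "mainnet",
     "instance": "10.0.0.5:9100"},
    {"severity": "warning", "service": "log-guard", "category": "platform",
     "family": "family-a", "chain": "chain-1", "network": "mainnet",
     "instance": "10.0.0.5:9100"},
    {"severity": "critical", "service": "signer", "category": "security",
     "family": "family-b", "chain": "chain-2", "network": "testnet",
     "instance": "10.0.0.9:9100"},
]

for alert in alerts:
    print(alert["service"], "->", route(alert))
print(public_summary(alerts))
```

The summary is built from an allow-list rather than a deny-list, so a label added later stays private until someone explicitly marks it as public-safe.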

Dashboard strategy

Dashboards are layered:

FP Validated overview.
Validator overview.
Chain detail dashboards.
Kubernetes cluster health dashboards.
Node resource dashboards.

Datasources are templated so dashboards can move between environments. Managed dashboards are kept in Git; manual edits are governed so GitOps does not unexpectedly overwrite operator changes.
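
One way to keep datasources templated is a check that runs before a dashboard is merged. The sketch below assumes Grafana-style dashboard JSON, where each panel has a `datasource` field that is either a string or an object with a `uid`, and where template variables start with `$`. The dashboard tooling is not named in this document, so treat the field names and the sample dashboard as assumptions.

```python
def hardcoded_datasources(dashboard):
    """Return titles of top-level panels whose datasource is not a template variable."""
    offenders = []
    for panel in dashboard.get("panels", []):
        ds = panel.get("datasource")
        # The datasource may be a plain string or an object with a "uid".
        uid = ds.get("uid") if isinstance(ds, dict) else ds
        if uid is not None and not str(uid).startswith("$"):
            offenders.append(panel.get("title", "<untitled>"))
    return offenders


# Hypothetical dashboard fragment for illustration.
dashboard = {
    "title": "Chain detail",
    "panels": [
        {"title": "Block height",
         "datasource": {"type": "prometheus", "uid": "${datasource}"}},
        {"title": "Peers",
         "datasource": {"type": "prometheus", "uid": "abc123"}},
        {"title": "Notes"},  # panel without a datasource is ignored
    ],
}

print(hardcoded_datasources(dashboard))
```

This sketch inspects only top-level panels; panels nested inside rows would need a recursive walk.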

What operators can do without server access

Operators should be able to answer these questions from the external observability layer:

Is the node reachable and exporting metrics?
Is Kubernetes scheduling and restarting workloads normally?
Is the chain progressing in height?
Are peers or consensus health degraded?
Did a recent deployment correlate with new alerts?
Is an incident already deduplicated and tracked by Guard?

If the answer requires direct server shell access, the runbook should either add an observability signal or document a controlled break-glass path.
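
As an example of answering one of these questions from metrics alone, the sketch below decides whether a chain is progressing in height using only scraped samples. The sample format, the helper names, and the 30-second threshold are hypothetical; a real rule would use a threshold tuned to the chain's block time.

```python
def seconds_since_progress(samples, now):
    """Return seconds since the height last increased.

    samples: list of (unix_timestamp, height) tuples, oldest first.
    If no increase is seen in the window, the age of the oldest sample
    is returned, which is a lower bound on the stall duration.
    """
    if not samples:
        return None
    last_change_ts = samples[0][0]
    for (_, prev_height), (ts, height) in zip(samples, samples[1:]):
        if height > prev_height:
            last_change_ts = ts
    return now - last_change_ts


def is_stalled(samples, now, max_stall_seconds=30):
    """True when the height has not increased within the allowed window."""
    age = seconds_since_progress(samples, now)
    return age is not None and age > max_stall_seconds


# Hypothetical scrapes taken every 15 seconds.
healthy = [(0, 100), (15, 101), (30, 102), (45, 103)]
stuck = [(0, 100), (15, 101), (30, 101), (45, 101)]

print(is_stalled(healthy, now=60))
print(is_stalled(stuck, now=60))
```

The same pattern applies to the other questions: if a check like this cannot be written against existing metrics, that gap is the signal to add to the exporter.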
