OpenBao HA/Raft observability

Use this explainer to understand the signals that describe OpenBao HA and Raft health. It is for operators who need to reason about leadership, quorum, Autopilot, replication, and failure tolerance before responding to dashboard or alert signals.

Why this matters

OpenBao integrated storage uses Raft to keep cluster state consistent across servers. A healthy HA cluster is not only a set of running pods or processes. It needs stable leadership, enough voters for quorum, followers that can keep up, and storage that can commit and apply new log entries.

HA/Raft observability helps you detect loss of safety margin before the cluster becomes unavailable.

Mental model

Read HA/Raft health from four layers:

Layer	What it tells you
OpenBao core state	Whether nodes are active, standby, sealed, or unsealed.
Raft quorum state	Whether enough voters exist to commit new log entries.
Replication state	Whether followers receive and apply leader log entries.
Operational context	Whether logs, restarts, storage, or network events explain metric changes.

One panel rarely explains the whole incident. Compare state, quorum, replication, and logs together.

OpenBao behavior

OpenBao integrated storage uses a consensus protocol based on Raft. The Raft peer set elects one leader. The leader is the active OpenBao node, and followers are standby nodes.

Raft needs quorum to commit new log entries. For a peer set of n voters, quorum is a majority: floor(n/2) + 1 voters. Non-voters do not count toward quorum. If quorum is unavailable, no new log entries can be committed and the cluster cannot make progress.
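The majority rule above can be worked through as a short sketch. The function names quorum_size and tolerated_failures are illustrative and are not part of OpenBao. The sketch counts voters only.

```python
def quorum_size(voters: int) -> int:
    """Return the smallest majority of a Raft voter set."""
    if voters < 1:
        raise ValueError("a Raft peer set needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Return how many voters can fail while a majority remains."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    # Print voter count, quorum size, and tolerated failures side by side.
    for n in (1, 2, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

A three-voter and a four-voter peer set both tolerate one voter failure, which is why adding an even-numbered voter raises the quorum size without raising the safety margin.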

Autopilot manages quorum health by evaluating when nodes are healthy enough to be voters and by tracking unhealthy or dead nodes. OpenBao documents Autopilot as enabled by default for integrated storage.

Signals to observe

Signal	What to inspect
Active node count	Exactly one active node exists.
Unsealed node count	Expected nodes report as unsealed.
Raft peer count	Peer count matches the expected cluster membership.
Autopilot health	Autopilot reports whether the Raft cluster is healthy.
Failure tolerance	Number of voters that can still fail before quorum is lost.
Node health	Per-node Autopilot health by Raft node ID.
Leader and candidate transitions	Leadership churn and election activity.
Commit and applied index	Follower progress and applied state.
Last-contact and heartbeat timers	Peer communication latency where exposed.
Raft and storage logs	Server-side explanation for metric changes.

How to interpret leadership

Healthy OpenBao HA has exactly one active node. Zero active nodes means the cluster cannot serve normal active-node traffic. More than one active node is a split-brain symptom and needs immediate response.

Leader and candidate transition counters show churn. A single transition can happen during planned maintenance. Repeated transitions point to unstable leadership, network problems, process restarts, or storage latency.
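The leadership rules above can be sketched as two small checks. This is a minimal sketch, assuming you already have an active-node count and an ordered list of transition-counter samples from your metrics backend. The function names and the default threshold of one transition per window are hypothetical and need tuning for your environment.

```python
def classify_active_count(active_nodes: int) -> str:
    """Map an active-node count to the states described in the text."""
    if active_nodes < 0:
        raise ValueError("active node count cannot be negative")
    if active_nodes == 1:
        return "ok"
    if active_nodes == 0:
        return "no-active-node"
    return "multiple-active-nodes"


def transitions_in_window(samples: list[int]) -> int:
    """Sum the counter increase across ordered samples, tolerating resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter reset to zero,
        # so the whole current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def is_churning(samples: list[int], max_transitions: int = 1) -> bool:
    """Flag a window with more transitions than the assumed threshold."""
    return transitions_in_window(samples) > max_transitions
```

A window with a single transition stays below the assumed threshold, which matches the planned-maintenance case. A counter reset inside the window is itself a hint that a process restarted, so check restart logs alongside the result.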

How to interpret quorum and failure tolerance

Failure tolerance is a safety-margin signal. A cluster with a failure tolerance of 0 still has quorum and can still serve traffic, but the next voter failure removes quorum and stops write progress.

Treat loss of failure tolerance as an operational event. It is often cheaper to fix voter health before quorum is lost than to recover after write progress stops.
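The safety-margin reasoning above can be illustrated with a simplified model. The Peer type, its fields, and the state names are hypothetical. The sketch assumes the margin equals healthy voters minus the quorum size of the full voter set, and the value Autopilot reports may be derived differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node_id: str
    voter: bool
    healthy: bool


def remaining_failure_tolerance(peers: list[Peer]) -> int:
    """Healthy voters minus the majority needed by the full voter set."""
    voters = [p for p in peers if p.voter]
    healthy = sum(1 for p in voters if p.healthy)
    quorum = len(voters) // 2 + 1
    return healthy - quorum


def margin_state(tolerance: int) -> str:
    """Translate the remaining tolerance into an operational state."""
    if tolerance < 0:
        return "quorum-lost"
    if tolerance == 0:
        return "no-margin"
    return "margin-available"


# Three voters and one non-voter, with one voter unhealthy.
cluster = [
    Peer("node-0", voter=True, healthy=True),
    Peer("node-1", voter=True, healthy=True),
    Peer("node-2", voter=True, healthy=False),
    Peer("replica-0", voter=False, healthy=True),
]
state = margin_state(remaining_failure_tolerance(cluster))
```

In this example the cluster still has quorum, but the state is "no-margin": the healthy non-voter does not add to the margin, and one more voter failure stops write progress.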

How to interpret replication

Commit and applied index panels show Raft progress. Healthy peers tend to move forward together. A peer that falls behind, stops applying entries, or keeps a large pending FSM backlog needs investigation.
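The comparison of commit and applied indexes can be sketched as follows. This is a minimal sketch, assuming you have already collected the leader commit index and a per-node applied index from all-node scrapes. The function names, node IDs, and the max_lag threshold are hypothetical.

```python
def apply_backlog(commit_index: int, applied_index: int) -> int:
    """Entries committed but not yet applied on one node."""
    return max(commit_index - applied_index, 0)


def lagging_peers(
    leader_commit_index: int,
    applied_by_node: dict[str, int],
    max_lag: int,
) -> dict[str, int]:
    """Return nodes whose applied index trails the leader by more than max_lag."""
    lagging = {}
    for node_id, applied in applied_by_node.items():
        lag = apply_backlog(leader_commit_index, applied)
        if lag > max_lag:
            lagging[node_id] = lag
    return lagging


# Hypothetical sample: node-2 has fallen well behind the leader.
behind = lagging_peers(
    leader_commit_index=1000,
    applied_by_node={"node-0": 1000, "node-1": 998, "node-2": 640},
    max_lag=100,
)
```

A small, short-lived gap is normal while entries are in flight. A gap that stays above the threshold, or an applied index that stops advancing while the leader commit index keeps growing, is the case that needs investigation.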

Replication latency can come from the network, disk, CPU pressure, storage contention, process stalls, or overloaded followers. Use platform telemetry and operational logs to identify the cause.

How to interpret Autopilot

Autopilot health gives a summarized Raft-cluster signal, but it is not the only HA signal. Combine it with peer count, failure tolerance, node health, leader transitions, and operational logs.

Recently joined nodes can pass through stabilization before becoming voters. During that period, a dashboard can show changing peer or health behavior without an outage.

Scrape profile implications

HA/Raft observability benefits from all-node scraping. Active-node scraping can show cluster-level state, but it limits visibility into standby and follower runtime behavior.

Use the private all-node profile when you need per-node Raft troubleshooting. Keep the all-node metrics path isolated because standby metrics access expands the metrics exposure surface.

Common mistakes

Treating pod readiness as Raft health.
Ignoring failure tolerance while the cluster still serves requests.
Assuming Autopilot health explains every Raft symptom.
Reading active-node scrape output as complete follower visibility.
Treating leader churn as harmless without checking storage and network context.
Adding or removing nodes without watching quorum and index progress.

Evidence basis

Classification	Meaning in this project
Confirmed OpenBao docs behavior	OpenBao documents Raft leaders, followers, quorum, committed entries, Autopilot behavior, and failure tolerance.
Observed fixture behavior	The OpenBao 2.5.4 HA fixture exercises active state, unseal state, peer metrics, Autopilot health, Raft storage stats, three voters, and one non-voter read replica.
Design decision	This project separates HA/Raft dashboard interpretation from recovery procedures and treats all-node scraping as the richer diagnostics profile.
To validate	Production Raft labels, storage latency, network policy, node replacement procedures, and alert thresholds.

What’s next

Use Active-node and all-node observability to choose the right scrape profile.
Use OpenBao HA/Raft metrics to connect HA and Raft concepts to concrete Prometheus series.
Use OpenBao HA/Raft dashboard to inspect the generated HA/Raft view.
Use OpenBao Raft and Autopilot health when HA/Raft alerts fire.
Use No active OpenBao leader when active node count drops to 0.
Use Multiple active OpenBao nodes when active node count is greater than 1.

Source: OpenBao documents integrated storage, Raft leaders, followers, quorum, and Autopilot in the OpenBao integrated storage documentation. OpenBao documents Raft configuration and Autopilot intervals in the OpenBao Raft storage configuration documentation. OpenBao documents Raft telemetry in the OpenBao Raft telemetry documentation.