OpenBao HA/Raft observability

Use this explainer to understand the signals that describe OpenBao HA and Raft health. It is for operators who need to reason about leadership, quorum, Autopilot, replication, and failure tolerance before responding to dashboard or alert signals.

Why this matters

OpenBao integrated storage uses Raft to keep cluster state consistent across servers. A healthy HA cluster is not only a set of running pods or processes. It needs stable leadership, enough voters for quorum, followers that can keep up, and storage that can commit and apply new log entries.

HA/Raft observability helps you detect loss of safety margin before the cluster becomes unavailable.

Mental model

Read HA/Raft health from four layers:

Layer | What it tells you
OpenBao core state | Whether nodes are active, standby, sealed, or unsealed.
Raft quorum state | Whether enough voters exist to commit new log entries.
Replication state | Whether followers receive and apply leader log entries.
Operational context | Whether logs, restarts, storage, or network events explain metric changes.

One panel rarely explains the whole incident. Compare state, quorum, replication, and logs together.

OpenBao behavior

OpenBao integrated storage uses a consensus protocol based on Raft. The Raft peer set elects one leader. The leader is the active OpenBao node, and followers are standby nodes.

Raft needs quorum to commit new log entries. For a peer set of n voters, quorum requires a strict majority, that is floor(n/2) + 1 voters. If quorum is unavailable, no new log entries can be committed and the cluster cannot make progress.
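The majority rule can be made concrete with a small calculation. This is a minimal sketch that assumes every counted peer is a voter; the function names are illustrative and are not part of OpenBao.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of a Raft voter set."""
    if voters < 1:
        raise ValueError("a Raft peer set needs at least one voter")
    return voters // 2 + 1


def max_failure_tolerance(voters: int) -> int:
    """Voters that can fail while a majority still remains."""
    return voters - quorum_size(voters)


# Print voters, quorum size, and tolerated failures for small clusters.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), max_failure_tolerance(n))
```

An even-sized voter set tolerates no more failures than the next smaller odd size: four voters tolerate one failure, the same as three.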

Autopilot manages quorum health by evaluating when nodes are healthy enough to be voters and by tracking unhealthy or dead nodes. OpenBao documents Autopilot as enabled by default for integrated storage.

Signals to observe

Signal | What to inspect
Active node count | Exactly one active node exists.
Unsealed node count | Expected nodes report as unsealed.
Raft peer count | Peer count matches the expected cluster membership.
Autopilot health | Autopilot reports whether the Raft cluster is healthy.
Failure tolerance | Remaining voter failures before quorum is at risk.
Node health | Per-node Autopilot health by Raft node ID.
Leader and candidate transitions | Leadership churn and election activity.
Commit and applied index | Follower progress and applied state.
Last-contact and heartbeat timers | Peer communication latency where exposed.
Raft and storage logs | Server-side explanation for metric changes.

How to interpret leadership

Healthy OpenBao HA has exactly one active node. Zero active nodes means the cluster cannot serve normal active-node traffic. More than one active node is a split-brain symptom and needs immediate response.
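The three cases above can be sketched as a small classifier. It assumes you have already collected a per-node "is active" flag, for example from an all-node scrape; the node names and the function are hypothetical.

```python
def classify_leadership(active_by_node: dict[str, bool]) -> str:
    """Classify HA leadership from a per-node 'is active' flag."""
    active = sorted(node for node, is_active in active_by_node.items() if is_active)
    if len(active) == 1:
        return f"ok: {active[0]} is active"
    if not active:
        return "critical: no active node"
    # More than one node claims the active role.
    return "critical: multiple active nodes: " + ", ".join(active)


# Hypothetical three-node cluster with one active node.
print(classify_leadership({"bao-0": True, "bao-1": False, "bao-2": False}))
```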

Leader and candidate transition counters show churn. A single transition can happen during planned maintenance. Repeated transitions point to unstable leadership, network problems, process restarts, or storage latency.
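One way to separate a single transition from churn is to measure how much the transition counter grew over a time window. The sketch below assumes the counter is monotonically increasing and restarts from zero when the process restarts; `churn_threshold` is a hypothetical value that you would tune for your environment.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter restarted from zero.
            total += cur
    return total


def classify_churn(samples: list[float], churn_threshold: float = 2.0) -> str:
    """Label a window of leader-transition counter samples."""
    increase = counter_increase(samples)
    if increase == 0:
        return "stable"
    if increase < churn_threshold:
        return "single-transition"
    return "churn"


print(classify_churn([3, 3, 4]))     # one transition in the window
print(classify_churn([5, 6, 1, 2]))  # counter reset plus further transitions
```

A "single-transition" result still deserves a check against the maintenance calendar; the label only describes the counter, not the cause.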

How to interpret quorum and failure tolerance

Failure tolerance is a safety-margin signal. A cluster with zero remaining failure tolerance can still serve traffic, but losing one more voter costs it quorum and stops write progress.

Treat loss of failure tolerance as an operational event. It is often cheaper to fix voter health before quorum is lost than to recover after write progress stops.
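The remaining margin can also be derived by hand from the voter count and the number of healthy voters. This is a minimal sketch under that assumption; it is not how Autopilot computes its own value, and the function names are illustrative.

```python
def remaining_failure_tolerance(total_voters: int, healthy_voters: int) -> int:
    """Healthy voters beyond the majority; negative means quorum is lost."""
    if not 0 <= healthy_voters <= total_voters or total_voters < 1:
        raise ValueError("expected 0 <= healthy_voters <= total_voters, total_voters >= 1")
    quorum = total_voters // 2 + 1
    return healthy_voters - quorum


def describe_margin(total_voters: int, healthy_voters: int) -> str:
    """Turn the remaining margin into an operator-facing label."""
    margin = remaining_failure_tolerance(total_voters, healthy_voters)
    if margin < 0:
        return "quorum lost"
    if margin == 0:
        return "serving, no safety margin"
    return f"serving, tolerates {margin} more voter failure(s)"


print(describe_margin(3, 3))
print(describe_margin(3, 2))
print(describe_margin(3, 1))
```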

How to interpret replication

Commit and applied index panels show Raft progress. Healthy peers tend to move forward together. A peer that falls behind, stops applying entries, or keeps a large backlog of entries waiting to be applied to its finite state machine (FSM) needs investigation.
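Falling behind can be expressed as the gap between the leader's commit index and each peer's applied index. The sketch below assumes you have both values per node; `max_lag` is a hypothetical threshold, and the node names are examples.

```python
def follower_lag(leader_commit_index: int, applied_index_by_node: dict[str, int]) -> dict[str, int]:
    """Entries each node still has to apply, relative to the leader's commit index."""
    return {
        node: max(0, leader_commit_index - applied)
        for node, applied in applied_index_by_node.items()
    }


def lagging_nodes(leader_commit_index: int, applied_index_by_node: dict[str, int], max_lag: int) -> list[str]:
    """Nodes whose lag exceeds the chosen threshold, sorted by name."""
    lag = follower_lag(leader_commit_index, applied_index_by_node)
    return sorted(node for node, entries in lag.items() if entries > max_lag)


applied = {"bao-0": 1000, "bao-1": 998, "bao-2": 640}
print(follower_lag(1000, applied))
print(lagging_nodes(1000, applied, max_lag=100))
```

A single snapshot only shows the gap. Comparing two snapshots shows whether a lagging peer is catching up or has stopped applying entries.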

Replication latency can come from the network, disk, CPU pressure, storage contention, process stalls, or overloaded followers. Use platform telemetry and operational logs to identify the cause.

How to interpret Autopilot

Autopilot health gives a summarized Raft-cluster signal, but it is not the only HA signal. Combine it with peer count, failure tolerance, node health, leader transitions, and operational logs.
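Combining the signals can be sketched as a function that collects findings instead of returning a single health flag. The `RaftSnapshot` type and its field names are hypothetical; populate them from whatever your dashboards or scrapes expose.

```python
from dataclasses import dataclass, field


@dataclass
class RaftSnapshot:
    autopilot_healthy: bool
    peer_count: int
    expected_peers: int
    failure_tolerance: int
    unhealthy_nodes: list[str] = field(default_factory=list)
    leader_transitions: int = 0  # transitions seen in the observation window


def findings(snapshot: RaftSnapshot) -> list[str]:
    """List every signal that deviates, even when Autopilot reports healthy."""
    out = []
    if not snapshot.autopilot_healthy:
        out.append("Autopilot reports the cluster unhealthy")
    if snapshot.peer_count != snapshot.expected_peers:
        out.append(f"peer count {snapshot.peer_count}, expected {snapshot.expected_peers}")
    if snapshot.failure_tolerance <= 0:
        out.append("no remaining failure tolerance")
    if snapshot.unhealthy_nodes:
        out.append("unhealthy nodes: " + ", ".join(snapshot.unhealthy_nodes))
    if snapshot.leader_transitions > 0:
        out.append(f"{snapshot.leader_transitions} leader transition(s) in window")
    return out


# Autopilot is healthy, yet two other signals still need attention.
print(findings(RaftSnapshot(True, 3, 3, 0, leader_transitions=2)))
```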

Recently joined nodes can pass through stabilization before becoming voters. During that period, a dashboard can show changing peer or health behavior without an outage.

Scrape profile implications

HA/Raft observability benefits from all-node scraping. Active-node scraping can show cluster-level state, but it limits visibility into standby and follower runtime behavior.

Use the private all-node profile when you need per-node Raft troubleshooting. Keep the all-node metrics path isolated because standby metrics access expands the metrics exposure surface.

Common mistakes

  • Treating pod readiness as Raft health.
  • Ignoring failure tolerance while the cluster still serves requests.
  • Assuming Autopilot health explains every Raft symptom.
  • Reading active-node scrape output as complete follower visibility.
  • Treating leader churn as harmless without checking storage and network context.
  • Adding or removing nodes without watching quorum and index progress.

Evidence basis

Classification | Meaning in this project
Confirmed OpenBao docs behavior | OpenBao documents Raft leaders, followers, quorum, committed entries, Autopilot behavior, and failure tolerance.
Observed fixture behavior | The OpenBao 2.5.4 HA fixture exercises active state, unseal state, peer metrics, Autopilot health, Raft storage stats, three voters, and one non-voter read replica.
Design decision | This project separates HA/Raft dashboard interpretation from recovery procedures and treats all-node scraping as the richer diagnostics profile.
To validate | Production Raft labels, storage latency, network policy, node replacement procedures, and alert thresholds.

What’s next

Source: OpenBao documents integrated storage, Raft leaders, followers, quorum, and Autopilot in the OpenBao integrated storage documentation. OpenBao documents Raft configuration and Autopilot intervals in the OpenBao Raft storage configuration documentation. OpenBao documents Raft telemetry in the OpenBao Raft telemetry documentation.