OpenBao HA/Raft metrics

Use this explainer to understand the HA, Raft, and Autopilot metrics used by the generated dashboards and alerts. It is for operators who need to connect Raft concepts to concrete Prometheus series and recording rules.

Why this matters

HA/Raft metrics show whether OpenBao has stable leadership, enough unsealed nodes, enough Raft peers, healthy Autopilot state, and follower progress. They help you detect loss of failure tolerance before the cluster becomes unavailable.

These metrics are most useful with all-node scraping. Active-node scraping can show some cluster-level state, but it limits standby and follower visibility.

Metric groups

The HA/Raft dashboard uses three metric groups:

| Group | Examples | Purpose |
| --- | --- | --- |
| Core HA state | core_active, core_unsealed | Show active and unsealed node state. |
| Normalized Raft health | raft_peers, autopilot_healthy, autopilot_failure_tolerance, autopilot_node_healthy | Show quorum and Autopilot health. |
| Raw Raft internals | raft_storage_stats_*, raft_state_*, raft_leader_lastContact, raft_replication_* | Show replication, leadership churn, and peer progress where available. |

Core HA state

| Source metric | Recording rule | Interpretation |
| --- | --- | --- |
| ${p}_core_active | openbao:core_active:sum | Number of nodes that report active. Healthy HA has exactly one. |
| ${p}_core_unsealed | openbao:core_unsealed:sum | Number of nodes that report unsealed. Healthy value depends on expected cluster size. |

${p} is the source prefix. Use vault for the OpenBao default or openbao when you configure metrics_prefix = "openbao".

Do not use unsealed node count as voter count. The value counts scraped OpenBao nodes that report unsealed state, which can include voters, non-voters, and nodes that are catching up.
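The prefix expansion and the two core HA checks can be sketched in a few lines of Python. This is a minimal illustration, not part of the repository: the function names `source_metric` and `check_core_ha` and the `expected_nodes` parameter are hypothetical, and the inputs stand in for the values of openbao:core_active:sum and openbao:core_unsealed:sum.

```python
from __future__ import annotations


def source_metric(name: str, prefix: str = "vault") -> str:
    """Expand ${p}_<name> for a given source prefix (hypothetical helper)."""
    return f"{prefix}_{name}"


def check_core_ha(active_sum: int, unsealed_sum: int, expected_nodes: int) -> list[str]:
    """Return findings for the core HA state; an empty list means no finding.

    expected_nodes is the number of scraped nodes you expect to be unsealed.
    It is NOT a voter count: unsealed nodes can include non-voters.
    """
    findings = []
    if active_sum != 1:
        findings.append(f"expected exactly 1 active node, saw {active_sum}")
    if unsealed_sum < expected_nodes:
        findings.append(f"expected {expected_nodes} unsealed nodes, saw {unsealed_sum}")
    return findings
```

For example, `source_metric("core_active", "openbao")` yields the name to query when metrics_prefix = "openbao" is configured, and `check_core_ha(1, 4, 4)` returns no findings for a four-node cluster with one active node.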

Autopilot and peer health

| Source metric | Recording rule | Interpretation |
| --- | --- | --- |
| ${p}_raft_peers | openbao:raft_peers:max | Maximum Raft peer count where OpenBao exposes the source metric. |
| ${p}_autopilot_healthy | openbao:autopilot_healthy:max | Whether Autopilot reports the cluster healthy. |
| ${p}_autopilot_failure_tolerance | openbao:autopilot_failure_tolerance:max | Number of voter failures Autopilot reports as tolerable. |
| ${p}_autopilot_node_healthy | openbao:autopilot_node_healthy:min | Per-node Autopilot health by node_id. |

The normalized peer rule falls back to counting ${p}_raft_storage_stats_commit_index by peer_id when ${p}_raft_peers is not present. This keeps the dashboard useful across the OpenBao 2.5.4 fixture and live all-node scrape behavior observed in this repository.
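The fallback logic can be illustrated outside PromQL. The sketch below assumes samples arrive as (metric name, labels, value) tuples, which is a hypothetical input shape chosen for the example; the actual recording rule expression lives in the repository and is not reproduced here.

```python
from __future__ import annotations

from typing import Dict, List, Optional, Tuple


def normalized_raft_peers(
    samples: List[Tuple[str, Dict[str, str], float]],
    prefix: str = "vault",
) -> Optional[int]:
    """Prefer <prefix>_raft_peers; otherwise count distinct peer_id labels
    on <prefix>_raft_storage_stats_commit_index. Returns None if neither
    series is present."""
    peers_name = f"{prefix}_raft_peers"
    commit_name = f"{prefix}_raft_storage_stats_commit_index"

    # Primary path: take the maximum reported peer count.
    peer_values = [value for name, _, value in samples if name == peers_name]
    if peer_values:
        return int(max(peer_values))

    # Fallback path: one commit_index series per peer_id.
    peer_ids = {
        labels["peer_id"]
        for name, labels, _ in samples
        if name == commit_name and "peer_id" in labels
    }
    return len(peer_ids) if peer_ids else None
```

Both paths return the same number when every peer exposes a commit index, so panels built on the normalized rule keep working when the primary series is missing.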

The current fixture validates a topology with three voters plus one non-voter read replica. It observes ${p}_raft_peers as 4, Autopilot failure tolerance as 1, and autopilot_node_healthy with a node_id value for the read replica.

Raw Raft internals

Some detailed HA/Raft dashboard panels query raw vault_raft_* source metrics directly. These panels are intentionally more advanced because the current metric contract does not normalize every Raft detail series.

| Raw metric family | Interpretation |
| --- | --- |
| vault_raft_state_leader | Leader transition activity over a time window. |
| vault_raft_state_candidate | Candidate transition activity over a time window. |
| vault_raft_storage_stats_commit_index | Commit progress by peer. |
| vault_raft_storage_stats_applied_index | Applied progress by peer. |
| vault_raft_storage_stats_fsm_pending | Pending finite-state-machine work by peer. |
| vault_raft_storage_stats_term | Current Raft term by peer. |
| vault_raft_leader_lastContact | Leader last-contact timer where exposed. |
| vault_raft_replication_heartbeat | Replication heartbeat timer by peer. |
| vault_raft_replication_appendEntries_logs | Append-entry log activity by peer. |

If your deployment emits openbao_* source metrics, update these raw panels or add additional normalized recording rules before you rely on them.
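One way to adapt raw panel queries is to rewrite the metric prefix in each expression. The sketch below is a hedged example: `rewrite_prefix` is a hypothetical helper, and the sample query is an assumption for illustration, not a query copied from the generated dashboards.

```python
import re

# Match raw vault_raft_* metric names only; normalized recording rules such as
# openbao:raft_peers:max do not start with "vault_" and are left untouched.
RAW_RAFT_METRIC = re.compile(r"\bvault_(raft_[A-Za-z0-9_]+)")


def rewrite_prefix(expr: str, new_prefix: str = "openbao") -> str:
    """Replace the vault_ prefix on raw Raft metric names in a query string."""
    return RAW_RAFT_METRIC.sub(lambda m: f"{new_prefix}_{m.group(1)}", expr)


if __name__ == "__main__":
    # Assumed example expression, for illustration only.
    print(rewrite_prefix("max by (peer_id) (vault_raft_storage_stats_commit_index)"))
```

Review the rewritten panels against live series afterwards, because a text substitution cannot confirm that the target metric exists on your deployment.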

Read replicas and non-voters

OpenBao documents Raft non-voters as nodes that receive the data replication stream without participating in quorum. They can add read scalability, but they do not increase voter failure tolerance.
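The reason non-voters do not raise failure tolerance follows from standard Raft majority arithmetic: quorum is a majority of voters, and tolerance is the number of voters that can fail while a majority remains. The sketch below uses that general formula; it is not Autopilot's own implementation, and `voter_failure_tolerance` is a hypothetical name.

```python
def voter_failure_tolerance(voters: int, non_voters: int = 0) -> int:
    """Voter failures a Raft cluster can absorb while keeping quorum.

    non_voters is accepted only to make the point explicit: it does not
    enter the calculation, because non-voters do not count toward quorum.
    """
    if voters < 1:
        return 0
    quorum = voters // 2 + 1  # majority of voters
    return voters - quorum
```

With three voters the result is 1 whether or not a read replica is attached, which matches the failure tolerance the fixture observes for three voters plus one non-voter.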

Read non-voter health with all-node scraping and Autopilot or peer-state signals. Keep quorum alerts voter-aware, and keep read-capacity alerts separate from voter quorum alerts.

The OpenBao 2.5.4 HA/Raft fixture starts one non-voter with retry_join_as_non_voter = true. The fixture verifies that the node can read a sample KV value, list mounts and auth methods, emit read-replica audit entries, and appear in Raft peer and Autopilot state as a non-voter.

Treat role-specific production thresholds as environment-specific. The fixture validates the label shape and basic read behavior, not your production traffic split or read-capacity target.

How to interpret the signals

Start with active node count and unsealed node count. If either value is wrong, other metrics can be stale, partial, or misleading.

Then check Autopilot health and failure tolerance. A cluster can still serve requests while failure tolerance is already exhausted.

Use commit index, applied index, and pending FSM work to inspect follower progress. A follower that stops moving with the rest of the cluster needs investigation.
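Follower progress can be reasoned about as the gap between each peer's index and the highest index seen. The sketch below assumes you have already collected per-peer commit index values into dictionaries keyed by peer_id; the function names and the two-snapshot comparison are illustrative choices, not the dashboard's panel logic.

```python
from __future__ import annotations


def follower_lag(index_by_peer: dict[str, int]) -> dict[str, int]:
    """Distance of each peer from the highest index in the same snapshot."""
    if not index_by_peer:
        return {}
    head = max(index_by_peer.values())
    return {peer: head - idx for peer, idx in index_by_peer.items()}


def stalled_peers(previous: dict[str, int], current: dict[str, int]) -> list[str]:
    """Peers whose index did not advance between two snapshots while the
    cluster head did. An idle cluster reports no stalled peers."""
    if not previous or not current:
        return []
    head_moved = max(current.values()) > max(previous.values())
    if not head_moved:
        return []
    return sorted(
        peer for peer, idx in current.items()
        if peer in previous and idx <= previous[peer]
    )
```

Comparing two snapshots avoids flagging a quiet cluster, where no index moves because no writes are arriving.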

Use leader and candidate transition metrics as churn signals. One transition during planned maintenance can be normal. Repeated transitions usually need network, storage, process, or platform investigation.
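The churn guidance above can be expressed as a small classification. The threshold and labels below are hypothetical placeholders rather than values taken from the generated alerts; tune them to your environment.

```python
def classify_churn(
    transitions_in_window: int,
    planned_maintenance: bool = False,
    repeat_threshold: int = 2,  # hypothetical default, not from the alert rules
) -> str:
    """Classify leader or candidate transitions counted over one time window."""
    if transitions_in_window <= 0:
        return "stable"
    if transitions_in_window < repeat_threshold:
        # A single transition can be normal during planned maintenance.
        return "expected" if planned_maintenance else "review"
    # Repeated transitions point at network, storage, process, or platform issues.
    return "investigate"
```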

Scrape profile

Use all-node scraping when you need the strongest HA/Raft view. Active-node scraping is useful for secure cluster-level health, but it cannot fully describe follower runtime behavior.

The all-node scrape path needs isolation because scraping standby nodes depends on unauthenticated metrics access.

Common mistakes

  • Treating openbao:autopilot_healthy:max as the only Raft health signal.
  • Ignoring failure tolerance because the cluster still serves traffic.
  • Expecting raw vault_raft_* panels to work on an openbao_* source-prefix deployment without adaptation.
  • Reading active-node scrape output as complete follower visibility.
  • Counting all unsealed nodes as Raft voters.
  • Treating read-replica capacity alerts as voter quorum alerts.
  • Alerting on raw peer IDs without reviewing label cardinality and metadata exposure.

What’s next

Source: OpenBao documents Raft telemetry in the OpenBao Raft telemetry documentation. OpenBao documents integrated storage, quorum, and Autopilot in the OpenBao integrated storage documentation. This page also reflects the repository metric contract in contracts/metrics/openbao-core.yaml.