OpenBao HA/Raft dashboard

Use this explainer to read the generated OpenBao HA/Raft dashboard. It is for operators who need to inspect active-node state, Raft peer health, Autopilot signals, replication behavior, and Raft-related operational logs.

What this dashboard is for

Use the HA/Raft dashboard when the overview dashboard shows leader, seal, Autopilot, Raft peer, or storage symptoms.

The dashboard answers these questions:

  • How many Raft peers does OpenBao report?
  • Does Autopilot report the Raft cluster as healthy?
  • How many voter failures does Autopilot report as tolerable?
  • Which Raft nodes look unhealthy?
  • Did active-node or candidate transitions increase?
  • Are commit and applied indexes moving together?
  • Do operational logs mention Raft, storage, or Autopilot problems?

What this dashboard is not for

Do not use this dashboard as a Raft tuning guide. It shows symptoms and directional health signals. Use OpenBao configuration, server logs, and operational change history to explain why the signals changed.

Do not use this dashboard as a backup or recovery workflow. Use the relevant OpenBao integrated storage and snapshot procedures for those tasks.

Required data sources

The generated dashboard expects these Grafana data sources:

Data source | Expected UID | Used for
Prometheus | prometheus | OpenBao Raft, Autopilot, active, and seal metrics.
Loki | loki | Raft, storage, and Autopilot operational logs.

The dashboard works best with a private all-node scrape. Active-node scraping can show cluster-level health, but it limits standby and follower visibility.

Metric-prefix assumptions

Some panels use recording rules under the normalized openbao: prefix. Other detailed Raft panels still query raw vault_raft_* source metrics, because OpenBao exposes Raft internals as source metrics and the current contract has not normalized every detail series.

If your deployment emits openbao_* source metrics, adapt the dashboard contract or add compatible recording rules before you treat the raw Raft panels as complete.
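Before adapting the contract, it can help to confirm which raw prefix a scrape actually contains. The following is a minimal sketch, not part of the dashboard contract: the function name is hypothetical, and the `openbao_raft_` prefix is an assumption derived from the `openbao_*` pattern mentioned above. The metric names in the usage note are placeholders, not real series.

```python
# Candidate raw Raft prefixes. "vault_raft_" comes from this page;
# "openbao_raft_" is an assumed counterpart for openbao_* deployments.
CANDIDATE_PREFIXES = ("vault_raft_", "openbao_raft_")


def raft_source_prefixes(metric_names):
    """Return the set of candidate prefixes found in the scraped metric names."""
    found = set()
    for name in metric_names:
        for prefix in CANDIDATE_PREFIXES:
            if name.startswith(prefix):
                found.add(prefix)
    return found
```

For example, `raft_source_prefixes(["openbao_raft_placeholder_metric"])` returns `{"openbao_raft_"}`, which suggests the raw Raft panels would stay empty until compatible recording rules exist.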

How to read quorum health

Start with the first row:

Panel | Healthy interpretation
Raft peers | The value matches the expected cluster peer count.
Autopilot health | The value reports healthy.
Failure tolerance | The value matches the number of voter failures your cluster can tolerate.
Unhealthy Raft nodes | The value stays at 0.

Loss of failure tolerance changes incident severity. A cluster can remain available while losing the ability to tolerate another voter failure.
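The expected failure tolerance follows from Raft majority arithmetic: a cluster of n voters needs n // 2 + 1 of them for quorum. The sketch below works that out so you can compare it with the panel value. The function names are illustrative, and Autopilot may account for additional factors when it reports its own number.

```python
def raft_quorum_size(total_voters: int) -> int:
    """Smallest majority of the configured voters."""
    if total_voters < 1:
        raise ValueError("a Raft cluster needs at least one voter")
    return total_voters // 2 + 1


def failure_tolerance(healthy_voters: int, total_voters: int) -> int:
    """Further voter failures the cluster can absorb while keeping quorum.

    0 means the cluster still serves traffic but cannot lose another voter.
    A negative result means quorum is already lost.
    """
    return healthy_voters - raft_quorum_size(total_voters)
```

A three-voter cluster with one voter down gives `failure_tolerance(2, 3) == 0`: the cluster stays available, but the next voter failure costs quorum.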

How to read active and transition signals

The active node count must stay at exactly 1. The unsealed node count must match the expected running cluster size.
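The two conditions above can be expressed as a small check. This is a sketch under stated assumptions: the per-node dictionaries are a hypothetical input shape (for example, built from an all-node scrape), not something the dashboard produces.

```python
def check_active_and_unsealed(active_by_node, sealed_by_node, expected_nodes):
    """Return a list of findings; an empty list means both conditions hold.

    active_by_node: hypothetical mapping of node name -> True if active.
    sealed_by_node: hypothetical mapping of node name -> True if sealed.
    expected_nodes: expected running cluster size.
    """
    findings = []

    active_count = sum(1 for is_active in active_by_node.values() if is_active)
    if active_count != 1:
        findings.append(f"expected exactly 1 active node, found {active_count}")

    unsealed_count = sum(1 for is_sealed in sealed_by_node.values() if not is_sealed)
    if unsealed_count != expected_nodes:
        findings.append(
            f"expected {expected_nodes} unsealed nodes, found {unsealed_count}"
        )

    return findings
```

With an active-node-only scrape, the inputs would cover a single node, so the unsealed check would report a mismatch that reflects missing visibility rather than a sealed node.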

Leader and candidate transitions are churn signals. A nonzero value is not automatically an outage, but repeated transitions in a short window usually mean the cluster is losing stable leadership or detecting leader failures.

Compare transition spikes with operational logs, node restarts, network events, and storage latency.
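One way to separate a single transition from repeated churn is to total the counter increase inside a fixed window. The sketch below assumes the transition signal is a monotonically increasing counter sampled as (timestamp, value) pairs. The function name and the sample format are illustrative, and any threshold you compare the result against is your own choice.

```python
def transitions_in_window(samples, window_seconds, now):
    """Total counter increase over the last `window_seconds`.

    samples: list of (unix_timestamp, counter_value), sorted by timestamp.
    """
    in_window = [
        (ts, value) for ts, value in samples if now - window_seconds <= ts <= now
    ]
    increase = 0
    for (_, previous), (_, current) in zip(in_window, in_window[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the counter reset (for example, a process restart),
            # so count the new value as an increase from zero.
            increase += current
    return increase
```

A result of 1 over ten minutes reads very differently from a result of 5 over the same window, which is the distinction this section asks you to make before declaring an outage.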

How to read replication and storage panels

Commit index and applied index show Raft progress by peer. Healthy peers tend to move forward together. A peer that stops progressing, falls behind, or keeps large pending FSM work needs investigation.
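The comparison can be sketched as two gaps per peer: how far the peer's commit index trails the highest commit index, and how far its applied index trails its own commit index. The input dictionaries and the `max_lag` threshold are hypothetical. Pick a threshold that fits your write volume.

```python
def lagging_peers(commit_index, applied_index, max_lag):
    """Return {peer: (behind_head, apply_backlog)} for peers over `max_lag`.

    commit_index / applied_index: hypothetical mappings of peer -> index value.
    behind_head: gap between the highest commit index and this peer's commit index.
    apply_backlog: committed entries this peer has not applied yet.
    """
    if not commit_index:
        return {}

    head = max(commit_index.values())
    lagging = {}
    for peer, commit in commit_index.items():
        # A peer with no applied-index sample is treated as having applied nothing.
        applied = applied_index.get(peer, 0)
        behind_head = head - commit
        apply_backlog = commit - applied
        if behind_head > max_lag or apply_backlog > max_lag:
            lagging[peer] = (behind_head, apply_backlog)
    return lagging
```

A peer that appears here once may only be catching up after a restart. A peer that stays in the result across several samples matches the "stops progressing" case described above.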

Where OpenBao exposes them, leader last-contact and replication heartbeat timers are latency signals. Rising values can point to network, disk, CPU, or peer health problems.

Append entries logs show Raft append-entry activity by peer. Use the panel as context for replication behavior, not as a standalone health check.

How to read Raft and storage logs

The bottom panel filters operational logs for Raft, storage, and Autopilot terms. Use it to correlate metric changes with server-side messages.

Operational logs are troubleshooting context. They are not audit records.
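The same kind of filter can be approximated offline when you export log records for an incident review. This is a sketch, not the panel's actual query: the record shape (dictionaries with `log_stream` and `message` keys) and the exact term list are assumptions. The `openbao.operational` stream value is the one this dashboard depends on.

```python
import re

# Assumed term list; the generated panel may match different or additional terms.
RAFT_TERMS = re.compile(r"\b(raft|storage|autopilot)\b", re.IGNORECASE)


def matching_operational_logs(records):
    """Keep operational log records whose message mentions a Raft-related term.

    records: iterable of dicts with hypothetical "log_stream" and "message" keys.
    """
    return [
        record
        for record in records
        if record.get("log_stream") == "openbao.operational"
        and RAFT_TERMS.search(record.get("message", ""))
    ]
```

Records from other streams are dropped even when they mention Raft, which keeps audit data out of the troubleshooting view.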

Common mistakes

  • Reading active-node scrape results as complete follower visibility.
  • Treating Autopilot health as the only Raft signal.
  • Ignoring loss of failure tolerance while the cluster still serves traffic.
  • Treating one leader transition as an outage without checking the surrounding time window.
  • Forgetting that detailed Raft panels currently query raw vault_raft_* source metrics.

Known limitations

  • The dashboard needs all-node scraping for the strongest per-node view.
  • Some detailed Raft panels use raw vault_raft_* metric names.
  • Missing raw Raft internals can make detail panels empty while normalized health panels still work.
  • Operational logs depend on log_stream="openbao.operational".
  • The dashboard does not replace OpenBao storage recovery procedures.

What’s next

Source: OpenBao documents integrated storage, Raft, and Autopilot behavior in the OpenBao integrated storage documentation and the OpenBao Raft storage configuration documentation. This page describes the generated dashboard contract in contracts/dashboards/openbao-ha-raft.yaml.