OpenBao HA/Raft dashboard

Use this explainer to read the generated OpenBao HA/Raft dashboard. It is for operators who need to inspect active-node state, Raft peer health, Autopilot signals, replication behavior, and Raft-related operational logs.

What this dashboard is for

Use the HA/Raft dashboard when the overview dashboard shows leader, seal, Autopilot, Raft peer, or storage symptoms.

The dashboard answers these questions:

How many Raft peers does OpenBao report?
Does Autopilot report the Raft cluster as healthy?
How many voter failures does Autopilot report as tolerable?
Which Raft nodes look unhealthy?
Did active-node or candidate transitions increase?
Are commit and applied indexes moving together?
Do operational logs mention Raft, storage, or Autopilot problems?

What this dashboard is not for

Do not use this dashboard as a Raft tuning guide. It shows symptoms and directional health signals. Use OpenBao configuration, server logs, and operational change history to explain why the signals changed.

Do not use this dashboard as a backup or recovery workflow. Use the relevant OpenBao integrated storage and snapshot procedures for those tasks.

Required data sources

The generated dashboard expects these Grafana data sources:

Data source	Expected UID	Used for
Prometheus	`prometheus`	OpenBao Raft, Autopilot, active, and seal metrics.
Loki	`loki`	Raft, storage, and Autopilot operational logs.

The dashboard works best with a private all-node scrape. Active-node scraping can show cluster-level health, but it limits standby and follower visibility.

Metric-prefix assumptions

Some panels use normalized `openbao:` recording rules. Other detailed Raft panels still query raw `vault_raft_*` source metrics, because OpenBao exposes Raft internals as source metrics and the current contract has not yet normalized every detail series.

If your deployment emits `openbao_*` source metrics, adapt the dashboard contract or add compatible recording rules before you treat the raw Raft panels as complete.
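As a rough illustration of adapting the contract, the following Python sketch rewrites the raw Raft prefix in a panel query string. It assumes that the source metrics in your deployment differ only by prefix (`vault_raft_` versus `openbao_raft_`), which you need to confirm against your own scrape output. The function name and the metric name `vault_raft_example_metric` are placeholders, not names taken from the dashboard contract.

```python
import re

# Assumption: only the prefix differs between the two naming schemes.
# The word boundary keeps names such as "my_vault_raft_x" untouched.
_VAULT_RAFT_PREFIX = re.compile(r"\bvault_raft_")


def to_openbao_prefix(expr: str) -> str:
    """Return the query expression with vault_raft_ rewritten to openbao_raft_."""
    return _VAULT_RAFT_PREFIX.sub("openbao_raft_", expr)


# Placeholder metric name, used for illustration only.
print(to_openbao_prefix("max(vault_raft_example_metric)"))
```

Review every rewritten query by hand. A prefix swap does not guarantee that the renamed series exists or carries the same labels.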

How to read quorum health

Start with the first row:

Panel	Healthy interpretation
Raft peers	The value matches the expected cluster peer count.
Autopilot health	The value reports healthy.
Failure tolerance	The value matches the number of voter failures your cluster can tolerate.
Unhealthy Raft nodes	The value stays at `0`.

Loss of failure tolerance changes incident severity. A cluster can stay available after it has lost the ability to tolerate another voter failure, so treat a drop in this value as an escalation even when requests still succeed.
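The expected failure-tolerance value follows from Raft majority arithmetic: a cluster of N voters needs floor(N/2) + 1 voters for quorum and can lose the rest. The Python sketch below, with hypothetical function names, computes the value to compare against the panel when all voters are healthy. The value Autopilot reports can be lower when some voters are already unhealthy.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    if voters < 1:
        raise ValueError("a Raft cluster needs at least one voter")
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """Number of voters that can fail while a majority remains."""
    return voters - quorum_size(voters)


for n in (1, 3, 4, 5):
    print(f"voters={n} quorum={quorum_size(n)} tolerance={failure_tolerance(n)}")
```

Note that a four-voter cluster tolerates one failure, the same as a three-voter cluster.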

How to read active and transition signals

The active-node count must stay at exactly `1`. The unsealed-node count must match the expected number of running cluster nodes.

Leader and candidate transitions are churn signals. A nonzero value is not automatically an outage, but repeated transitions in a short window usually mean the cluster is losing stable leadership or detecting leader failures.

Compare transition spikes with operational logs, node restarts, network events, and storage latency.
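One way to separate a single transition from churn is to count transitions inside a sliding time window. The Python sketch below shows the idea. The 10-minute window and the threshold of 3 are illustrative assumptions, not values from the dashboard contract, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def transitions_in_window(events, end, window):
    """Count transition timestamps in the interval (end - window, end]."""
    start = end - window
    return sum(1 for ts in events if start < ts <= end)


def is_churning(events, end, window=timedelta(minutes=10), threshold=3):
    """Return True when the window holds at least `threshold` transitions."""
    return transitions_in_window(events, end, window) >= threshold


# Illustrative timestamps, not real cluster data.
now = datetime(2024, 1, 1, 12, 0)
single = [now - timedelta(minutes=2)]
burst = [now - timedelta(minutes=m) for m in (1, 4, 7)]

print(is_churning(single, now))  # False
print(is_churning(burst, now))   # True
```

Tune the window and threshold to the election behavior you observe in your own cluster.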

How to read replication and storage panels

Commit index and applied index show Raft progress by peer. Healthy peers tend to move forward together. A peer that stops progressing, falls behind, or keeps large pending FSM work needs investigation.
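The comparison between peers can be sketched in Python as two lag figures per peer: how far the peer's commit index trails the highest commit index in the cluster, and how many committed entries it has not yet applied. The peer names, index values, and the `max_lag` threshold of 100 are made-up assumptions for illustration.

```python
def peer_lag(commit_index, applied_index):
    """Return per-peer lag figures from two {peer: index} mappings."""
    if not commit_index:
        return {}
    cluster_commit = max(commit_index.values())
    report = {}
    for peer, commit in commit_index.items():
        # A peer with no applied-index sample is treated as fully unapplied.
        applied = applied_index.get(peer, 0)
        report[peer] = {
            "behind_cluster": cluster_commit - commit,
            "unapplied": commit - applied,
        }
    return report


def peers_needing_attention(commit_index, applied_index, max_lag=100):
    """Return sorted peer names whose lag exceeds max_lag on either figure."""
    return sorted(
        peer
        for peer, lag in peer_lag(commit_index, applied_index).items()
        if lag["behind_cluster"] > max_lag or lag["unapplied"] > max_lag
    )


# Made-up sample values, not real cluster data.
commit = {"node-a": 5000, "node-b": 4998, "node-c": 4200}
applied = {"node-a": 5000, "node-b": 4990, "node-c": 4200}
print(peers_needing_attention(commit, applied))  # ['node-c']
```

A small, short-lived gap is normal under write load. A gap that keeps growing is the signal to investigate.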

Where OpenBao exposes them, the leader last-contact and replication heartbeat timers are latency signals. Rising values can point to network, disk, CPU, or peer health problems.

The Append entries logs panel shows Raft append-entry activity by peer. Use the panel as context for replication behavior, not as a standalone health check.

How to read Raft and storage logs

The bottom panel filters operational logs for Raft, storage, and Autopilot terms. Use it to correlate metric changes with server-side messages.

Operational logs are troubleshooting context. They are not audit records.

Common mistakes

Reading active-node scrape results as complete follower visibility.
Treating Autopilot health as the only Raft signal.
Ignoring loss of failure tolerance while the cluster still serves traffic.
Treating one leader transition as an outage without checking the surrounding time window.
Forgetting that detailed Raft panels currently query raw `vault_raft_*` source metrics.

Known limitations

The dashboard needs all-node scraping for the strongest per-node view.
Some detailed Raft panels use raw `vault_raft_*` metric names.
Missing raw Raft internals can make detail panels empty while normalized health panels still work.
The operational log panel depends on the `log_stream="openbao.operational"` label.
The dashboard does not replace OpenBao storage recovery procedures.

What’s next

Use Configure an all-node metrics scrape when you need standby and follower visibility.
Use OpenBao HA/Raft observability to understand leadership, quorum, Autopilot, and replication signals.
Use OpenBao HA/Raft metrics to connect dashboard panels to source metrics and recording rules.
Use OpenBao namespaces and scale dashboard when non-voter, read-replica, or namespace context matters.
Use OpenBao operational logs dashboard when the log panels need deeper process context.
Use OpenBao Raft and Autopilot health when Raft or Autopilot alerts fire.
Use No active OpenBao leader when the cluster has no active node.
Use Multiple active OpenBao nodes when more than one node reports active.

Source: OpenBao documents integrated storage, Raft, and Autopilot behavior in the OpenBao integrated storage documentation and the OpenBao Raft storage configuration documentation. This page describes the generated dashboard contract in `contracts/dashboards/openbao-ha-raft.yaml`.