
Use this runbook when

  • the cluster cannot elect or keep a leader
  • Raft commands time out or report no leader elected
  • pods are running but the cluster still cannot form quorum
  • you need to decide whether a stale peer, network break, or manual recovery path is required

Decision matrix

Match the failure before you repair it.

| Signal | Start with | Why |
| --- | --- | --- |
| `raft list-peers` hangs or returns transport errors. | Check network reachability and DNS on the cluster port. | A healthy Raft set still fails if members cannot talk to each other on the internal cluster address. |
| A failed or stale member still appears in the peer list. | Remove the dead peer from a healthy node. | The cluster may be counting a non-existent member toward quorum. |
| No node can form quorum and there is no healthy leader to remove peers from. | Use manual quorum recovery with `peers.json` as a last resort. | This is destructive and should be used only when automatic recovery is no longer possible. |

Diagram

No-leader decision flow

Check pod health first, then verify the cluster network, then decide whether this is a stale-peer cleanup or a manual quorum-recovery event.
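The triage order above can be sketched as a small shell function. This is only an illustration of the decision flow; the argument names and messages are not part of any OpenBao tooling:

```shell
#!/usr/bin/env bash
# Illustrative triage helper encoding the decision flow above.
# Inputs are yes/no answers to the three questions, in order.
triage() {
  local pods_healthy=$1 network_ok=$2 healthy_node_left=$3
  if [[ $pods_healthy != yes ]]; then
    echo "fix pod-level failures first"
  elif [[ $network_ok != yes ]]; then
    echo "repair the cluster network path"
  elif [[ $healthy_node_left == yes ]]; then
    echo "remove the stale peer from a healthy node"
  else
    echo "manual quorum recovery with peers.json"
  fi
}

triage yes yes no   # → manual quorum recovery with peers.json
```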

Inspect pod health and the Raft view

Inspect

Capture pod and peer state

```bash
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```

Run the Raft peer command from each healthy Pod. Different peer views between nodes usually point to a transport or membership problem rather than a generic application issue.

If Pods are CrashLoopBackOff, resolve the pod-level failure first. Quorum recovery does not help when the underlying member still cannot boot, mount storage, or load configuration.
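Comparing the peer views mechanically makes divergence easy to spot. The sketch below diffs the peer IDs from two captured `list-peers` outputs; the sample rows and the `peer_ids` helper are illustrative, not part of the CLI:

```shell
#!/usr/bin/env bash
# Extract peer IDs (first column) from captured `list-peers` output,
# skipping the two header lines, and sort for a stable diff.
peer_ids() {
  printf '%s\n' "$1" | tail -n +3 | awk '{print $1}' | sort
}

# Sample outputs captured from two different nodes (illustrative).
view_a="Node   Address          State     Voter
----   -------          -----     -----
bao-0  bao-0.bao:8201   leader    true
bao-1  bao-1.bao:8201   follower  true"

view_b="Node   Address          State     Voter
----   -------          -----     -----
bao-0  bao-0.bao:8201   leader    true
bao-1  bao-1.bao:8201   follower  true
bao-2  bao-2.bao:8201   follower  true"

# Any diff output means the members disagree about cluster membership.
diff <(peer_ids "$view_a") <(peer_ids "$view_b") || echo "peer views diverge"
```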

Verify the cluster network path

Inspect

Check the Raft transport path

```bash
kubectl exec -n <namespace> -it <pod-a> -- nc -zv <pod-b>.<headless-service> 8201
kubectl exec -n <namespace> -it <pod-a> -- nslookup <pod-b>.<headless-service>
```

Use the actual cluster port if you changed it from the default. A DNS failure or a blocked cluster port (8201 by default) is enough to make a healthy cluster look like a quorum failure.

Check these first when the transport path is broken:

  • NetworkPolicy rules that should allow cluster-to-cluster traffic
  • service and headless-service DNS resolution
  • sidecar or mesh policies that may block direct Pod-to-Pod communication
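When checking DNS, it helps to know the exact name you expect to resolve: a StatefulSet Pod behind a headless Service gets the stable name `<pod>.<service>.<namespace>.svc.cluster.local`, and the Raft cluster port conventionally sits one above the API port (8200 → 8201). A minimal sketch, with placeholder names:

```shell
#!/usr/bin/env bash
# Build the FQDN and cluster port you expect the transport path to use.
# pod, service, and namespace are placeholders; adjust api_port if you
# configured a non-default listener or set cluster_addr explicitly.
pod="openbao-1"
service="openbao-internal"
namespace="openbao"
api_port=8200

fqdn="${pod}.${service}.${namespace}.svc.cluster.local"
cluster_port=$((api_port + 1))

echo "expect ${fqdn}:${cluster_port} to be resolvable and reachable"
```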

Remove a stale peer when quorum is almost healthy

Use this path only when one healthy node can still answer Raft commands and the peer list clearly shows a dead member that should no longer count toward quorum.

Apply

Remove a dead peer from the Raft set

```bash
kubectl exec -n <namespace> -it <healthy-pod> -- \
  bao operator raft remove-peer -id "<dead-peer-id>"
```

After removing the peer, re-run bao operator raft list-peers from the same node and watch the OpenBaoCluster conditions until a leader is present again.
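That check can be scripted against captured `list-peers` output. The sample rows and the `check_recovered` helper below are illustrative, not part of the CLI:

```shell
#!/usr/bin/env bash
# Captured `list-peers` output after the removal (illustrative sample).
peers="Node   Address          State     Voter
----   -------          -----     -----
bao-0  bao-0.bao:8201   leader    true
bao-1  bao-1.bao:8201   follower  true"

# Succeed only if the removed peer is gone AND a leader is present.
check_recovered() {
  local dead=$1
  if printf '%s\n' "$peers" | awk '{print $1}' | grep -qx "$dead"; then
    echo "stale peer ${dead} still listed"
    return 1
  fi
  if printf '%s\n' "$peers" | grep -qw leader; then
    echo "leader present"
  else
    echo "no leader yet"
    return 1
  fi
}

check_recovered bao-2   # → leader present
```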

Use manual quorum recovery only as a last resort

Manual quorum recovery can discard newer state

Creating peers.json forces one survivor to bootstrap a new one-node Raft cluster. If you pick a node that was behind, you can lose data that only existed on another member. Use this path only when normal quorum recovery is no longer possible.

Before you start:

  • stop the operator so it does not race your manual changes
  • choose the Pod with the most recent and trustworthy data
  • record which Pods you plan to restart and how you will rejoin them afterward

Apply

Stop operator reconciliation before manual recovery

```bash
kubectl scale deploy <controller-deployment> -n <operator-namespace> --replicas=0
```

Configure

Create a one-node peers.json file

```json
[
  {
    "id": "<survivor-pod-name>",
    "address": "<survivor-pod-name>.<headless-service>:8201",
    "non_voter": false
  }
]
```

Save this file locally as peers.json, replacing the Pod name and address with the survivor you selected.
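Generating the file from variables reduces the risk of a typo in the ID or address. The names below are placeholders you must replace with your actual survivor Pod and headless Service, and the `json.tool` call is only a local syntax check:

```shell
#!/usr/bin/env bash
# Generate and syntax-check a one-node peers.json locally before
# copying it to the Pod. SURVIVOR and SERVICE are placeholders.
SURVIVOR="openbao-0"
SERVICE="openbao-internal"

cat > peers.json <<EOF
[
  {
    "id": "${SURVIVOR}",
    "address": "${SURVIVOR}.${SERVICE}:8201",
    "non_voter": false
  }
]
EOF

# Fail fast on malformed JSON rather than discovering it after restart.
python3 -m json.tool peers.json >/dev/null && echo "peers.json OK"
```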

Apply

Copy peers.json to the survivor and restart it

```bash
kubectl cp peers.json <namespace>/<survivor-pod-name>:/bao/data/raft/peers.json
kubectl delete pod -n <namespace> <survivor-pod-name>
```

When the survivor returns:

  1. verify it now reports a leader-capable state
  2. unseal it if required
  3. delete the remaining Pods so they rejoin against the new leader
  4. restart the operator deployment

Verify

Resume normal operations after manual recovery

```bash
kubectl exec -n <namespace> -it <survivor-pod-name> -- bao operator raft list-peers
kubectl delete pod -n <namespace> <other-pod-1> <other-pod-2>
kubectl scale deploy <controller-deployment> -n <operator-namespace> --replicas=1
```

Close out the incident deliberately

If the operator entered break-glass mode during the failure, inspect the break-glass status and only acknowledge it after quorum, sealing, and workload health are all stable again.

See Enter Safe Mode for the acknowledgment flow.

