Repair quorum before you change Raft membership.
A no-leader incident is not one thing. Sometimes the cluster cannot elect a leader because pods are crash-looping or the cluster port is blocked; sometimes a dead peer still counts toward quorum. Only when those narrower fixes are exhausted should you move on to manual quorum recovery.
Decision matrix
Match the failure before you repair it
| Signal | Start with | Why |
|---|---|---|
| Pods are crash-looping or never become ready. | Fix pod health, storage, or config first. | Quorum cannot recover while one or more members are not healthy enough to participate in Raft. |
| `bao operator raft list-peers` hangs or returns transport errors. | Check network reachability and DNS on the cluster port. | A healthy Raft set still fails if members cannot talk to each other on the internal cluster address. |
| A failed or stale member still appears in the peer list. | Remove the dead peer from a healthy node. | The cluster may be counting a non-existent member toward quorum. |
| No node can form quorum and there is no healthy leader to remove peers from. | Use manual quorum recovery with peers.json as a last resort. | This is destructive and should be used only when automatic recovery is no longer possible. |
Diagram
No-leader decision flow
Check pod health first, then verify the cluster network, then decide whether this is a stale-peer cleanup or a manual quorum-recovery event.
Inspect pod health and the Raft view
Inspect
Capture pod and peer state
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl get openbaocluster <name> -n <namespace> \
-o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
Run the Raft peer command from each healthy Pod. Different peer views between nodes usually point to a transport or membership problem rather than a generic application issue.
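Comparing peer views by hand is error-prone across several Pods. The helper below is a sketch, not part of any toolchain: it assumes the default table output of `bao operator raft list-peers` (header row, dashed rule, node ID in the first column) and reports whether two Pods disagree about membership.

```shell
# Hypothetical helper: normalize two `bao operator raft list-peers` outputs
# (drop the two header lines, sort the node IDs) and compare the peer sets.
peer_views_differ() {
  a=$(printf '%s\n' "$1" | awk 'NR>2 {print $1}' | sort)
  b=$(printf '%s\n' "$2" | awk 'NR>2 {print $1}' | sort)
  [ "$a" != "$b" ]   # exits 0 (true) when the two views disagree
}

# Example wiring (names are placeholders; adjust namespace and label):
#   for pod in $(kubectl get pods -n <namespace> -l openbao.org/cluster=<name> \
#       -o jsonpath='{.items[*].metadata.name}'); do
#     kubectl exec -n <namespace> "$pod" -- bao operator raft list-peers \
#       > "/tmp/${pod}.peers"
#   done
#   peer_views_differ "$(cat /tmp/pod-a.peers)" "$(cat /tmp/pod-b.peers)" \
#     && echo "split peer view: suspect transport or membership"
```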
If Pods are in CrashLoopBackOff, resolve the pod-level failure first. Quorum recovery does not help when the underlying member still cannot boot, mount storage, or load configuration.
Verify the cluster network path
Inspect
Check the Raft transport path
kubectl exec -n <namespace> -it <pod-a> -- nc -zv <pod-b>.<headless-service> 8201
kubectl exec -n <namespace> -it <pod-a> -- nslookup <pod-b>.<headless-service>
Use the actual cluster port if you changed it from the default. Failed DNS or blocked port 8201 is enough to make a healthy cluster look like a quorum failure.
Check these first when the transport path is broken:
- NetworkPolicy rules that should allow cluster-to-cluster traffic
- service and headless-service DNS resolution
- sidecar or mesh policies that may block direct Pod-to-Pod communication
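To avoid probing only one pair of Pods, the transport check can be repeated against every peer. The helper below is a minimal sketch that only builds the per-Pod cluster address the `nc`/`nslookup` checks should target; it assumes the default cluster port 8201 and placeholder Pod and service names.

```shell
# Hypothetical helper: build the Raft transport address for one Pod,
# assuming the default cluster port 8201 unless overridden.
cluster_addr() {
  pod="$1"; svc="$2"; port="${3:-8201}"
  printf '%s.%s:%s\n' "$pod" "$svc" "$port"
}

# Example wiring (placeholder names; substitute your Pods and service):
#   for pod in bao-0 bao-1 bao-2; do
#     addr=$(cluster_addr "$pod" <headless-service>)
#     kubectl exec -n <namespace> -it <pod-a> -- nc -zv "${addr%:*}" "${addr##*:}"
#   done
```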
Remove a stale peer when quorum is almost healthy
Use this path only when one healthy node can still answer Raft commands and the peer list clearly shows a dead member that should no longer count toward quorum.
Apply
Remove a dead peer from the Raft set
kubectl exec -n <namespace> -it <healthy-pod> -- \
bao operator raft remove-peer "<dead-peer-id>"
After removing the peer, re-run bao operator raft list-peers from the same node and watch the OpenBaoCluster conditions until a leader is present again.
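Identifying which peer is stale can itself be scripted. The helper below is a hypothetical sketch: it assumes the default table output of `bao operator raft list-peers` and prints any Raft node ID that does not match a running Pod name, so you can confirm the dead member before removing it.

```shell
# Hypothetical helper: print Raft node IDs that have no matching live Pod.
# Arg 1 is the captured list-peers output; remaining args are live Pod names.
stale_peers() {
  peers_output="$1"; shift
  printf '%s\n' "$peers_output" | awk 'NR>2 {print $1}' | while read -r id; do
    live=false
    for pod in "$@"; do [ "$id" = "$pod" ] && live=true; done
    $live || printf '%s\n' "$id"
  done
}

# Example wiring (placeholder names):
#   stale_peers "$(kubectl exec -n <namespace> <healthy-pod> -- \
#       bao operator raft list-peers)" bao-0 bao-1 bao-2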
Use manual quorum recovery only as a last resort
Creating peers.json forces one survivor to bootstrap a new one-node Raft cluster. If you pick a node that was behind, you can lose data that only existed on another member. Use this path only when normal quorum recovery is no longer possible.
Before you start:
- stop the operator so it does not race your manual changes
- choose the Pod with the most recent and trustworthy data
- record which Pods you plan to restart and how you will rejoin them afterward
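Choosing the survivor deserves more than a guess. The sketch below picks the Pod with the highest Raft applied index from "pod index" lines; the wiring comment assumes `bao status -format=json` exposes a `raft_applied_index` field, as Vault-family CLIs do. Verify that against your build before trusting the result.

```shell
# Hypothetical helper: read "pod applied_index" lines on stdin and print the
# pod with the highest index as the candidate survivor.
pick_survivor() {
  sort -k2 -n | tail -n1 | awk '{print $1}'
}

# Example wiring (assumes bao status -format=json exposes raft_applied_index;
# placeholder names throughout):
#   for pod in bao-0 bao-1 bao-2; do
#     idx=$(kubectl exec -n <namespace> "$pod" -- bao status -format=json \
#           | grep -o '"raft_applied_index":[0-9]*' | cut -d: -f2)
#     printf '%s %s\n' "$pod" "$idx"
#   done | pick_survivor
```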
Apply
Stop operator reconciliation before manual recovery
kubectl scale deploy <controller-deployment> -n <operator-namespace> --replicas=0
Configure
Create a one-node peers.json file
[
  {
    "id": "<survivor-pod-name>",
    "address": "<survivor-pod-name>.<headless-service>:8201",
    "non_voter": false
  }
]
Save this file locally as peers.json, replacing the Pod name and address with the survivor you selected.
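Hand-editing the template invites typos in the ID and address, which must match exactly. A small generator like the one below, a sketch using placeholder names and the default port 8201, writes the file from the values you pass in.

```shell
# Hypothetical generator for the one-node peers.json shown above.
# Args: survivor Pod name, headless service name, optional output path.
write_peers_json() {
  survivor="$1"; svc="$2"; out="${3:-peers.json}"
  cat > "$out" <<EOF
[
  {
    "id": "${survivor}",
    "address": "${survivor}.${svc}:8201",
    "non_voter": false
  }
]
EOF
}

# Example (placeholder names):
#   write_peers_json <survivor-pod-name> <headless-service>
```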
Apply
Copy peers.json to the survivor and restart it
kubectl cp peers.json <namespace>/<survivor-pod-name>:/bao/data/raft/peers.json
kubectl delete pod -n <namespace> <survivor-pod-name>
When the survivor returns:
- verify it now reports a leader-capable state
- unseal it if required
- delete the remaining Pods so they rejoin against the new leader
- restart the operator deployment
Verify
Resume normal operations after manual recovery
kubectl exec -n <namespace> -it <survivor-pod-name> -- bao operator raft list-peers
kubectl delete pod -n <namespace> <other-pod-1> <other-pod-2>
kubectl scale deploy <controller-deployment> -n <operator-namespace> --replicas=1
Close out the incident deliberately
If the operator entered break glass during the failure, inspect the break-glass status and only acknowledge it after quorum, sealing, and workload health are all stable again.
See Enter Safe Mode for the acknowledgment flow.
Continue with the right recovery path