Use safe mode to stop risky automation and recover control.
Break glass, also called safe mode, is the operator's explicit stop signal: it halts rollback automation when continuing could make availability or Raft safety worse. Use this page to inspect the break-glass state, stabilize the cluster, repair the failure, and only then let automation resume.
Use this runbook when:
- the operator reports break glass or safe mode on the cluster
- rollback automation stopped because continuing automatically is unsafe
- you need a clear pause before restarting pods or modifying Raft state
- you are ready to acknowledge a nonce only after the underlying issue is fixed
Decision matrix
What safe mode means
| Signal | What the operator is doing | Why it matters |
|---|---|---|
| Risky automation halted | The operator stops the affected upgrade or rollback workflow instead of pushing forward blindly. | This is the point where a human has to evaluate whether the live cluster is still repairable. |
| status.breakGlass populated | The cluster status contains the reason, message, nonce, and suggested next checks. | You should diagnose from that status first instead of guessing which internal job failed. |
| Manual acknowledgment required | Automation stays paused until spec.breakGlassAck matches the current nonce. | Acknowledgment is the explicit signal that you have repaired the issue and accept resumed automation. |
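
The acknowledgment rule in the last row can be checked from the command line. The following is a minimal sketch rather than an operator-provided tool: the function name `break_glass_ack_state` is hypothetical, and treating an empty `status.breakGlass.nonce` as "inactive" is an assumption about how the status looks when break glass is not engaged.

```shell
# Sketch: report whether break glass is still waiting for acknowledgment.
# Field paths come from this page; the function name is illustrative only.
break_glass_ack_state() {
  local name="$1" namespace="$2" nonce ack
  nonce="$(kubectl get openbaocluster "$name" -n "$namespace" \
    -o jsonpath='{.status.breakGlass.nonce}')"
  ack="$(kubectl get openbaocluster "$name" -n "$namespace" \
    -o jsonpath='{.spec.breakGlassAck}')"
  if [ -z "$nonce" ]; then
    echo "inactive"      # assumption: no nonce means nothing to acknowledge
  elif [ "$ack" = "$nonce" ]; then
    echo "acknowledged"  # spec.breakGlassAck matches the current nonce
  else
    echo "pending"       # automation stays paused until the values match
  fi
}
```

Call it as `break_glass_ack_state <name> <namespace>`. A result of `pending` means the operator is still waiting for a human decision.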
Inspect the break-glass state
Inspect
Read the break-glass status
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
Output
Typical break-glass payload
{
  "active": true,
  "reason": "RollbackConsensusRepairFailed",
  "message": "Rollback consensus repair Job upgrade-prod-cluster-rollback-retry-1 failed; manual intervention required.",
  "nonce": "abc-123-def-456",
  "steps": [
    "Inspect rollback Job logs",
    "Inspect pod status",
    "Perform any required Raft recovery steps, then acknowledge the nonce"
  ]
}
The reason, message, and steps fields are the fastest way to decide whether you are looking at an upgrade rollback problem, a cleanup failure after rollback, a Raft recovery problem, or a broader cluster-health issue.
For blue-green rollback incidents, the most common reasons are:
- RollbackConsensusRepairFailed: the operator could not complete the rollback repair path while the cluster was still in RollingBack.
- RollbackCleanupPeerRemovalFailed: the rollback itself converged far enough to enter RollbackCleanup, but the peer-removal cleanup job failed and automation stopped before stale green peers were safely removed.
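
As a rough illustration, the two reasons above can be mapped to a first triage step. This is a sketch, not part of the operator: `triage_hint` is a hypothetical helper, and the hint wording only restates the descriptions above.

```shell
# Sketch: turn a break-glass reason into a first triage hint.
# The reason names come from this page; the function is illustrative only.
triage_hint() {
  case "$1" in
    RollbackConsensusRepairFailed)
      echo "rollback repair failed during RollingBack: inspect the rollback Job logs first" ;;
    RollbackCleanupPeerRemovalFailed)
      echo "peer-removal cleanup failed during RollbackCleanup: check Raft peers for stale green members" ;;
    "")
      echo "no reason reported: break glass may not be active" ;;
    *)
      echo "other reason: follow status.breakGlass.message and steps" ;;
  esac
}
```

One way to use it is to pass the live reason directly: `triage_hint "$(kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass.reason}')"`.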
Repair the underlying issue before you acknowledge
Start with the operator-visible status and the last failed job, then move into the narrower runbook that matches the cluster state.
Inspect
Capture the current failure surface
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.lastJobFailure}{"\n"}'
kubectl logs -n <namespace> job/<job-from-status>
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
These commands tell you whether the failure is still centered on the rollback job, whether the Pods are actually healthy, and whether Raft membership is still coherent enough for a safe retry.
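
If you want a record of the failure surface before you change anything, the read-only queries above can be saved to files. The sketch below assumes a hypothetical helper name, `capture_failure_surface`, and an output directory and file names of your choosing; it leaves out the Job logs and the Raft peer listing because those need a Job or Pod name that you read from the output first.

```shell
# Sketch: save the read-only status queries from this section to files.
# Helper name and file names are illustrative, not part of the operator.
capture_failure_surface() {
  local name="$1" namespace="$2" outdir="$3"
  mkdir -p "$outdir" || return 1
  # Condition summary, one "type=status reason" entry per line.
  kubectl get openbaocluster "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}' \
    > "$outdir/conditions.txt"
  # Last failed blue-green job as reported by the operator.
  kubectl get openbaocluster "$name" -n "$namespace" \
    -o jsonpath='{.status.blueGreen.lastJobFailure}{"\n"}' \
    > "$outdir/last-job-failure.txt"
  # Pod placement and readiness for the cluster.
  kubectl get pods -n "$namespace" -l "openbao.org/cluster=$name" -o wide \
    > "$outdir/pods.txt"
}
```

For example, `capture_failure_surface <name> <namespace> ./break-glass-evidence` writes three text files that you can attach to the incident record.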
If you need to restart or delete managed Pods while admission policies require the openbao.org/maintenance=true signal, enable maintenance mode first and follow Run Planned Maintenance.
If the cluster needs a deeper incident path, move directly into the matching runbook instead of staying in generic safe mode.
Acknowledge and resume automation
Only acknowledge the nonce after the cluster is healthy enough for the operator to continue the paused workflow.
Apply
Acknowledge the current nonce
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
If the operator re-enters break glass later, it will issue a new nonce. Always use the current value from status.breakGlass.nonce, not a previously copied one.
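
To avoid pasting a stale nonce, you can read the value from status at the moment you acknowledge. This is a minimal sketch under the assumption that the nonce is a plain token with no quotes or backslashes; `ack_break_glass` is a hypothetical helper name, and the function should only be run after the repair work is done.

```shell
# Sketch: acknowledge break glass with the nonce currently in status.
# Helper name is illustrative; field paths come from this page.
ack_break_glass() {
  local name="$1" namespace="$2" nonce
  nonce="$(kubectl get openbaocluster "$name" -n "$namespace" \
    -o jsonpath='{.status.breakGlass.nonce}')"
  if [ -z "$nonce" ]; then
    echo "no break-glass nonce in status; nothing to acknowledge" >&2
    return 1
  fi
  # Assumes the nonce needs no JSON escaping.
  kubectl patch openbaocluster "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"breakGlassAck\":\"$nonce\"}}"
}
```

Run it as `ack_break_glass <name> <namespace>`. Because the nonce is fetched on each call, a re-entered break glass is acknowledged with its new nonce rather than an old copy.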