Use safe mode to stop risky automation and recover control.
Break glass, also called safe mode, is the operator's explicit stop signal for situations where continuing rollback automation could make availability or Raft safety worse. Use this page to inspect the break-glass state, stabilize the cluster, repair the failure, and only then let automation resume.
## What safe mode means

The decision matrix below summarizes the signals you will see:
| Signal | What the operator is doing | Why it matters |
|---|---|---|
| Risky automation halted | The operator stops the affected upgrade or rollback workflow instead of pushing forward blindly. | This is the point where a human has to evaluate whether the live cluster is still repairable. |
| `status.breakGlass` populated | The cluster status contains the reason, message, nonce, and suggested next checks. | You should diagnose from that status first instead of guessing which internal job failed. |
| Manual acknowledgment required | Automation stays paused until `spec.breakGlassAck` matches the current nonce. | Acknowledgment is the explicit signal that you have repaired the issue and accept resumed automation. |
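The pause condition in the last row can be checked directly. A minimal sketch, using the same `<name>`/`<namespace>` placeholders as the commands on this page:

```shell
# Compare the live nonce with the acknowledged value; automation stays
# paused while they differ.
NONCE=$(kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.breakGlass.nonce}')
ACK=$(kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.spec.breakGlassAck}')
if [ "$NONCE" = "$ACK" ]; then
  echo "acknowledged: automation may resume"
else
  echo "paused: nonce not yet acknowledged"
fi
```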
## Inspect the break-glass state

Read the break-glass status:

```shell
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
```
A typical break-glass payload looks like this:

```json
{
  "active": true,
  "reason": "RollbackConsensusRepairFailed",
  "message": "Rollback consensus repair Job upgrade-prod-cluster-rollback-retry-1 failed; manual intervention required.",
  "nonce": "abc-123-def-456",
  "steps": [
    "Inspect rollback Job logs",
    "Inspect pod status",
    "Perform any required Raft recovery steps, then acknowledge the nonce"
  ]
}
```
The `reason`, `message`, and `steps` fields are the fastest way to decide whether you are looking at an upgrade rollback problem, a Raft recovery problem, or a broader cluster-health issue.
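For quick triage you can pull just those fields, reusing the `jq` dependency already assumed above:

```shell
# Print only the triage fields from the break-glass payload,
# one per line: reason, message, then each suggested step.
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.breakGlass}' \
  | jq -r '.reason, .message, .steps[]'
```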
## Repair the underlying issue before you acknowledge
Start with the operator-visible status and the last failed job, then move into the narrower runbook that matches the cluster state.
Capture the current failure surface:

```shell
# Cluster conditions: type, status, and reason for each
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
# Last recorded blue/green job failure
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.lastJobFailure}{"\n"}'
# Logs from the job named in the status above
kubectl logs -n <namespace> job/<job-from-status>
# Pod placement and readiness for this cluster
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
# Raft membership as seen from inside a pod
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```
These commands tell you whether the failure is still centered on the rollback job, whether the Pods are actually healthy, and whether Raft membership is still coherent enough for a safe retry.
If you need to restart or delete managed Pods while admission policies require the `openbao.org/maintenance=true` signal, enable maintenance mode first and follow Run Planned Maintenance.
If the cluster needs a deeper incident path, move directly into the matching incident runbook instead of staying in generic safe mode.
## Acknowledge and resume automation
Only acknowledge the nonce after the cluster is healthy enough for the operator to continue the paused workflow.
Apply the acknowledgment by patching the current nonce into the spec:

```shell
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```
If the operator re-enters break glass later, it issues a new nonce. Always use the current value from `status.breakGlass.nonce`, never a previously copied one.
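If you prefer not to copy the nonce by hand, a small sketch (assuming a POSIX shell; only run it after confirming the cluster is healthy) reads the live nonce and acknowledges it in one step:

```shell
# Read the current nonce from status and patch it into the spec.
NONCE=$(kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.breakGlass.nonce}')
kubectl patch openbaocluster <name> -n <namespace> --type merge \
  -p "{\"spec\":{\"breakGlassAck\":\"$NONCE\"}}"
```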