
Use this runbook when

  • the operator reports break glass or safe mode on the cluster
  • rollback automation stopped because continuing automatically is unsafe
  • you need a clear pause before restarting pods or modifying Raft state
  • you are ready to acknowledge a nonce only after the underlying issue is fixed

Decision matrix

What safe mode means

| Signal | What the operator is doing | Why it matters |
| --- | --- | --- |
| status.breakGlass populated | The cluster status contains the reason, message, nonce, and suggested next checks. | Diagnose from that status first instead of guessing which internal job failed. |
| Manual acknowledgment required | Automation stays paused until spec.breakGlassAck matches the current nonce. | Acknowledgment is the explicit signal that you have repaired the issue and accept resumed automation. |

Inspect the break-glass state

Read the break-glass status

```bash
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
```

Typical break-glass payload

```json
{
  "active": true,
  "reason": "RollbackConsensusRepairFailed",
  "message": "Rollback consensus repair Job upgrade-prod-cluster-rollback-retry-1 failed; manual intervention required.",
  "nonce": "abc-123-def-456",
  "steps": [
    "Inspect rollback Job logs",
    "Inspect pod status",
    "Perform any required Raft recovery steps, then acknowledge the nonce"
  ]
}
```

The reason, message, and steps fields are the fastest way to decide whether you are looking at an upgrade rollback problem, a cleanup failure after rollback, a Raft recovery problem, or a broader cluster-health issue.
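To script that first triage step, you can pull the decision-relevant fields into shell variables with jq, as the read command above already suggests. A minimal sketch, reusing a trimmed copy of the sample payload above (in practice the payload comes from the kubectl jsonpath query shown earlier):

```shell
# In practice, fetch the payload from the live cluster:
#   payload=$(kubectl get openbaocluster <name> -n <namespace> \
#     -o jsonpath='{.status.breakGlass}')
# Here we reuse a trimmed copy of the sample payload from this page.
payload='{"active":true,"reason":"RollbackConsensusRepairFailed","nonce":"abc-123-def-456"}'

reason=$(printf '%s' "$payload" | jq -r '.reason')
nonce=$(printf '%s' "$payload" | jq -r '.nonce')

echo "reason=$reason nonce=$nonce"
```

Keeping the nonce in a variable also avoids copy-paste mistakes later, when you acknowledge it.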

For blue-green rollback incidents, the most common reasons are:

  • RollbackConsensusRepairFailed: the operator could not complete the rollback repair path while the cluster was still in RollingBack.
  • RollbackCleanupPeerRemovalFailed: the rollback itself converged far enough to enter RollbackCleanup, but the peer-removal cleanup job failed and automation stopped before stale green peers were safely removed.
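The two reasons above map naturally onto different first moves. A sketch of that branching, assuming the reason has already been extracted from status.breakGlass.reason (the suggested actions are illustrative summaries, not official runbook names):

```shell
# Illustrative: branch on the break-glass reason to pick the narrower path.
# $reason would normally be read from status.breakGlass.reason.
reason="RollbackConsensusRepairFailed"

case "$reason" in
  RollbackConsensusRepairFailed)
    next="inspect the rollback repair Job logs and Raft peer state" ;;
  RollbackCleanupPeerRemovalFailed)
    next="inspect the peer-removal cleanup Job and remaining stale green peers" ;;
  *)
    next="follow the steps listed in status.breakGlass.steps" ;;
esac

echo "$next"
```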

Repair the underlying issue before you acknowledge

Start with the operator-visible status and the last failed job, then move into the narrower runbook that matches the cluster state.

Capture the current failure surface

```bash
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.lastJobFailure}{"\n"}'
kubectl logs -n <namespace> job/<job-from-status>
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```

These commands tell you whether the failure is still centered on the rollback job, whether the Pods are actually healthy, and whether Raft membership is still coherent enough for a safe retry.
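The pod-health check lends itself to a small helper that filters `kubectl get pods -o json` output down to anything not yet Running. A sketch, demonstrated here against a hand-built pod list (the pod names are hypothetical):

```shell
# Sketch: print names of pods whose phase is not Running.
# Against a live cluster you would pipe kubectl output in:
#   kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o json | not_running
not_running() {
  jq -r '.items[] | select(.status.phase != "Running") | .metadata.name'
}

# Demo with a fabricated pod list (names and phases are illustrative):
sample='{"items":[
  {"metadata":{"name":"prod-cluster-0"},"status":{"phase":"Running"}},
  {"metadata":{"name":"prod-cluster-1"},"status":{"phase":"Pending"}}]}'
printf '%s' "$sample" | not_running
```

An empty result from the live query is one of the signals that a retry may be safe; any names printed deserve a closer look before you acknowledge.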

Use maintenance mode for controlled manual repair

If you need to restart or delete managed Pods while admission policies require the openbao.org/maintenance=true signal, enable maintenance mode first and follow Run Planned Maintenance.

If the cluster needs a deeper incident path, move directly into the matching runbook instead of staying in generic safe mode.

Acknowledge and resume automation

Only acknowledge the nonce after the cluster is healthy enough for the operator to continue the paused workflow.

Acknowledge the current nonce

```bash
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```

If the operator re-enters break glass later, it will issue a new nonce. Always use the current value from status.breakGlass.nonce, not a previously copied one.
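One way to avoid acknowledging a stale nonce is to build the patch from a freshly read value rather than a pasted one. A sketch, with the nonce hard-coded here only so the patch construction is visible (the commented kubectl line shows where the live value would come from):

```shell
# Sketch: read the live nonce first, then build the patch from it:
#   nonce=$(kubectl get openbaocluster <name> -n <namespace> \
#     -o jsonpath='{.status.breakGlass.nonce}')
nonce="abc-123-def-456"

patch=$(printf '{"spec":{"breakGlassAck":"%s"}}' "$nonce")
echo "$patch"
# Then apply it:
#   kubectl patch openbaocluster <name> -n <namespace> --type merge -p "$patch"
```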
