Repair rollback failures without forcing a downgrade.
A failed rollback means blue-green automation stopped because continuing automatically could worsen Raft safety or cluster availability. Start with the status surface and the last failed rollback Job, then decide whether the right next step is a retry, a controlled pause for manual repair, or a restore from backup.
Do not force `spec.version` back to an older release to escape the incident. Downgrades are blocked for a reason. Repair the rollback surface first, then either let the operator continue safely or move into the dedicated restore workflow.
## Decision matrix

Use this matrix to choose the rollback recovery path:
| Situation | Use this path | Why |
|---|---|---|
| The rollback failed for a transient reason and the cluster is healthy again. | Retry the rollback by acknowledging the current break-glass nonce. | The operator already knows what work to retry; you only need to confirm the cluster is safe to continue. |
| The cluster still needs manual Raft, workload, or infrastructure repair. | Pause reconciliation, repair manually, then resume with the nonce acknowledgment. | This keeps the operator from racing your repairs while you stabilize the cluster. |
| The cluster state is beyond safe rollback repair. | Restore from a known-good snapshot. | Restore is the explicit recovery path when continuing the rollback is no longer trustworthy. |
## Inspect break-glass and blue-green state

Capture the rollback failure surface:

```shell
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.phase}{"\n"}{.status.blueGreen.lastJobFailure}{"\n"}'
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
```

The expected break-glass pattern here is `reason=RollbackConsensusRepairFailed` while the blue-green phase is still `RollingBack`.
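If you plan to script the later acknowledgment step, you can pull the nonce and reason out of the break-glass payload with `jq`. This is a sketch against a sample payload: the exact field names under `status.breakGlass` (here `nonce` and `reason`) and the sample nonce value are assumptions to verify against your operator's API.

```shell
# Hypothetical status.breakGlass payload; field names may differ on
# your operator version -- inspect the real status output first.
STATUS='{"nonce":"a1b2c3","reason":"RollbackConsensusRepairFailed","message":"consensus repair job failed"}'

# Extract the nonce you must echo back in spec.breakGlassAck, and the
# reason that tells you which recovery path applies.
NONCE=$(echo "$STATUS" | jq -r '.nonce')
REASON=$(echo "$STATUS" | jq -r '.reason')
echo "nonce=$NONCE reason=$REASON"
```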
## Inspect the failed rollback Job and live cluster state

Inspect the rollback Job, Pods, and Raft peers:

```shell
kubectl get jobs -n <namespace> -l openbao.org/cluster=<name>
kubectl logs -n <namespace> job/<job-from-status>
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```
Look for network isolation between blue and green Pods, stuck or sealed Pods, peer membership that no longer matches the intended rollback topology, or executor-job failures that prevented the rollback from completing.
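When judging whether peer membership still matches the intended topology, it helps to count voters and the resulting quorum. The sketch below runs against a sample payload; the JSON shape and the `-format=json` flag are assumptions based on the upstream `raft list-peers` CLI, so verify both against your `bao` build.

```shell
# Sample shaped like `bao operator raft list-peers -format=json` output.
# Field names (data.config.servers[].voter etc.) are assumed from the
# upstream CLI -- check them against your version before scripting this.
PEERS='{"data":{"config":{"servers":[
  {"node_id":"blue-0","address":"10.0.0.1:8201","leader":true,"voter":true},
  {"node_id":"blue-1","address":"10.0.0.2:8201","leader":false,"voter":true},
  {"node_id":"green-0","address":"10.0.1.1:8201","leader":false,"voter":false}]}}}'

# Count voting members and derive the Raft quorum (majority of voters).
VOTERS=$(echo "$PEERS" | jq '[.data.config.servers[] | select(.voter)] | length')
QUORUM=$(( VOTERS / 2 + 1 ))
echo "voters=$VOTERS quorum=$QUORUM"
```

If the live voter set cannot reach that quorum, continuing the rollback automatically would be unsafe, which is exactly why break glass tripped.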
If your repair requires deleting or restarting managed Pods and your admission policy expects the maintenance annotation, enable maintenance mode first and follow Run Planned Maintenance.
## Apply the recovery path that matches the diagnosis

### Retry the rollback

Use this when the failure was transient and the cluster is healthy again. Acknowledge the current break-glass nonce:
```shell
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```
Then watch the replacement rollback Job and `status.blueGreen.phase` until the rollback either completes or enters a new break-glass event with a new nonce.
### Pause and repair

Use this when the cluster needs manual repair before any automation should continue. Pause reconciliation before starting the repair:
```shell
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "paused": true
  }
}'
```
After you repair the cluster, resume reconciliation and acknowledge the current nonce in the same change:
```shell
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "paused": false,
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```
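To convince yourself that a single merge patch really lands both fields in one change, you can simulate the merge locally with `jq`. This is only a sketch: `kubectl --type merge` follows RFC 7386 merge-patch semantics, which for these two scalar fields behaves like the shallow merge below, and the spec values shown are hypothetical.

```shell
# Hypothetical current spec and the resume patch from this page.
SPEC='{"spec":{"paused":true}}'
PATCH='{"spec":{"paused":false,"breakGlassAck":"a1b2c3"}}'

# Shallow-merge the patch's spec over the current spec, as merge-patch
# does for top-level scalar fields.
MERGED=$(echo "$SPEC" | jq -c --argjson p "$PATCH" '.spec += $p.spec')
echo "$MERGED"
```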
### Restore from backup

Use this when the rollback surface is no longer safe to repair in place:

1. Stop any further automation.
2. Identify the last known-good snapshot.
3. Follow Recover After Upgrade Restore.
## Reduce repeat rollback failures

- Enable pre-upgrade snapshots with `spec.upgrade.preUpgradeSnapshot=true` or `spec.upgrade.blueGreen.preUpgradeSnapshot=true`.
- Verify the backup destination and backup credentials before the upgrade window starts.
- Monitor `status.blueGreen.phase`, `status.blueGreen.lastJobFailure`, and cluster health throughout the rollout.
## Continue with the right recovery path

Once the rollback completes or you commit to a restore, continue with the workflow that matches your path: Run Planned Maintenance for Pod-level repair work, or Recover After Upgrade Restore for snapshot recovery.