Version: 0.1.0

Use this runbook when

  • a blue-green rollback enters break glass mode
  • the rollback consensus repair job failed and automation stopped
  • you need to decide whether the rollback can safely retry or needs manual repair
  • you need to restore from a known-good snapshot because rollback repair is no longer enough
Do not try to downgrade around the failure

Do not force spec.version back to an older release to escape the incident. Downgrades are blocked for a reason. Repair the rollback surface first, then either let the operator continue safely or move into the dedicated restore workflow.

Decision matrix

Choose the rollback recovery path.

| Situation | Use this path | Why |
| --- | --- | --- |
| The cluster still needs manual Raft, workload, or infrastructure repair. | Pause reconciliation, repair manually, then resume with the nonce acknowledgment. | This keeps the operator from racing your repairs while you stabilize the cluster. |
| The cluster state is beyond safe rollback repair. | Restore from a known-good snapshot. | Restore is the explicit recovery path when continuing the rollback is no longer trustworthy. |

Inspect break glass and blue-green state


Capture the rollback failure surface

```bash
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.phase}{"\n"}{.status.blueGreen.lastJobFailure}{"\n"}'
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
```

The expected break-glass pattern here is reason=RollbackConsensusRepairFailed while the blue-green phase is still RollingBack.
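To confirm that pattern in one shot, the two fields can be read together. This is a convenience sketch; the field paths are the same ones queried above:

```bash
# Expect: RollbackConsensusRepairFailed RollingBack
kubectl get openbaocluster <name> -n <namespace> -o json \
  | jq -r '"\(.status.breakGlass.reason) \(.status.blueGreen.phase)"'
```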

Inspect the failed rollback job and live cluster state


Inspect the rollback job, Pods, and Raft peers

```bash
kubectl get jobs -n <namespace> -l openbao.org/cluster=<name>
kubectl logs -n <namespace> job/<job-from-status>
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```

Look for network isolation between blue and green Pods, stuck or sealed Pods, peer membership that no longer matches the intended rollback topology, or executor-job failures that prevented the rollback from completing.
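One concrete check for peer-membership drift is to diff the Raft peer list against the live Pods. The helper below is plain bash with no cluster access needed; paste in the node names you captured from `bao operator raft list-peers` and `kubectl get pods`. The example names are hypothetical.

```bash
# Print Raft peers that have no backing Pod (one name per line in each argument).
orphan_peers() {
  comm -23 <(printf '%s\n' "$1" | sort) <(printf '%s\n' "$2" | sort)
}

# Hypothetical example: three peers in the Raft config, but only two Pods running.
orphan_peers 'openbao-0
openbao-1
openbao-2' 'openbao-0
openbao-1'
# prints: openbao-2
```

Any peer it prints is in the Raft configuration but has no running Pod behind it, which is exactly the kind of topology mismatch that blocks a rollback from completing.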

Use maintenance mode before disruptive manual repair

If your repair requires deleting or restarting managed Pods and your admission policy expects the maintenance annotation, enable maintenance mode first and follow Run Planned Maintenance.
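If your policy does require it, the shape is typically a single annotation set before the disruptive step and removed afterward. The annotation key below is a placeholder, not the operator's real key; take the actual key from Run Planned Maintenance.

```bash
# Placeholder key -- substitute the annotation documented in Run Planned Maintenance.
kubectl annotate openbaocluster <name> -n <namespace> \
  openbao.org/maintenance=true --overwrite

# ...perform the disruptive repair...

# Clear the annotation when the repair is done.
kubectl annotate openbaocluster <name> -n <namespace> \
  openbao.org/maintenance-
```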

Apply the recovery path that matches the diagnosis

Use this path when the failure was transient and the cluster is healthy again; if the cluster is beyond safe rollback repair, move to the restore workflow instead.


Acknowledge the current break-glass nonce

```bash
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```

Then watch the replacement rollback Job and status.blueGreen.phase until the rollback either completes or enters a new break-glass event with a new nonce.
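A minimal way to follow both, using only fields already shown in this runbook:

```bash
# In one terminal: watch the replacement rollback Job.
kubectl get jobs -n <namespace> -l openbao.org/cluster=<name> -w

# In another: watch the blue-green phase until the rollback completes
# or a new break-glass event (with a new nonce) appears.
kubectl get openbaocluster <name> -n <namespace> -w \
  -o jsonpath='{.status.blueGreen.phase}{"\n"}'
```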

Reduce repeat rollback failures

  • enable pre-upgrade snapshots with spec.upgrade.preUpgradeSnapshot=true or spec.upgrade.blueGreen.preUpgradeSnapshot=true
  • verify backup destination and backup auth before the upgrade window starts
  • monitor status.blueGreen.phase, status.blueGreen.lastJobFailure, and cluster health throughout the rollout
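As a sketch, the first bullet can be applied with a merge patch. The field path comes straight from the bullet above, but verify it against your CRD schema before relying on it:

```bash
# Enable pre-upgrade snapshots (field name taken from this runbook).
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "upgrade": {
      "preUpgradeSnapshot": true
    }
  }
}'
```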

Continue with the right recovery path

