Version: 0.1.0

Use this runbook when

  • a blue-green rollback enters break glass mode
  • the rollback consensus repair job failed and automation stopped
  • you need to decide whether the rollback can safely retry or needs manual repair
  • you need to restore from a known-good snapshot because rollback repair is no longer enough
Do not try to downgrade around the failure

Do not force spec.version back to an older release to escape the incident. Downgrades are blocked for a reason. Repair the rollback surface first, then either let the operator continue safely or move into the dedicated restore workflow.

Decision matrix

Choose the rollback recovery path.

| Situation | Use this path | Why |
| --- | --- | --- |
| The cluster still needs manual Raft, workload, or infrastructure repair. | Pause reconciliation, repair manually, then resume with the nonce acknowledgment. | This keeps the operator from racing your repairs while you stabilize the cluster. |
| The cluster state is beyond safe rollback repair. | Restore from a known-good snapshot. | Restore is the explicit recovery path when continuing the rollback is no longer trustworthy. |

Inspect break glass and blue-green state


Capture the rollback failure surface

```bash
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.breakGlass}' | jq
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{.status.blueGreen.phase}{"\n"}{.status.blueGreen.lastJobFailure}{"\n"}'
kubectl get openbaocluster <name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
```

The expected break-glass pattern here is reason=RollbackConsensusRepairFailed while the blue-green phase is still RollingBack.
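To confirm that pattern in one shot, the two fields can be read together. This is a convenience sketch; the field paths are the same ones queried above:

```bash
# Expect: RollbackConsensusRepairFailed RollingBack
kubectl get openbaocluster <name> -n <namespace> -o json \
  | jq -r '"\(.status.breakGlass.reason) \(.status.blueGreen.phase)"'
```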

Inspect the failed rollback job and live cluster state


Inspect the rollback job, Pods, and Raft peers

```bash
kubectl get jobs -n <namespace> -l openbao.org/cluster=<name>
kubectl logs -n <namespace> job/<job-from-status>
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -o wide
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
```

Look for network isolation between blue and green Pods, stuck or sealed Pods, peer membership that no longer matches the intended rollback topology, or executor-job failures that prevented the rollback from completing.
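One concrete check for peer-membership drift is to diff the Raft peer list against the live Pods. The helper below is plain bash with no cluster access needed; paste in the node names you captured from `bao operator raft list-peers` and `kubectl get pods`. The example names are hypothetical.

```bash
# Print Raft peers that have no backing Pod (one name per line in each argument).
orphan_peers() {
  comm -23 <(printf '%s\n' "$1" | sort) <(printf '%s\n' "$2" | sort)
}

# Hypothetical example: three peers in the Raft config, but only two Pods running.
orphan_peers 'openbao-0
openbao-1
openbao-2' 'openbao-0
openbao-1'
# prints: openbao-2
```

Any peer it prints is in the Raft configuration but has no running Pod behind it, which is exactly the kind of topology mismatch that blocks a rollback from completing.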

Use maintenance mode before disruptive manual repair

If your repair requires deleting or restarting managed Pods and your admission policy expects the maintenance annotation, enable maintenance mode first and follow Run Planned Maintenance.
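If your policy does require it, the shape is typically a single annotation set before the disruptive step and removed afterward. The annotation key below is a placeholder, not the operator's real key; take the actual key from Run Planned Maintenance.

```bash
# Placeholder key -- substitute the annotation documented in Run Planned Maintenance.
kubectl annotate openbaocluster <name> -n <namespace> \
  openbao.org/maintenance=true --overwrite

# ...perform the disruptive repair...

# Clear the annotation when the repair is done.
kubectl annotate openbaocluster <name> -n <namespace> \
  openbao.org/maintenance-
```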

Apply the recovery path that matches the diagnosis

Use this path when the failure was transient and the cluster is healthy again; if the cluster is beyond safe rollback repair, move to the restore workflow instead.


Acknowledge the current break-glass nonce

```bash
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "breakGlassAck": "<NONCE_FROM_STATUS>"
  }
}'
```

Then watch the replacement rollback Job and status.blueGreen.phase until the rollback either completes or enters a new break-glass event with a new nonce.
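A minimal way to follow both, using only fields already shown in this runbook:

```bash
# In one terminal: watch the replacement rollback Job.
kubectl get jobs -n <namespace> -l openbao.org/cluster=<name> -w

# In another: watch the blue-green phase until the rollback completes
# or a new break-glass event (with a new nonce) appears.
kubectl get openbaocluster <name> -n <namespace> -w \
  -o jsonpath='{.status.blueGreen.phase}{"\n"}'
```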

Reduce repeat rollback failures

  • enable pre-upgrade snapshots with spec.upgrade.preUpgradeSnapshot=true or spec.upgrade.blueGreen.preUpgradeSnapshot=true
  • verify backup destination and backup auth before the upgrade window starts
  • monitor status.blueGreen.phase, status.blueGreen.lastJobFailure, and cluster health throughout the rollout
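As a sketch, the first bullet can be applied with a merge patch. The field path comes straight from the bullet above, but verify it against your CRD schema before relying on it:

```bash
# Enable pre-upgrade snapshots (field name taken from this runbook).
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{
  "spec": {
    "upgrade": {
      "preUpgradeSnapshot": true
    }
  }
}'
```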

Continue with the right recovery path

