Change workload versions without violating Raft safety.
The upgrade manager owns disruptive version changes. It keeps upgrade orchestration out of the workload loop, persists state in status so upgrades survive controller restarts, and prioritizes cluster availability over finishing quickly.
At a glance
Control path
- adminops reconciler
- internal/app/openbaocluster/adminops
- internal/service/upgrade/rolling and internal/service/upgrade/bluegreen
- shared seams in internal/service/upgrade/core, snapshot, and raftops
Owns
- strategy-specific rolling and blue-green phase orchestration
- shared lock, status, metrics, and root-lifecycle mechanics
- upgrade executor jobs, snapshot prerequisites, and Raft coordination
Writes
- status.upgrade, status.blueGreen, and status.breakGlass through shared status helpers
- partition changes, green revision resources, and executor jobs
- break-glass and failure state when rollback safety is compromised
Depends on
- target version policy and image alignment
- backup readiness and network egress for snapshot prerequisites
- operation lifecycle coordination for lock, retry, and phase timing
Architectural Placement
Upgrade execution belongs to the AdminOps orchestration path:
- internal/controller/openbaocluster receives an adminops reconcile event.
- The controller delegates to internal/app/openbaocluster.
- AdminOps orchestration invokes either the rolling or blue-green upgrade manager flow.
- Strategy packages delegate shared mechanics to internal/service/upgrade/core, internal/service/upgrade/snapshot, internal/service/upgrade/raftops, and internal/platform/statusapply.
That keeps upgrade state machines out of the workload loop and lets long-running transitions own their own retry model.
Package Shape
The upgrade subsystem is split so strategy packages keep workflow ownership while shared mechanics live behind narrower seams:
- internal/service/upgrade keeps root helpers that are shared by both strategies but are not strategy-specific or executor-specific, such as request parsing, version and image policy, shared metrics types, pod client helpers, and root lifecycle helpers.
- internal/service/upgrade/rolling owns the rolling state machine: partition progression, leader step-down sequencing, per-pod rollout, convergence, and rolling-specific retry/failure handling.
- internal/service/upgrade/bluegreen owns the blue-green phase machine: green deployment, sync/promotion/cutover, rollback, and break-glass handling.
- internal/service/upgrade/core owns shared lifecycle mechanics used by strategy code, including upgrade locks, common status mutators, metrics session bookkeeping, and blue-green status/state helpers that are not tied to a single phase.
- internal/service/upgrade/snapshot owns shared pre-upgrade snapshot preparation: prerequisite validation, runtime bootstrap, Job state modeling, and common existing-Job result handling.
- internal/service/upgrade/raftops owns executor-side Raft and OpenBao coordination such as leader discovery, leader transfer, peer join/promote/demote/remove, and autopilot capability fallback.
- internal/platform/statusapply owns the shared AdminOps status apply and merge-patch helpers so upgrade, backup, and adminops flows use the same status-subresource ownership rules.
Decision matrix
Strategy selection
| Strategy | Best fit | Primary tradeoff |
|---|---|---|
| Rolling update | Default upgrades with minimal extra infrastructure. | Lower resource cost, but each pod replacement must preserve Raft health and leader safety. |
| Blue-green | High-control cutovers with explicit promotion and rollback phases. | More orchestration and roughly double storage during the transition. |
Diagram
Rolling update flow
Rolling upgrades use StatefulSet partitioning and leader step-down so each pod can be replaced while Raft remains healthy.
Rolling safety controls
- StatefulSet partitioning pauses Kubernetes-driven rollout until the manager explicitly advances each ordinal.
- Reverse ordinal updates and forced leader step-down protect Raft availability during pod replacement.
- Finalization only happens after the StatefulSet revision and observed workload health fully converge.
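The partition-driven ordering above can be sketched as a small planning function. This is an illustrative sketch only, not the manager's actual code: the rollingStep type and planRolling function are hypothetical names, and the plan assumes the leader ordinal is known up front, whereas a real controller would re-discover the leader before every step.

```go
package main

import "fmt"

// rollingStep is a hypothetical description of one manager action.
type rollingStep struct {
	Ordinal       int  // pod ordinal replaced in this step
	Partition     int  // StatefulSet partition value that releases this ordinal
	StepDownFirst bool // true when this pod is the assumed leader
}

// planRolling walks ordinals from highest to lowest. Kubernetes only updates
// pods whose ordinal is >= the partition, so lowering the partition by one
// releases exactly one more pod for replacement.
func planRolling(replicas, leaderOrdinal int) []rollingStep {
	steps := make([]rollingStep, 0, replicas)
	for ordinal := replicas - 1; ordinal >= 0; ordinal-- {
		steps = append(steps, rollingStep{
			Ordinal:       ordinal,
			Partition:     ordinal,
			StepDownFirst: ordinal == leaderOrdinal,
		})
	}
	return steps
}

func main() {
	// Three replicas, with ordinal 1 assumed to be the current Raft leader.
	for _, s := range planRolling(3, 1) {
		fmt.Printf("ordinal=%d partition=%d stepDownFirst=%t\n",
			s.Ordinal, s.Partition, s.StepDownFirst)
	}
}
```

In this sketch the partition never drops by more than one per step, which is what lets the manager check Raft health between replacements before advancing.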
Blue-green creates a second revision and needs roughly double storage capacity for the duration of the transition.
Diagram
Blue-green flow
Blue-green creates a parallel revision, promotes it through explicit phases, then switches traffic only after leadership and voter transitions are safe.
Blue-green safety controls
- The service selector switches to green only in cleanup.
- Manual promotion, manual rollback, and validation-hook failures all route through explicit phase handling in status.
- If rollback consensus repair fails late, the manager enters break-glass and stops risky automation.
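One way to picture the explicit phase handling is a transition function over phase values. The sketch below is a simplified assumption, not the operator's phase machine: the phase names, the choice of which phases count as early, and the nextPhase function are all hypothetical.

```go
package main

import "fmt"

// phase is a hypothetical stand-in for the value kept in status.blueGreen.phase.
type phase string

const (
	deployingGreen phase = "DeployingGreen"
	syncing        phase = "Syncing"
	promoting      phase = "Promoting"
	cleanup        phase = "Cleanup"
	rollingBack    phase = "RollingBack"
	breakGlass     phase = "BreakGlass"
	completed      phase = "Completed"
	aborted        phase = "Aborted"
)

// nextPhase returns the phase that follows current, given whether the work
// for current succeeded. Assumed policy: early phases abort on failure, later
// phases roll back, and a failed rollback ends in break-glass.
func nextPhase(current phase, ok bool) phase {
	switch current {
	case deployingGreen:
		if ok {
			return syncing
		}
		return aborted
	case syncing:
		if ok {
			return promoting
		}
		return aborted
	case promoting:
		if ok {
			return cleanup
		}
		return rollingBack
	case cleanup:
		if ok {
			return completed
		}
		return rollingBack
	case rollingBack:
		if ok {
			return aborted
		}
		return breakGlass
	default:
		// Terminal phases (Completed, Aborted, BreakGlass) never advance
		// automatically; break-glass waits for an operator.
		return current
	}
}

func main() {
	p := deployingGreen
	// Simulated outcomes: promotion fails, then rollback repair fails too.
	for _, ok := range []bool{true, true, false, false} {
		p = nextPhase(p, ok)
		fmt.Println(p)
	}
}
```

Keeping the transition pure (current phase and outcome in, next phase out) means the decision can be recomputed from status after a controller restart.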
State And Recovery Model
Reference table
Status-backed upgrade state
| State surface | What it preserves |
|---|---|
| status.upgrade | Rolling partition progress, completed pods, and finalization gating. |
| status.blueGreen.phase | The active blue-green phase and whether promotion, cleanup, or rollback is in progress. |
| lastErrorReason / lastErrorMessage | Why the current attempt failed and what must change before retry. |
| status.breakGlass | The nonce and diagnostic state when late rollback automation can no longer continue safely. |
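To illustrate why status-backed state survives controller restarts, the sketch below round-trips a simplified status through JSON and then derives the next pod to replace from the restored value alone. The rollingStatus type, its field names, and nextPod are hypothetical; the real status.upgrade schema is not shown on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rollingStatus is a hypothetical, simplified stand-in for status.upgrade.
type rollingStatus struct {
	TargetVersion string `json:"targetVersion"`
	CompletedPods []int  `json:"completedPods,omitempty"`
}

// nextPod returns the highest ordinal not yet recorded as completed.
// ok is false when every pod has already been replaced.
func nextPod(st rollingStatus, replicas int) (ordinal int, ok bool) {
	done := make(map[int]bool, len(st.CompletedPods))
	for _, o := range st.CompletedPods {
		done[o] = true
	}
	for o := replicas - 1; o >= 0; o-- {
		if !done[o] {
			return o, true
		}
	}
	return 0, false
}

func main() {
	// State written before a simulated controller restart.
	before := rollingStatus{TargetVersion: "2.2.0", CompletedPods: []int{2}}
	raw, err := json.Marshal(before)
	if err != nil {
		panic(err)
	}

	// After the restart, only the persisted bytes are available.
	var restored rollingStatus
	if err := json.Unmarshal(raw, &restored); err != nil {
		panic(err)
	}
	ordinal, ok := nextPod(restored, 3)
	fmt.Println(ordinal, ok)
}
```

Because the decision depends only on persisted fields, no in-memory progress has to survive the restart.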
Reference table
Safety boundaries
| Concern | Manager behavior |
|---|---|
| Availability over progress | Rolling pauses or retries when health is ambiguous; blue-green aborts in early phases and rolls back in later phases instead of forcing completion. |
| Version policy and image alignment | Invalid semantic versions, downgrades, and conflicting image/version inputs are rejected before orchestration begins. |
| Backup prerequisites | Snapshot prerequisites and backup authentication must already be valid before upgrade safety checks pass. |
| Atomic completion | Rolling finalization updates upgrade state and currentVersion together so status does not split across two truths. |
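The version-policy row can be sketched as a pre-flight check that runs before any orchestration. This is a minimal illustration under stated assumptions: it accepts only plain MAJOR.MINOR.PATCH strings (with an optional leading v), ignores pre-release and build metadata, and does not cover image alignment. The parseVersion and validateTarget names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errDowngrade is a hypothetical sentinel for rejected downgrades.
var errDowngrade = errors.New("downgrade is not allowed")

// parseVersion parses a strict MAJOR.MINOR.PATCH string into three integers.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("invalid semantic version %q", s)
	}
	for i, p := range parts {
		// ParseUint rejects signs, empty strings, and non-digits.
		n, err := strconv.ParseUint(p, 10, 31)
		if err != nil {
			return v, fmt.Errorf("invalid semantic version %q", s)
		}
		v[i] = int(n)
	}
	return v, nil
}

// validateTarget rejects invalid versions and downgrades up front.
func validateTarget(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	for i := range cur {
		if tgt[i] > cur[i] {
			return nil // target is newer
		}
		if tgt[i] < cur[i] {
			return fmt.Errorf("%w: %s -> %s", errDowngrade, current, target)
		}
	}
	return nil // same version: nothing to reject
}

func main() {
	fmt.Println(validateTarget("2.1.0", "2.2.0"))
	fmt.Println(validateTarget("2.2.0", "2.1.9"))
	fmt.Println(validateTarget("2.2.0", "latest"))
}
```

Rejecting bad input here, before any lock is taken or Job is created, keeps a failed request from leaving partial upgrade state behind.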