Hand off from cluster creation into upgrades, maintenance, and long-running operational work.
Day 2 starts once the cluster is initialized and the workload path is steady. From that point on, long-running operations such as upgrades and backups move through the admin operations path, while maintenance controls gate how much automation is allowed to continue during manual intervention.
At a glance
Starts with
- an initialized cluster with steady-state workload reconciliation
- version drift, backup schedules, or explicit maintenance requests
- operation lifecycle coordination available for lock and retry management
Primary owners
- adminops controller path
- internal/service/upgrade
- internal/service/backup and internal/service/opslifecycle
Writes
- status.upgrade, status.blueGreen, and operation-lock state
- upgrade and backup executor Jobs plus green revision resources when needed
- maintenance annotations or pause-driven no-op behavior depending on user intent
Hands off to
- backup and restore flows once a cluster needs ongoing durability
- troubleshooting and recovery guides when automation must pause
- steady-state workload reconciliation after an operation completes
Architectural Placement
Day 2 work is intentionally separated from the high-churn workload loop:
- Workload reconciliation continues to own the steady-state pod, Service, and config contract.
- Admin operations orchestration takes over when a change requires long-running coordination such as upgrade or backup.
- internal/service/opslifecycle keeps disruptive operations consistent around lock ownership, retry timing, and audit fields.
That separation prevents upgrades, backups, and other long-running workflows from blocking normal workload repair.
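The split described above can be sketched as a routing decision: long-running, disruptive work goes through the admin operations path, while everything else stays in the steady-state workload loop. This is a minimal illustration; the type and function names (OpKind, routeOperation) are hypothetical, not the operator's real API.

```go
package main

import "fmt"

// OpKind is a hypothetical label for the kinds of work the controller sees.
type OpKind int

const (
	RoutineRepair OpKind = iota // steady-state convergence work
	Upgrade                     // long-running, lock-coordinated
	Backup                      // long-running, lock-coordinated
)

// routeOperation keeps long-running coordination out of the high-churn
// workload loop: upgrades and backups go to adminops, everything else
// stays with workload reconciliation.
func routeOperation(op OpKind) string {
	switch op {
	case Upgrade, Backup:
		return "adminops" // lock ownership, retry timing, audit fields
	default:
		return "workload" // pods, Services, ConfigMaps, Secrets
	}
}

func main() {
	fmt.Println(routeOperation(Upgrade))       // adminops
	fmt.Println(routeOperation(RoutineRepair)) // workload
}
```

Because the routing is decided up front, a stuck upgrade cannot starve routine pod repair, and vice versa.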
Diagram: Day 2 control-plane handoff
Once the cluster is live, disruptive operations route through the admin operations path instead of staying inside the high-churn workload controller.
Reference table
Day 2 operation families
| Operation family | Primary owner | Lifecycle role |
|---|---|---|
| Routine workload repair | Workload reconcile path. | Keeps StatefulSets, Services, ConfigMaps, and Secrets converged without entering the long-running adminops model. |
| Upgrade orchestration | Upgrade manager via adminops. | Handles version drift, strategy-specific state, and Raft-aware cutover logic. |
| Backup scheduling | Backup manager via adminops. | Runs snapshot Jobs and updates backup status without moving data through the controller. |
| Manual intervention gates | User-driven pause and maintenance settings. | Limit or reshape automation when an operator needs to intervene directly. |
Upgrade strategies
The operator supports two upgrade strategies: rolling upgrades and blue-green upgrades.
Rolling path
- Version drift triggers pre-upgrade validation around semver, health, and optional snapshot prerequisites.
- The upgrade manager uses StatefulSet partitioning and leader step-down to replace one pod at a time in reverse ordinal order.
- Progress is preserved in status so a failed step can stop cleanly and later resume from an explicit retry request.
- Completion updates currentVersion and clears the transient rolling-upgrade state once the workload fully converges.
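The reverse ordinal order above can be sketched with StatefulSet partitioning: the partition starts at the replica count and is lowered one step at a time, so the pod with the highest ordinal is replaced first. This is an illustrative sketch only; rollOrder is a hypothetical helper, not the upgrade manager's real code.

```go
package main

import "fmt"

// rollOrder returns the ordinals in the order they would be replaced
// under partitioned rolling: lowering the StatefulSet partition to p
// causes pod p (and any pod with ordinal >= p) to be recreated at the
// new version, so decrementing from replicas-1 down to 0 replaces one
// pod at a time in reverse ordinal order.
func rollOrder(replicas int) []int {
	order := make([]int, 0, replicas)
	for partition := replicas - 1; partition >= 0; partition-- {
		order = append(order, partition)
	}
	return order
}

func main() {
	fmt.Println(rollOrder(3)) // [2 1 0]
}
```

Replacing the highest ordinal first lets leader step-down happen late, once the lower ordinals are the only pods left to roll.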
Blue-green path
- A parallel green revision is created and joined as non-voters before any traffic cutover happens.
- Promotion, demotion, cleanup, and rollback all move through explicit phases stored in status.
- The Service selector changes only during cleanup, after a green leader is confirmed and blue peers are ready to leave.
- If rollback safety breaks down late, the manager enters break-glass instead of continuing risky automation blindly.
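The explicit phases above can be sketched as a small state machine in which loss of rollback safety late in the flow routes to break-glass instead of advancing. The phase names and the next function are hypothetical stand-ins for whatever the manager actually stores in status.

```go
package main

import "fmt"

// Phase is a hypothetical blue-green phase label persisted in status.
type Phase string

const (
	Provision  Phase = "ProvisionGreen" // create green revision, join as non-voters
	Promote    Phase = "Promote"        // confirm a green leader
	Cleanup    Phase = "Cleanup"        // flip Service selector, retire blue peers
	Done       Phase = "Done"
	BreakGlass Phase = "BreakGlass" // rollback safety lost: stop and wait for an operator
)

// next advances the phase machine; if rollback safety has broken down
// at the promotion step, the manager halts in break-glass rather than
// continuing risky automation blindly.
func next(p Phase, rollbackSafe bool) Phase {
	switch p {
	case Provision:
		return Promote
	case Promote:
		if !rollbackSafe {
			return BreakGlass
		}
		return Cleanup
	case Cleanup:
		return Done
	default:
		return p // terminal phases hold until an operator acts
	}
}

func main() {
	fmt.Println(next(Promote, true))  // Cleanup
	fmt.Println(next(Promote, false)) // BreakGlass
}
```

Keeping the Service selector change inside the cleanup phase means traffic only moves after the green leader is confirmed, matching the ordering described above.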
Reference table
Operational control surfaces
| Control | What it does | When to use it |
|---|---|---|
| spec.paused=true | Short-circuits reconcilers so the operator stops mutating managed resources for the cluster. | Use when you need manual intervention and want automation to stop entirely. |
| spec.maintenance.enabled=true | Keeps reconciliation running, but marks resources for controlled disruptive changes allowed by policy. | Use when the operator should continue known-safe automation during maintenance work. |
| spec.breakGlassAck | Acknowledges an issued nonce before risky late-stage recovery automation can continue. | Use only after an operator has reviewed a break-glass condition and accepts the next step explicitly. |
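How the three controls might combine can be sketched as a precedence check: pause wins outright, an unmatched break-glass nonce blocks risky work, and maintenance mode otherwise relaxes what automation may do. The Spec fields and reconcileMode function are illustrative assumptions, not the operator's actual types.

```go
package main

import "fmt"

// Spec mirrors the three control surfaces from the table above
// (hypothetical field names for illustration).
type Spec struct {
	Paused             bool   // spec.paused
	MaintenanceEnabled bool   // spec.maintenance.enabled
	BreakGlassAck      string // spec.breakGlassAck
}

// reconcileMode reports how much automation may continue. Pause takes
// precedence; an issued-but-unacknowledged break-glass nonce blocks
// risky late-stage recovery; maintenance mode permits known-safe
// disruptive changes.
func reconcileMode(s Spec, issuedNonce string) string {
	switch {
	case s.Paused:
		return "no-op" // reconcilers short-circuit entirely
	case issuedNonce != "" && s.BreakGlassAck != issuedNonce:
		return "blocked" // wait for an explicit operator ack
	case s.MaintenanceEnabled:
		return "maintenance" // policy-allowed disruptive changes only
	default:
		return "normal"
	}
}

func main() {
	fmt.Println(reconcileMode(Spec{Paused: true}, ""))             // no-op
	fmt.Println(reconcileMode(Spec{MaintenanceEnabled: true}, "")) // maintenance
	fmt.Println(reconcileMode(Spec{}, "nonce-123"))                // blocked
}
```

Checking the ack against the exact issued nonce means a stale acknowledgement from an earlier incident cannot unblock a new break-glass condition.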
Continue the lifecycle
This version tracks a prerelease build. Features and behavior may change before the next stable release.