Hand off from cluster creation into upgrades, maintenance, and long-running operational work.
Day 2 starts once the cluster is initialized and the workload path is steady. From that point on, long-running operations such as upgrades and backups move through the admin operations path, while maintenance controls gate how much automation is allowed to continue during manual intervention.
At a glance
Starts with
- an initialized cluster with steady-state workload reconciliation
- version drift, backup schedules, or explicit maintenance requests
- operation lifecycle coordination available for lock and retry management
Primary owners
- adminops controller path
- internal/service/upgrade
- internal/service/backup and internal/service/opslifecycle
Writes
- status.upgrade, status.blueGreen, and operation-lock state
- upgrade and backup executor Jobs plus green revision resources when needed
- maintenance annotations or pause-driven no-op behavior depending on user intent
Hands off to
- backup and restore flows once a cluster needs ongoing durability
- troubleshooting and recovery guides when automation must pause
- steady-state workload reconciliation after an operation completes
Architectural placement
Day 2 work is intentionally separated from the high-churn workload loop:
- Workload reconciliation continues to own the steady-state pod, Service, and config contract.
- Admin operations orchestration takes over when a change requires long-running coordination such as upgrade or backup.
- internal/service/opslifecycle keeps disruptive operations consistent around lock ownership, retry timing, and audit fields.
That separation prevents upgrades, backups, and other long-running workflows from blocking normal workload repair.
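The lock-ownership discipline that opslifecycle enforces can be sketched as a single-holder, re-entrant lock: one disruptive operation owns the cluster at a time, while routine workload repair never contends for it. This is an illustrative model under assumed names (`OpLock`, `Acquire`, `Release`), not the operator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// OpLock models single-holder operation coordination: only one
// long-running operation (upgrade, backup, ...) may own the cluster
// at a time. Names are illustrative, not the operator's real types.
type OpLock struct {
	holder string // empty means unlocked
}

var ErrLocked = errors.New("another long-running operation holds the lock")

// Acquire succeeds if the lock is free or already held by the same
// operation, which keeps retries of one workflow re-entrant.
func (l *OpLock) Acquire(op string) error {
	if l.holder != "" && l.holder != op {
		return ErrLocked
	}
	l.holder = op
	return nil
}

// Release clears the lock only if the caller still owns it.
func (l *OpLock) Release(op string) {
	if l.holder == op {
		l.holder = ""
	}
}

func main() {
	var lock OpLock
	fmt.Println(lock.Acquire("upgrade")) // <nil>
	fmt.Println(lock.Acquire("backup"))  // conflicting acquire fails
	lock.Release("upgrade")
	fmt.Println(lock.Acquire("backup")) // <nil>
}
```

Because workload repair never calls `Acquire`, a stuck upgrade cannot starve pod or Service reconciliation, which is the separation the section above describes.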
Diagram: Day 2 control-plane handoff. Once the cluster is live, disruptive operations route through the admin operations path instead of staying inside the high-churn workload controller.
Reference table
Day 2 operation families
| Operation family | Primary owner | Lifecycle role |
|---|---|---|
| Routine workload repair | Workload reconcile path. | Keeps StatefulSets, Services, ConfigMaps, and Secrets converged without entering the long-running adminops model. |
| Upgrade orchestration | Upgrade manager via adminops. | Handles version drift, strategy-specific state, and Raft-aware cutover logic. |
| Backup scheduling | Backup manager via adminops. | Runs snapshot Jobs and updates backup status without moving data through the controller. |
| Manual intervention gates | User-driven pause and maintenance settings. | Limit or reshape automation when an operator needs to intervene directly. |
Upgrades follow one of two strategies: rolling or blue-green.
Rolling path
- Version drift triggers pre-upgrade validation around semver, health, and optional snapshot prerequisites.
- The upgrade manager uses StatefulSet partitioning and leader step-down to replace one pod at a time in reverse ordinal order.
- Progress is preserved in status so a failed step can stop cleanly and later resume from an explicit retry request.
- Completion updates currentVersion and clears the transient rolling-upgrade state once the workload fully converges.
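The reverse ordinal ordering in the rolling path falls out of StatefulSet partitioning: lowering `spec.updateStrategy.rollingUpdate.partition` from the replica count down to 0 releases one more pod per step, highest ordinal first. A minimal sketch of that ordering, with an assumed helper name `rollingOrder`:

```go
package main

import "fmt"

// rollingOrder returns pod ordinals in the order a partitioned
// StatefulSet rollout replaces them: highest ordinal first, because
// pods with ordinal >= partition receive the new revision as the
// partition value is stepped down. Illustrative sketch only.
func rollingOrder(replicas int) []int {
	order := make([]int, 0, replicas)
	for partition := replicas - 1; partition >= 0; partition-- {
		// Lowering the partition to this value releases one more pod.
		order = append(order, partition)
	}
	return order
}

func main() {
	fmt.Println(rollingOrder(3)) // [2 1 0]
}
```

Stepping the partition one value at a time is also what makes the resume behavior cheap: the last partition value recorded in status tells the manager exactly which pod is next after a retry.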
Blue-green path
- A parallel green revision is created and joined as non-voters before any traffic cutover happens.
- Promotion, demotion, cleanup, and rollback all move through explicit phases stored in status.
- The Service selector changes only during cleanup, after a green leader is confirmed and blue peers are ready to leave.
- If rollback safety breaks down late, the manager enters break-glass instead of continuing risky automation blindly.
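The blue-green bullets above describe a phase machine persisted in status. The sketch below models those transitions, including the break-glass terminal branch; the phase identifiers are assumptions mirroring the prose, not the operator's real status values.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase names mirror the prose above (provision, join as non-voters,
// promote, cleanup, rollback, break-glass) but are illustrative.
type Phase string

const (
	PhaseProvisionGreen Phase = "ProvisionGreen" // create parallel green revision
	PhaseJoinNonVoters  Phase = "JoinNonVoters"  // green members join as non-voters
	PhasePromote        Phase = "Promote"        // promote green, demote blue
	PhaseCleanup        Phase = "Cleanup"        // flip Service selector, retire blue
	PhaseRollback       Phase = "Rollback"
	PhaseBreakGlass     Phase = "BreakGlass"
)

// next returns the phase to advance to. Rollback is only reachable
// before Cleanup; once rollback safety is gone, the machine enters
// BreakGlass instead of continuing automation.
func next(p Phase, rollback bool) (Phase, error) {
	switch p {
	case PhaseProvisionGreen:
		return PhaseJoinNonVoters, nil
	case PhaseJoinNonVoters, PhasePromote:
		if rollback {
			return PhaseRollback, nil
		}
		if p == PhaseJoinNonVoters {
			return PhasePromote, nil
		}
		return PhaseCleanup, nil
	case PhaseCleanup:
		if rollback {
			// Too late to roll back safely: stop and wait for a human.
			return PhaseBreakGlass, nil
		}
		return PhaseCleanup, nil
	}
	return p, errors.New("no transition defined")
}

func main() {
	p := PhaseProvisionGreen
	for i := 0; i < 3; i++ {
		p, _ = next(p, false)
		fmt.Println(p)
	}
}
```

Storing the phase in status, as the real manager does, means a controller restart resumes from the recorded phase rather than re-deriving progress from cluster state.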
Reference table
Operational control surfaces
| Control | What it does | When to use it |
|---|---|---|
| spec.paused=true | Short-circuits reconcilers so the operator stops mutating managed resources for the cluster. | Use when you need manual intervention and want automation to stop entirely. |
| spec.maintenance.enabled=true | Keeps reconciliation running, but marks resources for controlled disruptive changes allowed by policy. | Use when the operator should continue known-safe automation during maintenance work. |
| spec.breakGlassAck | Acknowledges an issued nonce before risky late-stage recovery automation can continue. | Use only after an operator has reviewed a break-glass condition and accepts the next step explicitly. |