OpenBaoRestore Controller (Restore Lifecycle)¶
Responsibility
Reconcile OpenBaoRestore resources and orchestrate snapshot restores via a Kubernetes Job.
User Guide
For operational instructions, see the Restore User Guide.
1. Design Philosophy¶
- CRD-Based: Restores are modeled as
OpenBaoRestoreobjects, not as a mode ofOpenBaoCluster. This ensures GitOps stability and provides an audit log of restore operations. - Stateless Controller: The controller polls the restore Job rather than watching it, minimizing RBAC requirements.
- Safety First: Restores use a distinct Operation Lock to prevent conflicts with Backups or Upgrades.
2. Restore Lifecycle¶
The controller drives the OpenBaoRestore through a defined phase machine.
graph TD
Start[User Creates OpenBaoRestore] --> Pending
Pending --> Validating{Validate}
Validating -- Invalid --> Failed[Phase: Failed]
Validating -- Valid --> Lock{Acquire Lock}
Lock -- Locked --> Pending
Lock -- Acquired --> Running[Phase: Running]
subgraph Execution [Restore Job]
Running --> Job[Launch Job]
Job --> Pull[Pull Snapshot]
Pull --> Restore[Restore to Vault]
end
Restore -- Success --> Completed[Phase: Completed]
Restore -- Error --> Retrying{Retry?}
Retrying -- Yes --> Job
Retrying -- No --> Failed
classDef phase fill:transparent,stroke:#9333ea,stroke-width:2px,color:#fff;
classDef terminal fill:transparent,stroke:#22c55e,stroke-width:2px,color:#fff;
classDef error fill:transparent,stroke:#dc2626,stroke-width:2px,color:#fff;
class Pending,Validating,Running,Execution phase;
class Completed terminal;
class Failed error;
3. Workflow Steps¶
- Validation:
- Target Cluster exists.
- Snapshot source is accessible.
- No conflicting operations (unless
spec.forceis used).
- Operation Lock:
- The controller sets
OpenBaoCluster.status.operationLock = "Restore". - This blocks the BackupManager and UpgradeManager from starting new operations.
- The controller sets
- Execution:
- A Kubernetes Job is spawned with the
bao-backupbinary in restore mode. - It downloads the snapshot from object storage (S3, GCS, or Azure).
- It uses a temporary token (or valid credentials) to authenticate and hit the
sys/storage/raft/snapshot-forceendpoint.
- A Kubernetes Job is spawned with the
- Completion:
- On success, the lock is released.
- The cluster may need to be unsealed manually or via auto-unseal.
4. Interaction with Other Managers¶
Conflict Prevention
The Operation Lock is the primary mechanism for safety.
- Backups: Will skip scheduled runs if a Restore is locked.
- Upgrades: Will pause reconciliation if a Restore is locked.
To override this check during emergencies (e.g., restoring a broken cluster where the lock is stuck), use spec.overrideOperationLock.