Run Raft snapshots as stateless jobs and keep retention out of the data plane.
The backup manager owns scheduled and manual snapshot orchestration for OpenBaoCluster. It validates cluster readiness, acquires the operation lock, creates executor Jobs, and records backup state so backups stay auditable and resumable without embedding snapshot transport inside the controller.
At a glance
Control path
- adminops reconciler
- internal/app/openbaocluster/adminops
- internal/service/backup
Owns
- backup trigger detection for schedules and manual requests
- preflight validation and operation-lock ownership for backup
- retention evaluation after successful uploads
Writes
- backup executor Jobs and job annotations
- status.backup timing, success, and failure counters
- operation lock state while backup is in progress
Depends on
- cluster health and absence of conflicting upgrade or restore work
- spec.backup target, authentication, and executor image configuration
- object-storage reachability and trust configuration for the selected provider
Architectural Placement
Backup orchestration belongs to the AdminOps path:
- internal/controller/openbaocluster receives an adminops reconcile event.
- The controller delegates into internal/app/openbaocluster/adminops.
- AdminOps orchestration invokes internal/service/backup to validate, launch, and observe backup execution.
That keeps the controller focused on reconcile plumbing while the backup manager owns timing, job launch, and retention decisions.
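The layering above can be sketched as a thin orchestration type that depends only on a service interface. This is an illustrative shape, not the real code: the type names, methods, and stub below are assumptions, and the actual interfaces in internal/app/openbaocluster/adminops and internal/service/backup differ.

```go
package main

import "fmt"

// BackupService is the assumed surface adminops depends on:
// validation and job launch, but never snapshot transport.
type BackupService interface {
	Validate(cluster string) error
	Launch(cluster string) (jobName string, err error)
}

// AdminOps delegates backup work to the service layer instead of
// streaming snapshot data itself.
type AdminOps struct {
	Backup BackupService
}

func (a *AdminOps) Reconcile(cluster string) error {
	if err := a.Backup.Validate(cluster); err != nil {
		return fmt.Errorf("preflight failed: %w", err)
	}
	job, err := a.Backup.Launch(cluster)
	if err != nil {
		return err
	}
	fmt.Printf("launched backup job %s\n", job)
	return nil
}

// stubService stands in for internal/service/backup in this sketch.
type stubService struct{}

func (stubService) Validate(string) error           { return nil }
func (stubService) Launch(c string) (string, error) { return c + "-backup", nil }

func main() {
	ops := &AdminOps{Backup: stubService{}}
	_ = ops.Reconcile("vault-prod")
}
```

The point of the indirection is that the controller package never needs to know how a snapshot is taken or uploaded, only whether the service accepted the request.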
Reference table
Owned surfaces
| Surface | What the manager decides | Why it matters |
|---|---|---|
| Backup trigger window | Whether a cron window, manual trigger, or pre-upgrade request should launch a new Job. | Backups need at-most-once behavior per scheduled window and predictable manual overrides. |
| Executor Job | Job name, annotations, auth wiring, and provider-specific environment for the backup binary. | The controller should schedule work, not stream snapshot data itself. |
| status.backup | Attempt timing, next schedule, last success, and consecutive failure state. | Operators need backup visibility without inspecting transient Jobs. |
| Retention policy | Which completed backups can be deleted after a successful upload. | Retention belongs to the control plane so cleanup stays consistent across providers. |
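The retention row can be made concrete with a minimal sketch: keep the newest N snapshots and mark the rest for deletion. The function below is illustrative only; the operator's actual retention policy and its inputs are defined in internal/service/backup, and the keep-count semantics here are an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// expiredSnapshots returns the timestamps of snapshots that fall outside
// the newest `keep` entries. Running this only after a successful upload
// guarantees a fresh recovery point exists before anything is deleted.
func expiredSnapshots(timestamps []int64, keep int) []int64 {
	if keep <= 0 || len(timestamps) <= keep {
		return nil // nothing to delete
	}
	sorted := append([]int64(nil), timestamps...)
	// Sort newest first so the slice tail is the expired set.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[keep:]
}

func main() {
	fmt.Println(expiredSnapshots([]int64{100, 300, 200, 400}, 2)) // [200 100]
}
```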
Backup Flow
Diagram
Validate, launch, then record
The backup manager validates cluster state first, launches a stateless Job second, and only updates backup status after the Job reaches a terminal result.
Reference table
Preflight and status model
| Check | Manager behavior |
|---|---|
| Cluster readiness | Backup launches only when the cluster is in a stable running phase and the workload is not already mid-transition. |
| Conflicting operations | Restore and active upgrade state block backup launch; only one long-running operation may own the cluster lock at a time. |
| At-most-once scheduling | status.backup.lastAttemptScheduledTime and nextScheduledBackup prevent duplicate launches in the same cron window. |
| Failure accounting | Consecutive failures increase only when a terminal Job fails, not on every reconcile that notices the same failed Job. |
Provider And Retention Surfaces
Reference table
Provider integration surfaces
| Provider family | Auth patterns the manager supports | What stays the same |
|---|---|---|
| S3-compatible | Static access keys, explicit web identity, ambient workload identity, or ServiceAccount annotation-driven identity. | The manager still creates one executor Job and records status the same way after upload completes. |
| GCS | Service account key, Application Default Credentials, or Workload Identity metadata on the generated pod identity. | Upload and retention stay job-driven; only the credential wiring changes. |
| Azure Blob Storage | Account key, connection string, or managed identity/workload identity defaults. | Retention and backup naming stay provider-agnostic at the manager boundary. |
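The "credential wiring changes, job shape doesn't" point can be sketched as a small selection function: the manager picks an auth mode per provider config, while everything downstream stays identical. The field names and mode labels below are hypothetical and do not reflect the actual spec.backup schema.

```go
package main

import "fmt"

// AuthConfig is a hypothetical stand-in for a provider's auth settings
// (static keys, explicit web identity, or nothing configured).
type AuthConfig struct {
	StaticSecretRef string // e.g. access keys / account key / SA key secret
	WebIdentity     bool   // explicit federated identity requested
}

// authMode picks the credential wiring for the executor Job; the Job
// itself and the status recording are the same for every branch.
func authMode(provider string, a AuthConfig) string {
	switch {
	case a.StaticSecretRef != "":
		return provider + ":static-credentials"
	case a.WebIdentity:
		return provider + ":web-identity"
	default:
		// Fall back to ambient identity supplied by the pod environment
		// (e.g. IRSA, GKE Workload Identity, Azure managed identity).
		return provider + ":ambient-identity"
	}
}

func main() {
	fmt.Println(authMode("s3", AuthConfig{StaticSecretRef: "creds"}))
	fmt.Println(authMode("gcs", AuthConfig{}))
}
```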
Backups are stored under a stable object prefix so restore workflows can locate artifacts without reverse-engineering Job names:
<pathPrefix>/<namespace>/<cluster>/<timestamp>-<short-uuid>.snap
Reference table
Safety boundaries
| Concern | Manager behavior |
|---|---|
| No data-plane coupling | The controller never handles snapshot bytes directly; the executor Job performs authentication, snapshot, and upload work. |
| Retention timing | Retention runs only after a successful upload so cleanup never removes older recovery points before a new one exists. |
| Upgrade coordination | Pre-upgrade snapshots reuse backup job machinery rather than creating a second snapshot implementation in the upgrade manager. |
| Local buffering risk | The backup path is designed around streaming to object storage rather than writing large transient snapshot files inside the controller. |
Related deep dives