Run planned maintenance safely
Planned maintenance is where Kubernetes disruption rules, Raft quorum, and admission-policy guardrails meet. Use this page to prepare drains and restarts, scale safely, and confirm the cluster returns to normal operation.
Decision matrix
Choose the right control
| Control | Use it when | Operator behavior | Watch for |
|---|---|---|---|
| Node drain with PDB protection | You are changing the Kubernetes substrate and want the normal workload to keep serving. | The PodDisruptionBudget blocks voluntary disruption that would take too many voters offline at once. | The PDB only protects against voluntary evictions, not hard node failures. |
| Replica scaling | You need more capacity, stronger fault tolerance, or a deliberate reduction after a change. | The operator grows or shrinks the StatefulSet and manages peer membership as the replica count changes. | Do not treat scale-down as a harmless cost-saving action on a production Raft cluster. |
| Maintenance mode | Admission policy requires the openbao.org/maintenance=true signal before restarts or controlled deletes. | The operator annotates managed resources so callers with maintenance permission on the owning OpenBaoCluster can perform planned restarts or deletes. | Grant the custom maintenance verb on the owning OpenBaoCluster before using this path. |
| Pause reconciliation | You need a short-lived window where the operator stops mutating the cluster while you inspect or repair it. | The operator stops normal reconciliation until you resume it. | Pausing stops normal reconciliation, but safe-mode incidents still require the dedicated recovery flow. |
Drain nodes without breaking quorum
For clusters with three or more replicas, the operator creates a PodDisruptionBudget with maxUnavailable: 1. That is the main guardrail that keeps a normal node drain from evicting too many Pods at once.
Reference table
Pod disruption behavior by replica count
| Replicas | PDB created | What it means |
|---|---|---|
| 1 | No | There is no redundancy. Any disruption takes the service down. |
| 2 | No | A two-voter Raft cluster needs both voters for quorum, so a maxUnavailable: 1 policy would allow an eviction that breaks quorum. |
| 3 | Yes | The PDB keeps two voters available while one Pod is evicted. |
| 5 | Yes | The operator still uses a conservative one-at-a-time disruption model. |
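The rows above follow from Raft majority arithmetic. As a minimal sketch, the shell functions below compute the quorum size and the number of voters that can be lost for the replica counts in the table. The function names are hypothetical helpers, not part of the operator.

```shell
# Raft needs a strict majority of voters to make progress: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of voters that can be unavailable while quorum still holds.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the arithmetic for the replica counts listed in the table.
for n in 1 2 3 5; do
  printf 'replicas=%s quorum=%s tolerated_failures=%s\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

With two replicas the tolerated failure count is zero, which is why a maxUnavailable: 1 budget would not be safe at that size.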
Verify
Check the disruption budget before a drain
kubectl get pdb -n <namespace>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
If more than one OpenBao Pod runs on the same node, the drain takes longer: the PDB lets Kubernetes evict only one Pod at a time, and the next eviction waits until the cluster is back within its disruption budget.
Node drains, autoscaler evictions, and direct eviction API calls are guarded. Node crashes, OOM kills, or kernel failures are not. Those rely on normal Raft quorum behavior instead of disruption-budget enforcement.
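Before draining, it can help to see how many of the cluster's Pods sit on the target node and how many disruptions each budget currently allows. The sketch below assumes the openbao.org/cluster=<name> label used in the verification commands on this page. The function names are hypothetical.

```shell
# Count the cluster's Pods scheduled on one node.
# Usage: pods_on_node <namespace> <cluster-name> <node-name>
pods_on_node() {
  kubectl get pods -n "$1" \
    -l "openbao.org/cluster=$2" \
    --field-selector "spec.nodeName=$3" \
    -o name | wc -l | tr -d ' '
}

# List each PodDisruptionBudget in the namespace together with its
# current status.disruptionsAllowed value.
# Usage: pdb_allowance <namespace>
pdb_allowance() {
  kubectl get pdb -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}'
}
```

A count above 1 from pods_on_node means the drain will proceed one eviction at a time. A disruptionsAllowed value of 0 means the next eviction is refused until the budget allows it again.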
Scale the cluster deliberately
Use scaling as an intentional operational change, not a quick patch to quiet a temporary issue.
- Scale up
- Scale down
Configure
Increase the replica count
spec:
  replicas: 5
The operator creates the new Pods, waits for them to join the Raft cluster, and updates the PDB to match the new size.
Configure
Reduce the replica count
spec:
  replicas: 3
The operator removes voters from the highest ordinal first, waits for the Raft configuration to converge, and only then deletes the excess Pods.
Reducing a production cluster below three replicas removes the redundancy that makes ordinary node and Pod failures survivable.
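One way to apply a replica change is a merge patch against the OpenBaoCluster resource. The sketch below rests on assumptions: scale_cluster is a hypothetical wrapper, and its refusal to go below three replicas is an illustrative guard, not operator behavior.

```shell
# Patch spec.replicas on an OpenBaoCluster.
# Usage: scale_cluster <namespace> <cluster-name> <replicas>
scale_cluster() {
  # Reject empty or non-numeric replica counts before comparing them.
  case "$3" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 2
      ;;
  esac
  # Illustrative guard (not operator behavior): refuse counts below 3,
  # because fewer replicas leave no tolerance for a lost voter.
  if [ "$3" -lt 3 ]; then
    echo "refusing to scale below 3 replicas" >&2
    return 1
  fi
  kubectl patch openbaocluster "$2" -n "$1" \
    --type merge -p "{\"spec\":{\"replicas\":$3}}"
}
```

For example, scale_cluster <namespace> <name> 5 requests the scale-up shown above, while a request for 2 replicas fails before kubectl is called.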
Use maintenance mode for controlled restarts
Enable maintenance mode when your admission policies require a deliberate maintenance signal before managed Pods or the StatefulSet can be restarted, deleted, or otherwise modified during planned work.
Configure
Enable maintenance mode
spec:
  maintenance:
    enabled: true
In this mode, the operator annotates managed Pods and the StatefulSet with openbao.org/maintenance=true. Callers still need normal Kubernetes RBAC on the target resource plus the custom maintenance verb on the owning OpenBaoCluster.
This mode is also required for some day-2 changes that need a controlled restart path, such as finishing filesystem expansion after increasing spec.storage.size.
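To confirm that the annotation landed, the managed Pods can be listed with their openbao.org/maintenance value. This is a minimal sketch that assumes the openbao.org/cluster=<name> label used elsewhere on this page. The function name is hypothetical.

```shell
# Print each managed Pod with its openbao.org/maintenance annotation value.
# Dots inside the annotation key are escaped for kubectl's JSONPath.
# Usage: maintenance_annotations <namespace> <cluster-name>
maintenance_annotations() {
  kubectl get pods -n "$1" -l "openbao.org/cluster=$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.openbao\.org/maintenance}{"\n"}{end}'
}
```

A Pod that shows an empty second column has not been annotated yet.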
Trigger a rolling restart
Use spec.runtime.restartAt when you need the workload to roll because an external dependency changed, such as a certificate chain, secret material, or another input that should force a controlled refresh.
Configure
Request a rolling restart
spec:
  runtime:
    restartAt: "2026-01-19T00:00:00Z"
This request is independent of maintenance authorization. Enable maintenance mode only when you need disruptive work on managed resources or an operator flow that explicitly requires the maintenance gate.
Use spec.runtime.restartAt for new configurations. The older spec.maintenance.restartAt path remains temporarily for compatibility.
When a leader Pod must be restarted or evicted, the operator handles graceful step-down automatically before termination so the cluster can elect a new leader cleanly.
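The restart request can also be stamped with the current UTC time instead of a hand-written value. This is a minimal sketch, assuming a merge patch against the OpenBaoCluster fits your workflow. Both function names are hypothetical.

```shell
# Current UTC time in the same format as the restartAt example value,
# for example 2026-01-19T00:00:00Z.
restart_timestamp() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Set spec.runtime.restartAt to the current time.
# Usage: request_restart <namespace> <cluster-name>
request_restart() {
  kubectl patch openbaocluster "$2" -n "$1" --type merge \
    -p "{\"spec\":{\"runtime\":{\"restartAt\":\"$(restart_timestamp)\"}}}"
}
```

If the cluster is managed declaratively, setting the same field in the source manifest is likely the better choice, so the value does not drift from what is in version control.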
Verify the cluster before and after the window
Verify
Inspect health before and after maintenance
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.phase}{"\n"}'
kubectl get pods -n <namespace> -l openbao.org/cluster=<name>
kubectl exec -n <namespace> -it <pod-name> -- bao operator raft list-peers
The important end state is a clean phase, Ready Pods, and a Raft peer set that matches the intended topology after the maintenance action finishes.
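The Ready Pods part of that end state can be checked mechanically by comparing the number of Ready Pods with spec.replicas. This is a minimal sketch with a hypothetical function name. It does not inspect the phase or the Raft peer list, which still need the commands above.

```shell
# Succeed only when the number of Ready Pods equals spec.replicas.
# Usage: ready_matches_spec <namespace> <cluster-name>
ready_matches_spec() {
  local want ready
  # Desired replica count from the OpenBaoCluster spec.
  want="$(kubectl get openbaocluster "$2" -n "$1" -o jsonpath='{.spec.replicas}')"
  # One line per Pod holding the status of its Ready condition.
  ready="$(kubectl get pods -n "$1" -l "openbao.org/cluster=$2" \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c '^True$')"
  echo "ready=${ready} want=${want}"
  [ -n "$want" ] && [ "$ready" = "$want" ]
}
```

Running it once before the window and once after gives a quick signal that the Pod count returned to the intended size.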