Cluster Maintenance¶
This guide covers maintenance operations for OpenBao clusters, including node drains, voluntary disruptions, and cluster scaling.
Pod Disruption Budgets¶
The Operator automatically creates a PodDisruptionBudget (PDB) for each cluster with 2 or more replicas. This protects against accidental quorum loss during voluntary disruptions.
Default Behavior¶
| Replicas | PDB Created | Max Unavailable | Notes |
|---|---|---|---|
| 1 | No | N/A | Single-replica clusters have no redundancy |
| 3 | Yes | 1 | Ensures quorum (2 of 3) is always maintained |
| 5 | Yes | 1 | Conservative setting; 4 pods remain available |
The PDB uses maxUnavailable: 1, meaning Kubernetes will block eviction requests that would take more than one pod offline simultaneously.
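For reference, the PDB the Operator generates for a 3-replica cluster looks roughly like the following; the name, namespace, and selector labels are illustrative and are derived from your cluster resource:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-openbao          # illustrative; the Operator derives the real name
  namespace: openbao
spec:
  maxUnavailable: 1              # at most one pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app.kubernetes.io/instance: example-openbao   # illustrative selector label
```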
During Node Drains¶
When you drain a node hosting OpenBao pods:
- Kubernetes checks the PDB before evicting any pods.
- If eviction would violate the PDB, the drain blocks and waits.
- Pods are evicted one at a time, allowing the cluster to maintain quorum.
```bash
# Safe: node drain respects the PDB
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Check if a PDB is blocking eviction
kubectl get pdb -n <namespace>
```
Drain Timeouts
If multiple OpenBao pods are scheduled on the same node, draining that node may take longer as pods are evicted sequentially. Consider using pod anti-affinity to spread pods across nodes.
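A minimal sketch of such an anti-affinity rule is shown below; the label and the place this snippet goes in your cluster spec are assumptions, so adapt them to how your installation exposes pod template overrides:

```yaml
# Illustrative pod anti-affinity: schedule at most one OpenBao pod per node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: example-openbao   # assumed label
        topologyKey: kubernetes.io/hostname
```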
Limitations¶
The PDB only protects against voluntary disruptions:
- Node drains (`kubectl drain`)
- Cluster autoscaler scale-down
- Pod eviction API calls
It does not protect against:
- Node failures/crashes
- Pod OOM kills
- Node kernel panics
For these scenarios, rely on Raft's built-in quorum-based replication: the cluster remains available as long as a majority of voting nodes stay healthy.
Scaling¶
Scaling Up¶
Increase replicas to add capacity or improve fault tolerance:
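One way to do this, assuming the cluster is exposed as a custom resource with a spec.replicas field (the resource kind and name below are illustrative):

```bash
# Illustrative: raise the replica count on the cluster resource
kubectl patch openbaocluster example-openbao -n openbao \
  --type merge -p '{"spec":{"replicas":5}}'
```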
The Operator will:
- Create new pods via the StatefulSet
- Wait for each pod to join the Raft cluster
- Update the PDB to match the new replica count
Scaling Down¶
Reduce replicas to save resources:
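Again assuming a spec.replicas field on the cluster resource (kind and name are illustrative):

```bash
# Illustrative: lower the replica count; never go below 3 in production
kubectl patch openbaocluster example-openbao -n openbao \
  --type merge -p '{"spec":{"replicas":3}}'
```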
Minimum Replicas
Never scale below 3 replicas in production. Scaling to 1 replica means any pod failure results in complete unavailability.
The Operator will:
- Gracefully remove Raft voters starting from the highest ordinal
- Wait for Raft configuration to update
- Delete the excess pods
- Update the PDB
Leader Step-Down¶
When maintenance requires the active leader to be evicted, the Operator automatically triggers a graceful step-down:
- Pre-eviction hook detects that the leader is being terminated
- Step-down request is sent to `/sys/step-down`
- New leader election completes before the old leader terminates
- Eviction proceeds without service interruption
This behavior is automatic and requires no manual intervention.
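If you want to exercise a step-down by hand before a maintenance window, a sketch assuming the bao CLI is available inside the container image (pod name and namespace are illustrative):

```bash
# Ask the active node to step down so Raft elects a new leader
kubectl exec -n openbao example-openbao-0 -- bao operator step-down
```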
Maintenance Windows¶
For planned maintenance, consider:
- Pause reconciliation during complex operations:
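  The exact mechanism depends on your Operator version; a common pattern is a pause annotation on the cluster resource. The annotation key and resource kind below are hypothetical, so check your Operator's documentation for the real ones:

  ```bash
  # Hypothetical annotation key and resource kind; adjust to your installation
  kubectl annotate openbaocluster example-openbao -n openbao openbao.org/paused=true
  ```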
- Enable maintenance mode when your cluster enforces managed-resource mutation locks. When enabled, the Operator annotates the managed Pods/StatefulSet with openbao.org/maintenance=true to support controlled deletes/restarts under strict admission policies:
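  How maintenance mode is switched on depends on your cluster spec; the sketch below assumes a boolean spec.maintenanceMode field (the field name is an assumption) and then verifies the annotation the Operator applies:

  ```bash
  # Hypothetical spec field; check your CRD for the actual switch
  kubectl patch openbaocluster example-openbao -n openbao \
    --type merge -p '{"spec":{"maintenanceMode":true}}'

  # Confirm the Operator has annotated the managed StatefulSet
  kubectl get statefulset example-openbao -n openbao -o yaml | grep openbao.org/maintenance
  ```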
- Trigger a rolling restart (for example, after rotating external dependencies):
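  Assuming the Operator tolerates direct restarts of the StatefulSet it manages, the StatefulSet's rolling update strategy replaces pods one at a time (the StatefulSet name is illustrative):

  ```bash
  kubectl rollout restart statefulset/example-openbao -n openbao
  kubectl rollout status statefulset/example-openbao -n openbao
  ```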
- Monitor cluster health before and after:
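  For example, by watching pod readiness and the PDB's disruption status (label and namespace are illustrative):

  ```bash
  kubectl get pods -n openbao -l app.kubernetes.io/instance=example-openbao -w
  kubectl get pdb -n openbao
  ```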
- Check Raft peer status if needed:
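  A sketch assuming the bao CLI is present in the container image (pod name is illustrative):

  ```bash
  # Exactly one node should report itself as leader
  kubectl exec -n openbao example-openbao-0 -- bao operator raft list-peers
  ```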