Cluster Maintenance

This guide covers maintenance operations for OpenBao clusters, including node drains, voluntary disruptions, and cluster scaling.

Pod Disruption Budgets

The Operator automatically creates a PodDisruptionBudget (PDB) for each cluster with 2 or more replicas. This protects against accidental quorum loss during voluntary disruptions.

Default Behavior

Replicas   PDB Created   Max Unavailable   Notes
1          No            N/A               Single-replica clusters have no redundancy
3          Yes           1                 Ensures quorum (⅔) is always maintained
5          Yes           1                 Conservative setting; 4 pods remain available

The PDB uses maxUnavailable: 1, meaning Kubernetes will block eviction requests that would take more than one pod offline simultaneously.
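
For reference, the generated PDB looks roughly like the following. This is a minimal sketch; the exact name and labels are chosen by the Operator and may differ.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <cluster-name>            # illustrative; the Operator sets the actual name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      openbao.org/cluster: <cluster-name>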

During Node Drains

When you drain a node hosting OpenBao pods:

  1. Kubernetes checks the PDB before evicting any pods.
  2. If eviction would violate the PDB, the drain blocks and waits.
  3. Pods are evicted one at a time, allowing the cluster to maintain quorum.

# Safe: Node drain respects PDB
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Check if PDB is blocking eviction
kubectl get pdb -n <namespace>

Drain Timeouts

If multiple OpenBao pods are scheduled on the same node, draining that node may take longer as pods are evicted sequentially. Consider using pod anti-affinity to spread pods across nodes.
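
As a sketch, a required pod anti-affinity rule spreads replicas one per node. The rule below is standard Kubernetes pod-spec syntax; the affinity field name on the OpenBaoCluster spec is an assumption here and may differ in your CRD version.

spec:
  affinity:                       # field name on the OpenBaoCluster spec is an assumption
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              openbao.org/cluster: <cluster-name>
          topologyKey: kubernetes.io/hostname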

Limitations

The PDB only protects against voluntary disruptions:

  • Node drains (kubectl drain)
  • Cluster autoscaler scale-down
  • Pod eviction API calls

It does not protect against:

  • Node failures/crashes
  • Pod OOM kills
  • Node kernel panics

For these scenarios, rely on Raft's built-in quorum-based replication.
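
As a rule of thumb, Raft needs a majority of voters: a 3-node cluster tolerates 1 failed node and a 5-node cluster tolerates 2. After an involuntary disruption you can confirm that the survivors still form a healthy cluster, for example:

# Check seal and HA status on a surviving node
kubectl exec -it <pod-name> -- bao status

# List the peers that remain in the Raft configuration
kubectl exec -it <pod-name> -- bao operator raft list-peers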

Scaling

Scaling Up

Increase replicas to add capacity or improve fault tolerance:

spec:
  replicas: 5  # Was 3
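
Equivalently, the resource can be patched in place. The kind and label below are the ones used elsewhere in this guide; the name and namespace are placeholders:

# Apply the new replica count without editing the full manifest
kubectl patch openbaocluster <name> -n <namespace> --type merge -p '{"spec":{"replicas":5}}'

# Watch the new pods come up and join the cluster
kubectl get pods -n <namespace> -l openbao.org/cluster=<name> -w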

The Operator will:

  1. Create new pods via the StatefulSet
  2. Wait for each pod to join the Raft cluster
  3. Update the PDB to match the new replica count

Scaling Down

Reduce replicas to save resources:

spec:
  replicas: 3  # Was 5

Minimum Replicas

Never scale below 3 replicas in production. Scaling to 1 replica means any pod failure results in complete unavailability.

The Operator will:

  1. Gracefully remove Raft voters starting from the highest ordinal
  2. Wait for Raft configuration to update
  3. Delete the excess pods
  4. Update the PDB
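
To confirm the scale-down completed cleanly, check that the removed pods no longer appear as Raft peers and that the PDB reflects the new size:

# Peers should list only the remaining replicas
kubectl exec -it <pod-name> -- bao operator raft list-peers

# The PDB should now track the reduced replica count
kubectl get pdb -n <namespace>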

Leader Step-Down

When maintenance requires the active leader to be evicted, the Operator automatically triggers a graceful step-down:

  1. Pre-eviction hook detects the leader is being terminated
  2. Step-down request is sent to /sys/step-down
  3. New leader election completes before the old leader terminates
  4. Eviction proceeds without service interruption

This behavior is automatic and requires no manual intervention.
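
If you ever want to move the active node yourself before unrelated host maintenance, the same mechanism can be triggered from the CLI. This is a sketch: it assumes your OpenBao version exposes the step-down command and that the session is authenticated with permission on sys/step-down.

# Ask the active node to hand off leadership (verify the command in your version)
kubectl exec -it <active-pod-name> -- bao operator step-down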

Maintenance Windows

For planned maintenance, consider:

  1. Pause reconciliation during complex operations:

    spec:
      paused: true
    

  2. Enable maintenance mode when your cluster enforces managed-resource mutation locks:

    spec:
      maintenance:
        enabled: true
    

When enabled, the Operator annotates the managed Pods and StatefulSet with openbao.org/maintenance=true to support controlled deletions and restarts under strict admission policies (a quick way to verify this is shown at the end of this section).

  3. Trigger a rolling restart (for example, after rotating external dependencies):

    spec:
      maintenance:
        restartAt: "2026-01-19T00:00:00Z"
    

  4. Monitor cluster health before and after:

    kubectl get openbaocluster <name> -o jsonpath='{.status.phase}'
    kubectl get pods -l openbao.org/cluster=<name>
    

  5. Check Raft peer status if needed:

    kubectl exec -it <pod-name> -- bao operator raft list-peers
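
To verify that maintenance mode (item 2 above) took effect, check that the annotation described earlier is present on the managed pods. A minimal check:

# Look for openbao.org/maintenance=true on each managed pod
kubectl get pods -l openbao.org/cluster=<name> -o yaml | grep 'openbao.org/maintenance'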