Kubernetes platform health

Use this runbook when a Kubernetes platform alert fires for an OpenBao workload. These alerts cover pod readiness, container restarts, node pressure, PVC pressure, and collector scrape health.

Before you begin

Get access to Prometheus or the metrics backend that evaluates the alert.
Get access to the OpenBao Kubernetes platform dashboard.
Get access to Kubernetes for the affected OpenBao namespace.
Get access to OpenBao operational logs, audit logs, and change history.

Confirm the alert

Check which platform alert fired.

ALERTS{alertstate="firing", alertname=~"OpenBao(AllPodsUnavailable|AuditPVCPressure|KubernetesPodNotReady|KubernetesPodRestartsIncreasing|KubernetesNodePressure|LogCollectorTargetDown)"}

Open the OpenBao Kubernetes platform dashboard.
Confirm the dashboard variables match the affected namespace, pod, container, PVC, node, and scrape job.

Check whether OpenBao workload alerts fired at the same time.

ALERTS{alertstate="firing", alertname=~"OpenBao(SealedUnexpectedly|NoActiveNode|MultipleActiveNodes|Autopilot.*|Raft.*|Audit.*)"}
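The two `ALERTS` queries above can also be run from a shell. The following is a minimal sketch, assuming the metrics backend exposes the Prometheus-compatible `/api/v1/query` endpoint without authentication. `PROM_URL`, `prom_query`, and `firing_alerts_query` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: run an instant PromQL query against a
# Prometheus-compatible HTTP API. PROM_URL is an assumption; point it at
# the backend that evaluates the alert.
prom_query() {
  curl -fsS --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Build the firing-alert selector from a pipe-separated list of
# alert name suffixes (the part after "OpenBao").
firing_alerts_query() {
  printf 'ALERTS{alertstate="firing", alertname=~"OpenBao(%s)"}' "$1"
}

# Usage (not run here):
#   prom_query "$(firing_alerts_query 'AllPodsUnavailable|AuditPVCPressure')"
```

Passing the query through `--data-urlencode` avoids hand-escaping the braces and quotes in the selector.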

Investigate pod availability

Check OpenBao container readiness by pod.

min by (namespace, pod) (
  kube_pod_container_status_ready{pod=~"openbao.*",container=~"openbao|bao"}
)

Check pod phase and recent restarts.

max by (namespace, pod, phase) (
  kube_pod_status_phase{pod=~"openbao.*"}
)

sum by (namespace, pod) (
  increase(kube_pod_container_status_restarts_total{pod=~"openbao.*",container=~"openbao|bao"}[15m])
)

Inspect Kubernetes status for the affected pods.

kubectl -n <namespace> get pods -l app.kubernetes.io/name=openbao -o wide
kubectl -n <namespace> describe pod <pod>

Compare with OpenBao status and logs.

bao status -address=<openbao_address>

{log_stream="openbao.operational", namespace="<namespace>", pod="<pod>"}

Investigate PVC pressure

Check PVC free percentage.

100 *
min by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*(openbao|bao|audit|data).*"}
)
/
min by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*(openbao|bao|audit|data).*"}
)
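As a worked example of the ratio above, the sketch below computes the free percentage for a single PVC from its available and capacity bytes. `pvc_free_percent` is a hypothetical helper, and the byte values are illustrative, not taken from a real cluster.

```shell
# Free percentage for one PVC: 100 * available_bytes / capacity_bytes.
# Prints one decimal place, or "n/a" when capacity is zero or missing.
pvc_free_percent() {
  awk -v avail="$1" -v cap="$2" 'BEGIN {
    if (cap + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * avail / cap
  }'
}

# Example: 2 GiB available on a 10 GiB volume.
pvc_free_percent 2147483648 10737418240   # prints 20.0
```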

Identify whether the affected PVC stores audit logs, Raft data, or another OpenBao path.
```
kubectl -n <namespace> get pvc
kubectl -n <namespace> describe pvc <pvc>
```

Check audit-device health and archive delivery before you rotate or move audit files.

openbao:audit_log_request_failure:increase5m
openbao:audit_log_response_failure:increase5m
openbao_audit_archive_delivery_success

Check whether the archive path is healthy enough to tolerate local cleanup.

time() - max(openbao_audit_archive_last_success_timestamp_seconds)
sum(increase(openbao_audit_archive_delivery_failures_total[15m]))
sum(increase(openbao_audit_archive_dead_letter_records_total[15m]))
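One way to turn the three values above into a go or no-go check is sketched below. It assumes you have already read the age of the last successful delivery, the failure increase, and the dead-letter increase from the queries. `archive_healthy` and its 900-second default are hypothetical; this runbook does not define a threshold, so use the value your security team has approved.

```shell
# Hypothetical gate: treat the archive path as healthy only when the last
# successful delivery is recent and the window shows no failures and no
# dead-lettered records. The 900-second default is an assumed threshold.
# Arguments: age_seconds failures dead_letters [max_age_seconds]
archive_healthy() {
  awk -v a="$1" -v f="$2" -v d="$3" -v m="${4:-900}" \
    'BEGIN { exit !(a + 0 <= m + 0 && f + 0 == 0 && d + 0 == 0) }'
}

# Usage:
#   archive_healthy 120 0 0 && echo "archive path looks healthy"
```

The function returns a non-zero status when any of the three signals is unhealthy, so it can guard a cleanup step in a script.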

Investigate node pressure

Check active node pressure conditions. A value of 1 means the condition is currently true on that node.

max by (node, condition) (
  kube_node_status_condition{condition=~"DiskPressure|MemoryPressure|PIDPressure",status="true"}
)
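If the metrics backend is unavailable, the same conditions can be read from the Kubernetes API. The sketch below assumes `kubectl` access to node objects. `node_conditions` and `pressured_nodes` are hypothetical helper names.

```shell
# Print one line per node in the form
#   <node>:<Type>=<Status>;<Type>=<Status>;...
node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{"\n"}{end}'
}

# Read node_conditions output on stdin and print only the names of nodes
# that report an active disk, memory, or PID pressure condition.
pressured_nodes() {
  awk -F: '/(Disk|Memory|PID)Pressure=True/ { print $1 }'
}

# Usage (not run here):
#   node_conditions | pressured_nodes
```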

Locate affected OpenBao pods on pressured nodes.

kubectl -n <namespace> get pods -l app.kubernetes.io/name=openbao -o wide
kubectl describe node <node>

Check whether the affected node also hosts Prometheus, Loki, Alloy, storage, or other dependencies for the OpenBao observability path.
If node pressure coincides with Raft or storage symptoms, use the HA/Raft runbook before you restart multiple OpenBao pods.

Investigate collector health

Check collector scrape health.

up{job=~".*(alloy|grafana-alloy|promtail).*"}

Check OpenBao operational and audit stream presence.

{log_stream="openbao.operational"}

{log_stream="openbao.audit"}

Inspect collector pods and logs.

kubectl -n <collector_namespace> get pods
kubectl -n <collector_namespace> logs <collector_pod>

Check whether the audit archive path is affected.
```
openbao_audit_archive_delivery_success
```

Restore the baseline

If pods are unavailable because of a rollout, pause further rollout steps and restore one OpenBao pod at a time.
If OpenBao is sealed, leaderless, or Raft-unhealthy, use the OpenBao workload runbook before you change Kubernetes scheduling.
If a PVC is low, expand the volume or restore log rotation and archive delivery. Do not delete audit files until the security evidence path is approved.
If node pressure is active, move unrelated workloads, restore node capacity, or drain only after you confirm OpenBao quorum and failure tolerance.
If the collector is down, restore the collector, credentials, file mounts, and network path. Verify audit and operational streams after the collector recovers.

Verify the result

Confirm OpenBao workload containers are ready.

min by (namespace, pod) (
  kube_pod_container_status_ready{pod=~"openbao.*",container=~"openbao|bao"}
)

Confirm that the restart increase over the last 15 minutes returns to zero.

sum by (namespace, pod) (
  increase(kube_pod_container_status_restarts_total{pod=~"openbao.*",container=~"openbao|bao"}[15m])
)

Confirm that free space on each PVC is above the local threshold.

100 *
min by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*(openbao|bao|audit|data).*"}
)
/
min by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*(openbao|bao|audit|data).*"}
)

Confirm collector scrape targets are up.

up{job=~".*(alloy|grafana-alloy|promtail).*"}

Wait for the alert window to pass and confirm the alert resolves.
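Waiting for the alert to resolve can be scripted as a simple polling loop. The sketch below is generic: `wait_until` is a hypothetical helper that retries any check command until it succeeds or a timeout expires. The check itself, named `alert_cleared` in the usage comment, is something you would write for your own metrics backend.

```shell
# Poll a check command until it succeeds or the timeout expires.
# Arguments: timeout_seconds interval_seconds command [args...]
# Returns 0 when the check succeeds, 1 when the timeout is reached.
wait_until() {
  timeout=$1 interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage (not run here; alert_cleared is a hypothetical check you define):
#   wait_until 900 30 alert_cleared || echo "alert still firing"
```

Choose a timeout at least as long as the alert's evaluation window so that the loop does not give up before the alert can resolve.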

Troubleshooting

Metrics are empty

Confirm that kube-state-metrics, kubelet or cAdvisor metrics, and PVC metrics are scraped by the same metrics backend that evaluates the generated rules. Some Kubernetes distributions rename jobs, restrict kubelet metrics, or do not expose PVC volume stats for every storage driver.

The pod selector misses OpenBao pods

Adjust the dashboard variables and alert selector to match your operator or Helm labels. Keep selectors bounded to the OpenBao workload.

The collector target is up but logs are missing

Collector scrape health only proves that Prometheus can scrape the collector. Check collector pipeline errors, Loki write errors, file permissions, and audit device paths.

PVC pressure fires on a data PVC

Treat data PVC pressure as a storage and availability issue. Treat audit PVC pressure as both a security and availability issue because audit-device write failures can affect OpenBao requests.

What’s next

Use the OpenBao Kubernetes platform dashboard to inspect platform context.
Use OpenBao Raft and Autopilot health when pod or node symptoms overlap with HA/Raft alerts.
Use Audit request and response failures when PVC or collector symptoms affect audit writes.
Use Audit archive degraded when durable archive delivery is stale, failing, or dead-lettering.
Use the OpenBao Operator companion profile to keep operator control-plane alerts separate from OpenBao workload alerts.

Source: Kubernetes documents kube-state-metrics as an add-on for Kubernetes object state in the Kubernetes kube-state-metrics documentation. The kube-state-metrics project documents pod readiness and restart metrics in its pod metric reference, and node pressure condition metrics in its node metric reference. Kubernetes documents kubelet node, pod, container, and volume metrics in Node metrics data.