Runtime and storage warnings
Use this runbook when a runtime or storage warning fires for OpenBao. These alerts point to storage barrier, cache, runtime memory, or mount table changes that need correlation with request latency, logs, and recent deployment activity.
Before you begin
- Get access to Prometheus or the metrics backend that evaluates the alert.
- Get access to the OpenBao runtime and storage dashboard.
- Get access to OpenBao operational logs and audit logs.
- Get access to deployment, platform, and change history for the affected OpenBao cluster.
Confirm the warning
Check which warning fired.
ALERTS{alertname=~"OpenBao(StorageBarrierLatencyElevated|StorageCacheHitRatioLow|RuntimeMemoryGrowth|MountTableGrowth)", alertstate="firing"}
Open the OpenBao runtime and storage dashboard.
Compare the warning time with request latency and token-check latency.
openbao:core_handle_request:avg5m
openbao:core_check_token:avg5m
Check whether Raft, audit, token, or lease alerts fired at the same time.
ALERTS{alertstate="firing", alertname=~"OpenBao.*"}
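If you want to script the two ALERTS checks, the following Python sketch sends both queries to the Prometheus HTTP API instant-query endpoint (/api/v1/query) and lists which other OpenBao alerts fire alongside the warning. It assumes a Prometheus-compatible backend that exposes that endpoint. PROMETHEUS_URL and the helper function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: point this at the backend that evaluates the alert.
PROMETHEUS_URL = "http://prometheus.example:9090"

WARNING_QUERY = (
    'ALERTS{alertname=~"OpenBao(StorageBarrierLatencyElevated|StorageCacheHitRatioLow'
    '|RuntimeMemoryGrowth|MountTableGrowth)", alertstate="firing"}'
)
ALL_OPENBAO_QUERY = 'ALERTS{alertstate="firing", alertname=~"OpenBao.*"}'


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant query and return the decoded JSON response."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)


def firing_alert_names(response: dict) -> list[str]:
    """Return sorted, de-duplicated alertname labels from an instant-vector response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    return sorted({series["metric"]["alertname"] for series in response["data"]["result"]})


def correlated_alerts(warnings: list[str], all_firing: list[str]) -> list[str]:
    """Return OpenBao alerts (Raft, audit, token, lease, ...) firing alongside the warnings."""
    return [name for name in all_firing if name not in warnings]
```

For example, call `firing_alert_names(instant_query(PROMETHEUS_URL, WARNING_QUERY))`, then pass that list and the result of the broader query to `correlated_alerts`.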
Investigate barrier latency
Check barrier operation latency.
openbao:barrier_get:avg5m
openbao:barrier_put:avg5m
openbao:barrier_list:avg5m
openbao:barrier_delete:avg5m
Check barrier operation rates.
openbao:barrier_get:rate5m
openbao:barrier_put:rate5m
openbao:barrier_list:rate5m
openbao:barrier_delete:rate5m
Check HA/Raft health before you tune clients or storage.
openbao:autopilot_healthy:max
openbao:autopilot_node_healthy:min
openbao:raft_peers:max
Search operational logs for storage and Raft symptoms.
{log_stream="openbao.operational"} |~ "(?i)(storage|barrier|raft|autopilot|timeout|slow|error|failed)"
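To see which barrier operation drives the warning, you can compare each operation's current latency and rate with a known-good baseline. This sketch assumes you have exported the avg5m and rate5m values for get, put, list, and delete from the queries above, in whatever unit your recording rules expose. The 2x factor and the BarrierSample type are hypothetical, not documented alert settings.

```python
from dataclasses import dataclass


@dataclass
class BarrierSample:
    """avg5m latency and rate5m throughput for one barrier operation."""

    op: str  # "get", "put", "list", or "delete"
    latency_avg5m: float
    rate5m: float


def elevated_ops(
    current: list[BarrierSample],
    baseline: list[BarrierSample],
    factor: float = 2.0,  # hypothetical threshold
) -> list[tuple[str, float, float]]:
    """Return (op, latency_ratio, rate_ratio) for ops at least `factor` times slower than baseline.

    Results are sorted worst-first. Ops without a usable baseline are skipped.
    """
    base = {s.op: s for s in baseline}
    result = []
    for sample in current:
        ref = base.get(sample.op)
        if ref is None or ref.latency_avg5m <= 0:
            continue
        latency_ratio = sample.latency_avg5m / ref.latency_avg5m
        rate_ratio = sample.rate5m / ref.rate5m if ref.rate5m > 0 else float("inf")
        if latency_ratio >= factor:
            result.append((sample.op, round(latency_ratio, 2), round(rate_ratio, 2)))
    return sorted(result, key=lambda item: item[1], reverse=True)
```

A high latency ratio with a rate ratio near 1.0 suggests slower storage rather than extra traffic, so check HA/Raft health before you tune clients.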
Investigate cache changes
Check cache hit ratio and traffic volume together.
openbao:cache_hit_ratio:ratio5m
openbao:cache_hit:rate5m
openbao:cache_miss:rate5m
Check whether the request mix changed around the same time.
sum by (request_path) ( count_over_time( {log_stream="openbao.audit"} | json request_path="request.path" [15m] ) )
Check for new mounts, remounts, or policy changes in audit logs.
{log_stream="openbao.audit"} | json request_path="request.path" | request_path=~"sys/(mounts|remount|auth|policy|policies).*"
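If you work from an exported audit log file rather than Loki, a Python sketch like the following can compare the request mix in a window before the warning with a window during it. It assumes one JSON object per line carrying the `type` and `request.path` fields of the OpenBao JSON audit format. The function names are hypothetical.

```python
import json
from collections import Counter
from collections.abc import Iterable


def path_counts(lines: Iterable[str]) -> Counter:
    """Count audit request entries by request.path, skipping lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict) or entry.get("type") != "request":
            continue  # each request is also logged as a separate "response" entry
        path = (entry.get("request") or {}).get("path")
        if path:
            counts[path] += 1
    return counts


def mix_shift(before: Counter, after: Counter, top: int = 5) -> list[tuple[str, float]]:
    """Return the paths whose share of total requests grew the most between two windows."""
    before_total = sum(before.values()) or 1
    after_total = sum(after.values()) or 1
    shares = {
        path: after[path] / after_total - before[path] / before_total
        for path in set(before) | set(after)
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)[:top]
```

Comparing shares rather than raw counts keeps a general traffic increase from hiding a change in which paths dominate.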
Investigate runtime memory
Check runtime memory and heap signals.
openbao:runtime_alloc_bytes:max
openbao:runtime_sys_bytes:max
openbao:runtime_heap_objects:max
Check garbage collection signals.
openbao:runtime_gc_pause_ns:avg5m
openbao:runtime_total_gc_runs:max
openbao:runtime_total_gc_pause_ns:max
Compare runtime growth with request throughput and operational logs.
openbao:core_handle_request:rate5m
{log_stream="openbao.operational"} |~ "(?i)(runtime|memory|gc|heap|allocation|error|failed)"
Check container or host memory metrics in your platform monitoring system. OpenBao runtime memory does not show the full container or node memory picture.
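A least-squares slope turns the growth comparison into numbers: memory that climbs while the request rate stays flat deserves a closer look, while memory that tracks throughput may simply reflect load. This sketch assumes (unix_seconds, value) samples exported from openbao:runtime_sys_bytes:max and openbao:core_handle_request:rate5m. Both thresholds are hypothetical and need tuning against your own baseline.

```python
Sample = tuple[float, float]  # (unix_seconds, value)


def slope_per_hour(samples: list[Sample]) -> float:
    """Least-squares slope of the samples, in value units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples need distinct timestamps")
    return num / den * 3600


def growth_without_load(
    memory: list[Sample],
    request_rate: list[Sample],
    min_growth_bytes_per_hour: float = 50e6,  # hypothetical threshold
    max_relative_rate_change: float = 0.1,  # hypothetical: 10% of mean rate per hour
) -> bool:
    """True when memory grows faster than the threshold while request rate stays roughly flat."""
    if slope_per_hour(memory) < min_growth_bytes_per_hour:
        return False
    mean_rate = sum(v for _, v in request_rate) / len(request_rate)
    if mean_rate == 0:
        return True  # memory grows with no traffic at all
    relative_change = abs(slope_per_hour(request_rate)) / mean_rate
    return relative_change <= max_relative_rate_change
```

A True result is a reason to look at GC pause and platform memory together, not proof of a leak.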
Investigate mount table growth
Check mount table entries by bounded labels.
openbao:core_mount_table_num_entries:max
Check mount table size.
openbao:core_mount_table_size:max
Inspect recent mount and auth method changes.
bao secrets list -detailed -address=<openbao_address>
bao auth list -detailed -address=<openbao_address>
<openbao_address>: OpenBao API address for a reachable active node.
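To turn the mount listing into a before-and-after comparison, you can save two JSON snapshots and diff them. This sketch assumes the CLI's JSON output is an object keyed by mount path with a `type` field per mount, as in the Vault-derived CLI; check that against your bao version before relying on it. The same function applies to auth method listings. The function name is hypothetical.

```python
import json


def diff_mounts(before_json: str, after_json: str) -> tuple[list[str], list[str], list[str]]:
    """Compare two mount-list snapshots.

    Assumes each snapshot is a JSON object keyed by mount path (for example
    "secret/") whose values include a "type" field. Returns (added, removed,
    retyped), where retyped lists paths that still exist but changed type.
    """
    before = json.loads(before_json)
    after = json.loads(after_json)
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    retyped = sorted(
        path
        for path in set(before) & set(after)
        if before[path].get("type") != after[path].get("type")
    )
    return added, removed, retyped
```

For example, save the output of `bao secrets list -format=json -address=<openbao_address>` before and after the change window and pass both files' contents to `diff_mounts`, assuming your bao build supports the `-format=json` output flag.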
Check audit logs for mount and auth configuration changes.
{log_stream="openbao.audit"} | json request_path="request.path" | request_path=~"sys/(mounts|auth).*"
Restore the baseline
If latency correlates with Raft health, use the HA/Raft runbook before you change clients or secret engines.
If the cache hit ratio changed because of a planned workload or mount change, record the expected baseline and alert duration in your change record.
If runtime memory grows with request latency or platform memory pressure, reduce the triggering workload or roll back the related deployment change.
If mount table growth is unplanned, identify the change owner before you disable mounts, auth methods, plugins, or policies.
If operational logs show storage backend errors, restore the storage backend or platform dependency first.
Verify the result
Confirm that request latency returns toward baseline.
openbao:core_handle_request:avg5m
Confirm that the warning-specific signal returns toward baseline.
openbao:barrier_get:avg5m
openbao:cache_hit_ratio:ratio5m
openbao:runtime_sys_bytes:max
sum(openbao:core_mount_table_num_entries:max)
Confirm that operational logs no longer show correlated storage, cache, runtime, or mount errors.
{log_stream="openbao.operational"} |~ "(?i)(storage|barrier|cache|runtime|memory|mount)" |~ "(?i)(error|failed|timeout)"
Wait for the alert window to pass and confirm that the warning resolves.
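"Returns toward baseline" can be made concrete by checking both distance and direction. The sketch below classifies a short series of samples for any of the signals above against a baseline you recorded earlier. The function name and the 20% tolerance are hypothetical, and a baseline of zero counts only exact matches as recovered.

```python
def recovery_status(series: list[float], baseline: float, tolerance: float = 0.2) -> str:
    """Classify recovery of a signal relative to a recorded baseline.

    Returns "recovered" when the latest sample is within tolerance * |baseline|,
    "improving" when it is closer to the baseline than the first sample, and
    "not improving" otherwise.
    """
    if not series:
        raise ValueError("series is empty")
    first_gap = abs(series[0] - baseline)
    latest_gap = abs(series[-1] - baseline)
    if latest_gap <= tolerance * abs(baseline):
        return "recovered"
    if latest_gap < first_gap:
        return "improving"
    return "not improving"
```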
Troubleshooting
The alert fires after a planned change
Record the new baseline and expected duration. Silence the alert only for the approved change window.
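If you silence through Alertmanager, scoping the silence to one alert name and a fixed window keeps it inside the approved change window. This sketch builds a request body for the Alertmanager v2 silences API (POST /api/v2/silences). The label names, change ID, comment wording, and function names are hypothetical; match them to your own alert labels and change process.

```python
from datetime import datetime, timedelta, timezone


def _utc(dt: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def silence_payload(
    alertname: str,
    labels: dict[str, str],
    start: datetime,
    duration: timedelta,
    created_by: str,
    change_id: str,
) -> dict:
    """Build an Alertmanager v2 silence body limited to one alert and one change window."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    matchers = [{"name": "alertname", "value": alertname, "isRegex": False}]
    # Extra exact-match matchers, for example a hypothetical cluster label.
    matchers += [
        {"name": name, "value": value, "isRegex": False}
        for name, value in sorted(labels.items())
    ]
    return {
        "matchers": matchers,
        "startsAt": _utc(start),
        "endsAt": _utc(start + duration),
        "createdBy": created_by,
        "comment": f"Approved change {change_id}: recording new baseline",
    }
```

Send the dictionary as JSON to `<alertmanager_url>/api/v2/silences`, or create the equivalent silence with amtool.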
Metrics are empty
Confirm that the generated recording rules are loaded and that Prometheus scrapes OpenBao source metrics with the expected vault_* or openbao_* prefix.
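To narrow down which half of the pipeline is missing, you can compare the metric names Prometheus knows about with the recording rules you expect. This sketch assumes you have already fetched the name list, for example from the Prometheus /api/v1/label/__name__/values endpoint. The function name and the status codes it returns are hypothetical labels for the two cases described above.

```python
def diagnose_missing_metrics(
    metric_names: list[str], expected_rules: list[str]
) -> tuple[str, list[str]]:
    """Classify why recording-rule metrics might be empty.

    Returns ("no_source_metrics", missing) when no vault_* or openbao_* series
    exist, ("rules_missing", missing) when source series exist but some
    expected recording rules do not, and ("ok", []) otherwise.
    """
    names = set(metric_names)
    # Recording rules use colons (openbao:...), so they never match these prefixes.
    has_source = any(n.startswith(("vault_", "openbao_")) for n in names)
    missing = sorted(set(expected_rules) - names)
    if not has_source:
        return "no_source_metrics", missing
    if missing:
        return "rules_missing", missing
    return "ok", []
```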
Runtime memory grows but platform memory is stable
Check whether Go runtime memory has stabilized at a higher allocation target. Use platform memory, request latency, and GC pause together before you treat the warning as a leak.
Cache hit ratio is low in a quiet cluster
The alert requires active cache traffic. If it fires in a quiet cluster, check recording rule freshness and scrape timestamps.
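The ratio itself shows why traffic matters: with only a handful of lookups, a few misses can swing it sharply. This sketch assumes openbao:cache_hit_ratio:ratio5m is computed as hits / (hits + misses) from the two rate5m series. The function name and the one-lookup-per-second guard are hypothetical, not the alert's actual condition.

```python
from typing import Optional


def cache_hit_ratio(
    hit_rate: float, miss_rate: float, min_lookups_per_s: float = 1.0
) -> Optional[float]:
    """Return hits / (hits + misses), or None when traffic is too low to judge.

    hit_rate and miss_rate are per-second rates, such as the values of
    openbao:cache_hit:rate5m and openbao:cache_miss:rate5m.
    """
    if hit_rate < 0 or miss_rate < 0:
        raise ValueError("rates must be non-negative")
    total = hit_rate + miss_rate
    if total < min_lookups_per_s:
        return None  # too little traffic: a low ratio here is not meaningful
    return hit_rate / total
```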
What’s next
- Use the OpenBao runtime and storage dashboard to inspect the correlated signals.
- Use the OpenBao HA/Raft dashboard when storage symptoms correlate with Raft health.
- Use the OpenBao secret engines and mounts dashboard when mount table growth needs audit context.
- Use OpenBao Raft and Autopilot health if a Raft alert fires with the storage warning.
Source: OpenBao documents telemetry metric behavior in the OpenBao telemetry metrics overview. OpenBao documents runtime, barrier, cache, and mount table metric names in the OpenBao telemetry metrics reference.